Multimodal Guardrails

Summary: Multimodal guardrails control the inputs and outputs of foundation models so that they meet specific requirements such as user needs, ethical standards, and laws.

Context: An agent consists of a foundation model and other components. When users prompt the agent with specific goals, the underlying foundation model is queried to achieve them.

Problem: How can the foundation model be prevented from being influenced by adversarial inputs, or from generating harmful or undesirable outputs for users and other components?

Forces:

  • Robustness. Adversarial information may be sent to the foundation model, which will affect the model’s memory and all subsequent reasoning processes and results.
  • Safety. Foundation models may generate inappropriate responses due to hallucinations, which can be offensive to users, and disturb the operation of other components (e.g., other agents, external tools).
  • Standard alignment. Agents and the underlying foundation models should align with the specific standards and requirements in industries and organisations.

Solution: Fig. 1 presents a simplified graphical representation of multimodal guardrails. Guardrails can be applied as an intermediate layer between the foundation model and all other components in a compound AI system. When users send prompts, or when other components (e.g., memory) transfer any message to the foundation model, the guardrails first verify whether the information meets specific predefined requirements; only valid information is delivered to the foundation model. For instance, personally identifiable information should be treated with care or removed to protect privacy. Guardrails can evaluate the content either by relying on predefined examples or in a “reference-free” manner. Conversely, when the foundation model produces results, the guardrails need to ensure that the responses do not include biased or disrespectful information for users, and that they fulfil the particular requirements of other system components. Please note that a set of guardrails can be implemented where each of them is responsible for a specialised interaction, e.g., information retrieval from the datastore, validation of users’ input, external API invocation, etc. Meanwhile, guardrails are capable of processing multimodal data such as text, audio, and video to provide comprehensive monitoring and control.

Figure 1. Multimodal guardrails.
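
The minimal Python sketch below illustrates the solution: a guardrail layer intercepts both the incoming prompt and the model’s response before they reach their destinations. All names here (`input_guardrail`, `output_guardrail`, `guarded_query`, the PII regex, and the blocked-terms list) are hypothetical placeholders for whatever validators and foundation model a concrete system uses.

```python
import re
from dataclasses import dataclass

# Hypothetical PII pattern (email addresses only, for brevity).
PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Hypothetical terms the output guardrail refuses to pass on.
BLOCKED_TERMS = {"credit card number", "social security number"}


@dataclass
class GuardrailResult:
    allowed: bool
    content: str
    reason: str = ""


def input_guardrail(prompt: str) -> GuardrailResult:
    """Validate and sanitise a prompt before it reaches the foundation model."""
    # Redact PII rather than rejecting the whole prompt.
    sanitised = PII_PATTERN.sub("[REDACTED]", prompt)
    return GuardrailResult(allowed=True, content=sanitised)


def output_guardrail(response: str) -> GuardrailResult:
    """Check the model's response before it is returned to users or other components."""
    lowered = response.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return GuardrailResult(allowed=False, content="", reason=f"blocked term: {term}")
    return GuardrailResult(allowed=True, content=response)


def guarded_query(prompt: str, query_model) -> str:
    """Route a prompt through input and output guardrails around the foundation model."""
    checked_input = input_guardrail(prompt)
    if not checked_input.allowed:
        return "Request rejected by input guardrail."
    # query_model is a placeholder for any call to the underlying foundation model.
    raw_response = query_model(checked_input.content)
    checked_output = output_guardrail(raw_response)
    if not checked_output.allowed:
        return f"Response withheld by output guardrail ({checked_output.reason})."
    return checked_output.content
```

In a compound AI system, separate guardrail instances of this shape could be attached to each interaction point (memory retrieval, tool invocation, user I/O), with the simple text checks replaced by multimodal classifiers for audio, image, or video content.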

Benefits:

  • Robustness. Guardrails preserve the robustness of foundation models by filtering out inappropriate context information.
  • Safety. Guardrails serve as validators of foundation model outcomes, ensuring the generated responses do not harm agent users.
  • Standard alignment. Guardrails can be configured with reference to organisational policies and strategies, ethical standards, and legal requirements to regulate the behaviour of foundation models.
  • Adaptability. Guardrails can be implemented across various foundation models and agents, and deployed with customised requirements.

Drawbacks:

  • Overhead. i) Collecting a diverse and high-quality corpus to develop multimodal guardrails may be resource-intensive. ii) Processing multimodal data in real time can increase computational requirements and costs.
  • Lack of explainability. The complexity of multimodal guardrails makes it difficult to explain how the final outputs are determined.

Known uses:

  • NeMo Guardrails [1]. NVIDIA released NeMo Guardrails, a toolkit specifically designed to keep dialogues between users and AI systems coherent and to prevent the negative impact of misinformation and sensitive topics (see the usage sketch after this list).
  • Llama Guard [2]. Meta published Llama Guard, a foundation-model-based safeguard fine-tuned on a safety risk taxonomy. Llama Guard can identify potentially risky or policy-violating content in users’ prompts and model outputs.
  • Guardrails AI. Guardrails AI provides a hub, listing various validators for handling different risks in the inputs and outputs of foundation models.
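
As an illustration of how such guardrails are wired into an application, the sketch below follows the documented Python interface of NeMo Guardrails [1]. The configuration directory and its contents are assumed to exist, and the exact API may differ across toolkit versions.

```python
from nemoguardrails import LLMRails, RailsConfig

# Load a rails configuration (YAML settings plus Colang flows) from a local directory.
# "./guardrails_config" is an assumed path; it must define the model and the rails.
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# The guardrails layer validates the user message, queries the configured
# foundation model, and checks the response before returning it.
response = rails.generate(messages=[
    {"role": "user", "content": "Can you summarise our refund policy?"}
])
print(response["content"])
```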

Related patterns:

References:

[1] T. Rebedea, R. Dinu, M. Sreedhar, C. Parisien, and J. Cohen, “NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails,” arXiv preprint arXiv:2310.10501, 2023.

[2] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine et al., “Llama Guard: LLM-based input-output safeguard for human-AI conversations,” arXiv preprint arXiv:2312.06674, 2023.