A Layered Approach to AI Safety: From Core Model to User Experience
The deployment of Large Language Models into real-world applications requires a robust, multi-layered safety architecture. A comprehensive safety strategy is not a single feature but a defense-in-depth system that integrates protections at every stage, from the model's initial training to the end-user's interaction. This framework can be broken down into three fundamental layers: the Core Model, the configurable Safety System, and the Application Layer.
Core Model
The first layer of safety is built directly into the core model itself during its development, training and fine-tuning. This process is designed to make helpful and harmless behavior the model's default state. It involves several key stages:
Pre-training Data Curation: Before training begins, enormous datasets of text (and images) are processed and filtered to remove unsafe content and to reduce or modify undesirable patterns. This initial filtering is the first and broadest step in shaping the model's foundational knowledge; a minimal sketch of such a filter appears just after this list.
Instruction Fine-Tuning and RLHF: After pre-training, the model undergoes extensive fine-tuning. A key part of this is Reinforcement Learning from Human Feedback (RLHF), during which human reviewers rank different model responses by safety and helpfulness. This preference data is used to train a "reward model" that learns to score outputs according to human values (a minimal sketch of this objective appears at the end of this subsection). The LLM is then optimized to generate responses that maximize this reward score, steering its behavior toward safe and useful outputs. This is also where the model learns to handle adversarial inputs and to perform "safe refusals": declining harmful requests becomes part of its learned behavior.
Red Teaming and Model Evaluations: Throughout its development, the model is subjected to rigorous adversarial testing, or "red teaming." Specialized teams and automated systems attempt to "jailbreak" the model by finding prompts that elicit harmful or unintended responses. The data from these tests is used to identify vulnerabilities and further refine the model's safety training. The model is also evaluated against standardized internal and external benchmarks for risks and content harms.
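To make the data-curation stage above concrete, here is a minimal sketch of a filtering pass in Python. The unsafe_score heuristic, the blocklist, and the threshold are placeholders for the ensembles of classifiers, blocklists, and human review that production pipelines actually combine.

```python
from typing import Iterable, Iterator

# Placeholder blocklist and cutoff; real pipelines tune these per harm category.
BLOCKED_TERMS = {"example_disallowed_term"}
UNSAFE_THRESHOLD = 0.8


def unsafe_score(document: str) -> float:
    """Toy stand-in for a safety classifier: fraction of tokens on the blocklist."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    flagged = sum(token in BLOCKED_TERMS for token in tokens)
    return flagged / len(tokens)


def curate(corpus: Iterable[str]) -> Iterator[str]:
    """Yield only documents whose estimated risk falls below the threshold."""
    for document in corpus:
        if unsafe_score(document) < UNSAFE_THRESHOLD:
            yield document
```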
This first layer creates a model with strong, generalized safety capabilities.
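The reward-model step at the heart of RLHF can also be sketched. The code below, assuming PyTorch and a pairwise (Bradley–Terry style) preference loss, trains a scalar reward head to score the human-preferred response above the rejected one; the random tensors stand in for pooled LLM hidden states, and the policy-optimization step that follows (e.g., PPO) is omitted.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-ins: in practice these are representations of the human-preferred
# ("chosen") and dispreferred ("rejected") responses from the LLM.
EMBED_DIM = 16
chosen_repr = torch.randn(8, EMBED_DIM)    # batch of 8 preferred responses
rejected_repr = torch.randn(8, EMBED_DIM)  # the corresponding rejected responses

# Reward model: a scalar "preference score" head on top of the representation.
reward_head = nn.Linear(EMBED_DIM, 1)
optimizer = torch.optim.Adam(reward_head.parameters(), lr=1e-3)

for _ in range(100):
    chosen_reward = reward_head(chosen_repr).squeeze(-1)
    rejected_reward = reward_head(rejected_repr).squeeze(-1)

    # Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), which pushes
    # the reward model to rank the preferred response higher than the rejected one.
    loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The trained reward model then scores candidate generations during reinforcement learning, steering the policy toward responses that humans judged safe and helpful.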
Safety System
While the core model has built-in safety, different applications have vastly different risk profiles. The second layer is a flexible, configurable set of tools that sits on top of the core model, allowing developers to tailor safety policies to their specific use case.
A safety system can be integrated into the platform services that serve foundation models, running automatically alongside the core model, or it can operate as a separate service or API that intercepts both the user's prompt (input) and the model's generated response (output) before it reaches the end-user. Either way, it gives developers granular control over:
Risk Categories and Thresholds: Developers can configure the sensitivity for various categories of harm (e.g., hate, violence, self-harm). For a given category, they can set a risk threshold (e.g., low, medium, high) that aligns with their application's tolerance. For example, a social media application might have a strict safety system policy for violence, while a tool for fictional writing might have a higher tolerance for content that would otherwise be flagged.
Response Behaviors: Based on the detected risk level, developers can define the specific action to be taken (see the sketch after this list). Common configurations include:
Hard Block: The response is blocked entirely, and a pre-defined, safe message is shown to the user instead. This is typically used for high-severity violations.
Flag for Review: The response is sent to the user, but the interaction is flagged in a backend system. This is useful for monitoring borderline cases and incident response.
Human-in-the-Loop (HITL): In high-stakes environments, such as healthcare or finance, a potentially risky output can be automatically routed to a human expert for review and approval before it is sent to the end-user. This provides a critical checkpoint before delivery.
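As a rough sketch of how this configurable layer might be wired, the code below defines per-category thresholds and maps detected risk to the response behaviors listed above. The category names, threshold values, score_content classifier, and log_for_review hook are illustrative assumptions, not any particular platform's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Action(Enum):
    ALLOW = "allow"
    FLAG_FOR_REVIEW = "flag_for_review"  # or hold for a human (HITL) in high-stakes apps
    HARD_BLOCK = "hard_block"


@dataclass
class CategoryPolicy:
    block_at: float   # risk score at or above which content is blocked outright
    review_at: float  # risk score at or above which the interaction is flagged


# Illustrative per-category thresholds: a social app might keep "violence" strict,
# while a fiction-writing tool might raise it.
POLICIES: Dict[str, CategoryPolicy] = {
    "hate": CategoryPolicy(block_at=0.7, review_at=0.4),
    "violence": CategoryPolicy(block_at=0.8, review_at=0.5),
    "self_harm": CategoryPolicy(block_at=0.5, review_at=0.2),
}

BLOCK_MESSAGE = "Sorry, I can't help with that."


def decide(scores: Dict[str, float]) -> Action:
    """Map per-category risk scores to the strictest applicable action."""
    action = Action.ALLOW
    for category, score in scores.items():
        policy = POLICIES.get(category)
        if policy is None:
            continue
        if score >= policy.block_at:
            return Action.HARD_BLOCK
        if score >= policy.review_at:
            action = Action.FLAG_FOR_REVIEW
    return action


def log_for_review(text: str) -> None:
    """Placeholder for recording a flagged interaction in a backend review queue."""
    print(f"flagged for review: {text[:60]!r}")


def moderate(text: str, score_content: Callable[[str], Dict[str, float]]) -> str:
    """Apply the configured policy to a user prompt or a model response."""
    action = decide(score_content(text))
    if action is Action.HARD_BLOCK:
        return BLOCK_MESSAGE
    if action is Action.FLAG_FOR_REVIEW:
        log_for_review(text)
    return text
```

The same moderate call can wrap both the incoming prompt and the outgoing response, which is how this layer intercepts content in both directions.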
This configurable layer is essential for responsible deployment, as it moves safety from a static model feature to a dynamic, adaptable system that empowers developers to meet their use-case-specific compliance, legal, and safety requirements.
Application Layer
The final layer of safety resides in the application itself: the user interface and user experience through which the end-user interacts with the AI. Responsible design includes transparently communicating the system's capabilities and limitations to the user.
Key elements of a safe application layer include:
Clear Disclosure: The interface should always make it explicit that the user is interacting with an AI system. This prevents deception and helps set appropriate user expectations.
Source Attribution and Fact-Checking: When a model provides factual information, a safe UX might include citations, links to sources, or prompts encouraging the user to verify critical information.
Explanation of Limitations: Using disclaimers and contextual guidance to inform users about the model's limitations (e.g., "This AI may produce inaccurate information" or "This is not a substitute for professional medical advice") is a crucial part of mitigating real-world harm.
Feedback Mechanisms: Providing users with simple, intuitive tools to report inaccurate, biased, or otherwise harmful outputs is vital. This user-generated feedback is an invaluable source of data for improving the safety of all layers over time (one possible shape for such a mechanism is sketched after this list).
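As one possible shape for a feedback mechanism, the sketch below defines a minimal report structure and a placeholder submission hook; the field names, issue categories, and submit_feedback function are illustrative assumptions rather than any specific product's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class IssueType(Enum):
    INACCURATE = "inaccurate"
    BIASED = "biased"
    HARMFUL = "harmful"
    OTHER = "other"


@dataclass
class FeedbackReport:
    conversation_id: str
    message_id: str
    issue: IssueType
    user_comment: str = ""
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def submit_feedback(report: FeedbackReport) -> None:
    """Placeholder: forward the report to a backend queue for triage.

    In aggregate, these reports feed back into evaluation and training data
    for the other safety layers.
    """
    print(f"feedback received: {report.issue.value} on {report.message_id}")


# Example: a "report" button in the UI would build and submit a report like this.
submit_feedback(
    FeedbackReport(
        conversation_id="conv-123",
        message_id="msg-456",
        issue=IssueType.INACCURATE,
        user_comment="The cited statistic appears to be wrong.",
    )
)
```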
By combining baked-in model safety, a highly configurable policy layer, and a transparent user experience, organizations can build a robust, defense-in-depth safety architecture. This layered approach is fundamental to deploying powerful and trustworthy AI systems and earning user trust.