Responsible Scaling Policy — Orange Pill Wiki
CONCEPT

Responsible Scaling Policy

Anthropic's framework of capability thresholds (AI Safety Levels, analogous to biosafety levels) specifying the safety measures required before deployment at each level, designed to build the governance framework before the harm rather than after.

The Responsible Scaling Policy (RSP) is Amodei's attempt to build a prospective governance framework for AI deployment — establishing the institutional structures for managing risk at each level of capability before the capability is achieved. Structured around a series of capability thresholds called AI Safety Levels, analogous to biosafety levels in pathogen research, the framework specifies the safety measures that must be in place at each level before a system can be deployed at scale. The RSP embodies three principles that distinguish it from typical technology governance: capability and safety are evaluated together, evaluation is prospective rather than reactive, and the framework is binding on the organization rather than advisory. The framework addresses risks including biological and chemical weapons assistance, cyberattack capability, autonomous behavior, and harmful outputs.

In the AI Story


The history of powerful technologies is a history of missing frameworks. Nuclear energy arrived before its regulatory infrastructure. The automobile arrived before traffic laws and seat belts. The internet arrived before privacy law. In each case, the technology arrived first, the consequences arrived second, and governance frameworks arrived third — too late to prevent the harms earlier governance could have mitigated. Amodei studied this pattern and concluded that the AI industry was positioned to repeat it unless the industry itself took the initiative to build frameworks in advance of the harms they were designed to prevent. As Edo Segal observes, governance arrives eighteen months after the tools it was meant to govern — and for AI, the lag would be measured in years.

The biosafety analogy is deliberate and illuminating. In biological research, the level of containment required for working with a pathogen is determined by its characteristics: transmissibility, virulence, available treatments, consequences of accidental release. The containment level is proportional to the risk, and the determination of the risk precedes the work rather than following the consequences. This prospective approach to risk management was precisely what Amodei wanted to establish for AI development — a framework in which safety measures required for working with a model were determined by the model's capabilities, assessed before deployment, not by the consequences of deployment, assessed after the harm had occurred.

The practical implementation required organizational discipline most technology companies would find unfamiliar. Before deploying a new model, Anthropic's safety team conducts evaluations that test capabilities against thresholds defined in the framework. If evaluations reveal the system has crossed into a new capability tier, additional safety measures are required before deployment proceeds. These might include additional red-teaming, monitoring infrastructure, usage restrictions, or, in extreme cases, a decision not to deploy until safety measures are adequate. The decision not to deploy is the hardest because it costs the most: a system ready from a capability perspective but not from a safety perspective is a system competitors would deploy first.
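
The gating logic described above can be sketched in a few lines. The Python below is an illustration only: the capability categories, threshold values, safeguard names, and level label are invented for this example and are not Anthropic's actual ASL evaluation criteria. The point it shows is that the deployment decision is keyed to evaluated capability, not to observed harm.

    from dataclasses import dataclass, field

    # Illustrative thresholds and safeguards; names and numbers are hypothetical.
    THRESHOLDS = {
        "ASL-3": {"bio_uplift": 0.5, "cyber_autonomy": 0.5},
    }
    REQUIRED_SAFEGUARDS = {
        "ASL-3": {"enhanced_red_teaming", "deployment_monitoring", "usage_restrictions"},
    }

    @dataclass
    class EvaluationReport:
        scores: dict                                   # capability category -> evaluated score in [0, 1]
        safeguards_in_place: set = field(default_factory=set)

    def deployment_decision(report: EvaluationReport) -> str:
        """Gate deployment on capability thresholds assessed before release."""
        for level, thresholds in THRESHOLDS.items():
            crossed = any(report.scores.get(cat, 0.0) >= t for cat, t in thresholds.items())
            if crossed:
                missing = REQUIRED_SAFEGUARDS[level] - report.safeguards_in_place
                if missing:
                    # Capability-ready but not safety-ready: hold the release.
                    return f"HOLD: {level} threshold crossed; missing safeguards: {sorted(missing)}"
                return f"DEPLOY under {level} safeguards"
        return "DEPLOY under baseline safeguards"

    # Example: a model crossing the illustrative ASL-3 threshold without the
    # required safeguards is held back, regardless of commercial readiness.
    report = EvaluationReport(scores={"bio_uplift": 0.6, "cyber_autonomy": 0.2},
                              safeguards_in_place={"deployment_monitoring"})
    print(deployment_decision(report))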

The framework also required investment in evaluation methods that did not yet exist. Many risks were theoretical rather than observed. No AI system had yet been used to develop a biological weapon or conduct an autonomous cyberattack. The evaluations had to test for capabilities that might enable these outcomes, requiring a combination of technical expertise, adversarial creativity, and willingness to imagine scenarios the designers hoped would never materialize. The red-teaming process was itself a form of safety research, producing knowledge about potential failure modes that informed both deployment decisions and the design of subsequent systems.

Origin

The RSP was published by Anthropic in September 2023 as version 1.0 and has been iteratively revised since. It represented the first public commitment of its kind by a frontier AI lab and served as a template that other organizations — OpenAI, Google DeepMind — subsequently adapted for their own frameworks.

Amodei designed the framework to be both binding on Anthropic and replicable across the industry, arguing that voluntary frameworks are complements to, not substitutes for, government regulation. In November 2025, on 60 Minutes, he publicly called for mandatory regulation — a CEO advocating the constraint of his own company because he believed the constraint was necessary.

Key Ideas

Prospective not reactive. The framework anticipates risks associated with future capability levels and specifies required safety measures before the capability is achieved, rather than waiting for incidents to trigger measures.

Capability and safety co-evaluated. A system's readiness for deployment is determined by the relationship between its capability and its safety infrastructure — not by either alone.

Binding, not advisory. Thresholds are commitments the organization makes that constrain behavior in ways that may cost revenue. A framework that can be suspended under commercial pressure is not a framework.

Published risk categories. The framework specifies evaluations for biological weapons assistance, cyberattack capability, autonomous behavior, and manipulation — making the governance structure testable and subject to external scrutiny.

Voluntary complements to regulation. Amodei argues that industry frameworks demonstrate what responsible self-governance looks like but cannot substitute for government regulation that applies equally to all participants.

Debates & Critiques

Critics argue that voluntary frameworks are inherently unstable — that any company can suspend its own policy under sufficient competitive pressure, and that the thresholds themselves are set by the companies that would be constrained by them. Defenders point to the RSP as the first serious attempt at prospective technology governance and argue that industry leadership can create the political space for mandatory standards. The deeper debate concerns whether capability thresholds can be meaningfully defined in advance for systems whose behaviors emerge from training rather than being designed.

Further reading

  1. Anthropic, Responsible Scaling Policy v1.0 (September 2023)
  2. Anthropic, Responsible Scaling Policy: Updates and Framework (ongoing)
  3. Amodei, Dario, 60 Minutes Interview (November 2025)
  4. Ganguli, Deep et al., Red Teaming Language Models to Reduce Harms (Anthropic, 2022)
  5. Shevlane, Toby et al., Model Evaluation for Extreme Risks (DeepMind/Anthropic, 2023)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.