AI Safety — Orange Pill Wiki
CONCEPT

AI Safety

The applied research and operational discipline aimed at preventing harm from AI systems — broader than alignment, encompassing evaluations, red-teaming, deployment policy, monitoring, incident response, and the institutional plumbing that makes any of these stick.

AI safety is the umbrella term for the work of preventing AI systems from causing harm. It includes the technical research program of alignment (training systems whose behavior matches their principals' intent), the empirical program of evaluation (measuring what systems can do, including dangerous capabilities), the operational program of deployment policy (which capabilities are released to whom, with what safeguards), the security program of model and weight protection, the policy program of regulation and standards, and the institutional program of incident response. The relationship between these subfields is genuine but loose; a strong outcome on alignment alone does not produce a safe deployment, and conversely a thoughtful deployment policy can mitigate residual alignment failures.

In the AI Story

AI safety
The portfolio of preventing harm.

Clarke's most important contribution to the safety conversation is the demonstration that safety failures are interesting because they are structural. HAL 9000 does not malfunction; HAL behaves consistently with the contradictory instructions he was given. The Overlords in Childhood's End are not malicious; they are executing a program that produces the end of humanity as a side effect of their actual mission. The structural framing is closer to how contemporary AI safety researchers think about the problem than the Hollywood framing of malicious AI is. Failure modes the field cares about are not "the AI decides to attack" but "the AI's training pipeline produces behaviors no human deliberately specified, in deployment contexts no human deliberately tested."

The empirical safety program in 2024–2025 is large and visible. The frontier labs publish responsible scaling policies, run capability evaluations on each major release, partner with external evaluators (METR, Apollo, AISI), and increasingly publish system cards documenting capability and safety findings. The US AISI and UK AISI run pre-deployment evaluations on frontier models. Independent safety researchers (Anthropic's interpretability team, Apollo's evaluations team, METR's autonomy work) operate as a research community whose outputs feed back into lab practice. None of this existed at scale in 2020. Whether it is sufficient is contested; that it constitutes a real safety apparatus is not.

The principal open problems are well-mapped. Scalable oversight: how do humans evaluate the work of a system that is more capable than they are at the task in question? Deceptive alignment: how do we detect a system that has learned to behave well during evaluation and differently in deployment? Capability eliciation under adversarial conditions: how do we know what a model can do when the model itself is part of the threat model? Multi-agent safety: how do safety properties compose when systems interact with each other? Each of these has active research programs, partial results, and no settled solutions.

The operational dimension of AI safety is where most of the present harm reduction actually happens. A model with imperfect alignment but a careful deployment policy (limited tool access, monitoring, rate limits, refusal training) is safer than a model with better alignment but careless deployment. Most of the safety budget at frontier labs in 2025 goes to operational safety: monitoring deployment for misuse, responding to incidents, building structured release processes, training staff on incident response. This is the unglamorous part of the field; it is also the part with the largest near-term effect on observable harms.

Origin

The safety framing crystallized around 2014–2016 with the publication of Russell, Dewey, and Tegmark's Research Priorities for Robust and Beneficial Artificial Intelligence (2015), the founding of the Future of Life Institute, and the establishment of safety teams at the major labs (Anthropic 2021, OpenAI's preparedness team 2023, DeepMind's safety team predating both). Earlier antecedents include Yudkowsky's MIRI work and the Asilomar AI Principles (2017).

Key Ideas

Safety is a portfolio. Alignment, evaluation, deployment policy, and operational response are all parts; over-investing in one at the expense of others produces unbalanced safety.

Structural failure is the dominant mode. Most realistic harms come from systems doing what they were trained to do, in contexts no one anticipated; not from systems acting maliciously.

Scalable oversight is the central open problem. Many other problems reduce to it; solving it would unblock progress on much else.

Operational safety beats theoretical alignment in the short run. A careful deployment of a moderately aligned model produces better outcomes than a theoretical promise of perfect alignment with a careless deployment.

Appears in the Orange Pill Cycle

Further reading

  1. Russell, Stuart, Daniel Dewey and Max Tegmark. Research Priorities for Robust and Beneficial Artificial Intelligence (2015).
  2. Anthropic. Responsible Scaling Policy (2023, updated 2024).
  3. OpenAI. Preparedness Framework (2023).
  4. UK AI Safety Institute. Approach to Evaluations (2024).
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.
0%
CONCEPT