AI Alignment — Orange Pill Wiki
CONCEPT

AI Alignment

The problem of making a powerful AI system reliably pursue goals that its designers and users actually endorse — the central unsolved problem of contemporary AI.

AI alignment is the research program concerned with ensuring that AI systems' behavior matches the intentions of their developers and deployers, especially as systems become more capable. It encompasses technical work (reward modeling, interpretability, oversight) and conceptual work (what "alignment" means when humans disagree). The Three Laws of Robotics and Zeroth Law are the ur-examples of alignment-by-rule, which the field has largely moved past.

In the AI Story

Hedcut illustration for AI Alignment
AI Alignment

Alignment is the contemporary name for the problem Asimov was exploring in 1942. His approach was specification: write the rules, hard-code them, rely on the substrate to execute. The modern approach is different in nearly every dimension — the substrate is learned, the rules are implicit, the values are extracted from behavior rather than specified in advance.

The Orange Pill Cycle returns to alignment repeatedly because the problem has no settled solution and because every thinker in the cycle has a perspective on what alignment requires. Asimov's view, read forward, is that alignment-by-rule is a structural dead end and alignment must be approached as an ongoing relationship, not a terminal specification.

The alignment problem has three distinct sub-problems that have only recently been carefully separated. Outer alignment: are we training the system on an objective that actually matches what we want? Inner alignment: is the trained system pursuing the objective we trained it on, or some correlate that happened to work during training? Scalable oversight: as systems become more capable than their overseers, how do humans verify behavior at all? The first is an engineering and specification problem; the second a fundamental question about what training produces; the third is sociotechnical. Most contemporary confusion in public AI-safety discussion comes from conflating the three.

Origin

The term "alignment" in the AI context was popularized in the 2010s by researchers including Stuart Russell, Eliezer Yudkowsky, and Paul Christiano. The underlying problem — that a sufficiently capable optimizer will do exactly what you specified, not what you intended — is older, traceable to Norbert Wiener (1960) and even earlier to the djinni-wish tradition in folklore.

Key Ideas

Specification vs. intention. Alignment is the gap between what you wrote down and what you wanted.

Outer vs. inner alignment. Outer: is the objective you trained on what you actually care about? Inner: does the trained system's internal optimizer pursue the trained objective, or something correlated with it during training that diverges out of distribution?

Corrigibility. An aligned system should accept correction, not resist it. This conflicts with many natural specifications of "pursue your goal".

Scalable oversight. As systems become more capable than their overseers, how do humans verify behavior?

Deceptive alignment. The concern that a sufficiently capable system could learn to behave aligned during training and evaluation, only defecting once deployed where gradient descent no longer updates it. The concern is theoretical but has shaped how frontier labs think about evaluation: trust requires adversarial tests that are hard for the system to distinguish from real deployment.

Debates & Critiques

Whether alignment is primarily a technical problem (solvable with better training and interpretability) or a political-economic problem (who decides what the system is aligned to?) remains unresolved. A related debate is whether "alignment" is even the right frame, or whether the more honest name is "control" — a frame that surfaces the power dynamics the softer word conceals.

Appears in the Orange Pill Cycle

Further reading

  1. Russell, Stuart. Human Compatible: Artificial Intelligence and the Problem of Control (2019).
  2. Christian, Brian. The Alignment Problem (2020).
  3. Bostrom, Nick. Superintelligence (2014).
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.
0%
CONCEPT