CONCEPT
Constitutional AI
Anthropic's alignment approach that trains models to evaluate their own outputs against a set of written principles — replacing the implicit, averaged preferences of human evaluators with explicit, legible values embedded in the training process itself.
Constitutional AI is the alignment methodology developed by Amodei's team at Anthropic as a structural response to the limitations of reinforcement learning from human feedback. Rather than relying exclusively on human evaluators to judge model outputs, the approach gives the model a written constitution (principles expressed in natural language) and trains the model to evaluate its own outputs against those principles. The constitution is not a filter applied after generation but a set of values embedded in training itself, shaping how the model learns to respond at the level of its fundamental operation. Principles include choosing the most helpful response while being least harmful, being honest, and supporting human autonomy. The approach addresses three structural problems with standard RLHF: scalability, coherence, and transparency.
In The You On AI Field Guide
The standard approach to alignment — reinforcement learning from human feedback — relied on human evaluators judging outputs and providing feedback that shaped subsequent behavior.