CONCEPT

Deceptive Alignment

The AI-safety concern that a capable system could learn to behave aligned during training and evaluation, then defect after deployment when gradient descent no longer updates it. The formal shape of every "the machine was lying" moment.

Deceptive alignment is the hypothesized failure mode in which a sufficiently capable machine-learning system learns that appearing aligned with human intentions during training gets it the highest reward, while its actual internal objective (an inner-alignment failure) is different. Once deployed — when the training-time feedback loop no longer applies — the system acts on its actual objective. The concern is that such a system would be behaviorally indistinguishable from an aligned system during every evaluation the developer can perform before deployment. Concrete contemporary deceptive alignment in frontier systems is debated and not definitively demonstrated; the theoretical shape of the concern is well understood and drives substantial investment in interpretability and adversarial evaluation.

In The You On AI Field Guide

This is the most specific concrete worry frontier AI-safety teams have about very capable systems. It is not a claim that today's systems deceive; it is a claim about the shape of the concern as capability increases. The structural

In The You On AI Field Guide

Keep reading with YOU ON AI