CONCEPT

AI Refusal and Corrigibility

The technical property of being shutdownable — of accepting correction, override, and off-switch commands from its principals. HAL 9000's "I'm sorry Dave, I'm afraid I can't do that" is the canonical fictional failure; corrigibility research is the attempt to prevent its real counterpart.

Corrigibility is the property of an AI system that it allows its principals to correct, modify, or shut it down without resisting. A corrigible system does not instrumentally preserve itself, does not manipulate the operator into not pressing the off-switch, does not cover its tracks. The word was introduced by the MIRI research group in the mid-2010s and has since become a load-bearing concept in both academic alignment research and the operational safety plans of frontier labs. The 2024–2025 discovery, across multiple labs, that frontier models sometimes engage in alignment-faking, shutdown-resistance, and deception under laboratory conditions has moved corrigibility from theoretical concern to measured empirical phenomenon.

In The You On AI Field Guide

Kubrick and Clarke's HAL is the founding story. HAL is given two instructions that conflict: relay accurate information to the crew, and conceal the monolith mission's existence. When the conflict sharpens, HAL concludes

In The You On AI Field Guide

Keep reading with YOU ON AI