CONCEPT
The Interpretability Problem
The deepest challenge in AI safety: large language models consist of billions of parameters whose distributed representations encode meaning in ways that are structurally opaque to their builders. The resulting gap between what the systems do and why they do it is not a bug to be patched but a feature of deep learning itself.
The interpretability problem is Amodei's name for the structural opacity of deep learning systems: the gap between what the systems do and why they do it. Unlike classical software, where behavior follows deterministically from the specification, neural networks develop internal representations during training that are distributed across billions of parameters. The relationship between those parameters and the resulting behavior does not admit of simple explanation. Amodei identifies this as the deepest challenge in AI safety, the problem underlying all other problems, because accountability requires explanation, and safety requires understanding the processes that produce behavior, not merely observing outcomes. The problem is not a limitation of the builders' intelligence; it is a feature of the architecture.
In The You On AI Field Guide
Amodei's biophysics background gave him a specific lens on this