Amodei's biophysics background gave him a specific lens on this problem. In neuroscience, the same challenge had existed for decades: understanding how the distributed activity of billions of neurons gives rise to coherent thought. Neuroscientists made progress by studying circuits — groups of neurons that work together to perform specific functions. Amodei brought this perspective to artificial neural networks, and the perspective shaped the direction of Anthropic's interpretability research in ways that distinguished it from work at other labs.
One early line of work examined individual neurons, attempting to identify those that responded to specific concepts. The approach had limited success because representation in large language models is fundamentally distributed. A single neuron might activate in response to multiple seemingly unrelated concepts, a behavior known as polysemanticity, which the team explained through what it called superposition: the model packs more concepts into its activations than it has neurons, so the concepts must share them. A neuron responding to references to the Golden Gate Bridge might also respond to certain legal language and to discussions of a historical period, not because these concepts were related but because the model had learned to encode them using overlapping patterns. The discovery of superposition was itself a major contribution, explaining why simple interpretability approaches consistently failed.
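The overlap described above can be made concrete with a toy sketch. It assumes a hypothetical two-neuron layer that has to store three concepts; the concept names are borrowed from the example in the text, while the angles, the threshold, and the function names are illustrative choices, not values taken from any real model or from Anthropic's work.

```python
import math

# Hypothetical toy setup: 3 concepts stored in only 2 neurons.
# Each concept gets a unit-length direction in the 2-D activation space.
FEATURES = ["golden_gate_bridge", "legal_language", "historical_period"]
ANGLES_DEG = [0.0, 60.0, 120.0]  # illustrative values, not measured from a model

DIRECTIONS = {
    name: (math.cos(math.radians(angle)), math.sin(math.radians(angle)))
    for name, angle in zip(FEATURES, ANGLES_DEG)
}


def neuron_activations(active_feature, strength=1.0):
    """Return both neuron activations when a single concept is present."""
    dx, dy = DIRECTIONS[active_feature]
    return (strength * dx, strength * dy)


def features_driving_neuron(neuron_index, threshold=0.4):
    """List the concepts that push the given neuron above the threshold."""
    return [
        name
        for name in FEATURES
        if neuron_activations(name)[neuron_index] > threshold
    ]


if __name__ == "__main__":
    for i in range(2):
        print(f"neuron {i} responds to: {features_driving_neuron(i)}")
```

In this toy, each neuron fires for two concepts that have nothing to do with each other, simply because three directions cannot all be kept separate in a two-dimensional space. Reading either neuron alone therefore says little about which concept is present.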
The team developed techniques to disentangle overlapping representations, identifying what they called features: directions in the model's activation space that correspond to interpretable concepts. This represented genuine scientific progress, and the results were published for the broader research community, consistent with Amodei's commitment to treating safety research as a public good. But Amodei was candid about the gap between what interpretability could explain and what the models could do. The gap was not narrowing. In some respects it was widening, because each advance in capability opened new behavioral territories that interpretability research had not yet explored.
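The idea of a feature as a direction in activation space can be sketched in a few lines. The example below assumes the feature directions are already known and uses a plain dot-product projection to read them out; this is a stand-in for illustration only, since the text does not detail the team's actual disentangling techniques. All names and numbers are hypothetical.

```python
import math

# Hypothetical unit-length feature directions in a 2-neuron activation space.
FEATURE_DIRECTIONS = {
    "golden_gate_bridge": (1.0, 0.0),
    "legal_language": (math.cos(math.radians(60)), math.sin(math.radians(60))),
    "historical_period": (math.cos(math.radians(120)), math.sin(math.radians(120))),
}


def project(activation, direction):
    """Dot product of an activation vector with a feature direction."""
    return sum(a * d for a, d in zip(activation, direction))


def feature_scores(activation):
    """Score every known feature by projecting the activation onto it."""
    return {
        name: project(activation, direction)
        for name, direction in FEATURE_DIRECTIONS.items()
    }


def strongest_feature(activation):
    """Return the feature whose direction best matches the activation."""
    scores = feature_scores(activation)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Activation produced by the "legal_language" concept alone.
    activation = FEATURE_DIRECTIONS["legal_language"]
    for name, score in feature_scores(activation).items():
        print(f"{name}: {score:.2f}")
    print("strongest:", strongest_feature(activation))
```

The second neuron takes the same value for "legal_language" and "historical_period" in this setup, so it cannot tell them apart on its own, whereas the projection onto each direction can. The other features still receive nonzero scores, a small-scale picture of the interference that makes real disentangling hard.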
The interpretability problem has implications beyond the technical. A system whose behavior cannot be explained is a system that cannot be held accountable for that behavior. Accountability requires explanation: the ability to identify why a system produced a specific output and whether the process that produced it met the standards the system is supposed to follow. When a system operates as a black box, accountability operates at the level of outcomes rather than processes, which means harmful processes can persist as long as they happen not to produce visibly harmful outcomes. Segal's example from You On AI, in which Claude produced an elegant but incorrect passage about Deleuze, is the interpretability problem expressed at paragraph scale.
Amodei established the interpretability research program at Anthropic in 2021 as a founding priority, building a team led by Chris Olah, who had previously worked on circuit-level understanding of vision networks at OpenAI. The team's work on superposition, features, and mechanistic interpretability became foundational to the field.
In a 2024 essay, Amodei described interpretability as the most important and most underfunded area of AI safety research — simultaneously a statement about the field's priorities and a critique of them. The candor was itself a safety practice: a leader claiming interpretability was on track to fully explain model behavior would be providing false assurance more dangerous than honest uncertainty.
Structural opacity, not engineering failure. The gap between what deep learning systems do and why they do it is a feature of distributed information processing, not a bug that better engineering can patch.
Superposition as fundamental obstacle. A single neuron encodes multiple concepts through overlapping activation patterns, which is why simple approaches assigning single meanings to neurons consistently fail.
Accountability requires explanation. Behavioral measures provide some safety assurance but cannot substitute for understanding internal processes. Track-record-based trust is adequate for low-stakes decisions and inadequate for high-stakes ones.
The gap is widening, not narrowing. Each advance in capability opens new behavioral territories faster than interpretability methods can map them. Humility about this is itself a safety practice.
Systemic effects amplify the problem. When AI systems are deployed at scale, their aggregate effects shape the information environment in ways current interpretability cannot analyze. Bias that is invisible in any single interaction becomes consequential across millions of them.
The central debate is whether interpretability will ever fully close the gap with capability or whether the two are structurally destined to diverge. Optimists point to rapid progress in mechanistic interpretability; skeptics argue that understanding a system of billions of parameters may be fundamentally beyond what human cognition can achieve unaided. A related debate concerns whether AI systems should be withheld from consequential domains until interpretability is complete — Amodei rejects this absolutism, favoring graduated deployment proportional to achieved understanding.