Superposition is the phenomenon, formalized by Anthropic's interpretability research team, in which a single neuron in a large language model responds to multiple seemingly unrelated concepts, encoding information in overlapping activation patterns. A neuron that activates on references to the Golden Gate Bridge might also respond to certain types of legal language and to discussions of a particular historical period, not because these concepts are obviously related but because the model has learned to encode them in overlapping patterns of activation. Superposition explains why simple approaches to interpretability, those that tried to assign a single meaning to each neuron, consistently failed: neurons do multiple things simultaneously, encoding information in a compressed, overlapping format that maximizes the network's capacity but makes interpretation extraordinarily difficult.
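The polysemantic-neuron picture can be sketched numerically. In this toy illustration (the concept names, dimensionality, and mixing weights are all invented for the example), a single neuron's weight vector is a mixture of two unrelated concept directions, so the neuron fires for either concept:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # toy activation dimensionality, not a real model size

# Hypothetical, randomly chosen unit-norm concept directions
bridge = rng.standard_normal(d)
legal = rng.standard_normal(d)
bridge /= np.linalg.norm(bridge)
legal /= np.linalg.norm(legal)

# One neuron whose weight vector mixes both directions: it responds to
# 'Golden Gate Bridge' inputs and to 'legal language' inputs alike
w = 0.7 * bridge + 0.7 * legal

# Both projections are substantially positive, so the neuron looks
# "about" both concepts when probed in isolation
print(w @ bridge, w @ legal)
```

Random high-dimensional directions are nearly orthogonal, which is what lets one weight vector carry both concepts with little mutual interference.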
The formalization of superposition was itself a major contribution to the field's understanding of how neural networks represent information. It explained observations that had puzzled researchers for years: why attempts to label neurons with single concepts produced inconsistent results, why probing individual neurons seemed to reveal contradictory information, and why straightforward approaches to mechanistic interpretability kept hitting walls. The puzzle dissolved once researchers recognized that neurons are optimized not for human-interpretable concept labeling but for efficient information encoding, and that efficient encoding involves packing multiple concepts into overlapping representational space.
The technical explanation involves the relationship between the number of concepts a network needs to represent and the number of neurons available to represent them. When the concept space exceeds the neuron space, as it typically does for language models trained on the full diversity of human text, the network cannot dedicate a single neuron to each concept. Instead, it must pack multiple concepts into shared representational space, using different combinations of neurons to distinguish between them. The packing works because concept activations are sparse: only a small fraction of concepts are relevant to any given input, so the interference between overlapping representations stays manageable in practice. This compression is mathematically efficient but interpretively opaque.
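A minimal numerical sketch makes the compression concrete (the counts of 256 concepts and 64 neurons are arbitrary illustrative choices): many more concept directions than neurons are packed into shared space, and because only a few concepts are active at once, a naive linear readout still separates signal from interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_neurons = 256, 64  # more concepts than neurons

# Random unit-norm concept directions packed into shared neuron space
W = rng.standard_normal((n_neurons, n_concepts))
W /= np.linalg.norm(W, axis=0)

# Sparse input: only two of the 256 concepts are active at once
f = np.zeros(n_concepts)
f[[3, 27]] = 1.0

x = W @ f          # compressed representation across 64 neurons
f_hat = W.T @ x    # naive linear readout of all 256 concepts

# The two active concepts stand out near 1.0; the remaining entries
# are small interference noise from the overlapping directions
print(f_hat[3], f_hat[27], np.abs(np.delete(f_hat, [3, 27])).mean())
```

The same packing that makes this efficient is what makes any single neuron, a single row of W, uninterpretable on its own: every neuron participates in representing many concepts at once.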
The Anthropic team developed techniques to disentangle these overlapping representations, identifying what they called features: directions in the model's activation space that corresponded to interpretable concepts even when no individual neuron did. Feature-level analysis allowed researchers to identify, for example, a 'Golden Gate Bridge feature' that existed as a specific pattern of activation across many neurons rather than in any single neuron. This approach represented genuine progress on the interpretability problem, making visible structures that superposition had hidden.
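The sparse-autoencoder idea behind this feature-level analysis can be sketched in miniature. The class below is a simplified stand-in, not Anthropic's implementation; the layer sizes, tied-weight initialization, and L1 penalty coefficient are illustrative assumptions. The essential move is expanding the d-dimensional neuron activations into a much larger set of non-negative, mostly-zero feature coefficients:

```python
import numpy as np

class SparseAutoencoder:
    """Toy sketch of a sparse autoencoder for feature extraction:
    expand d_act neuron activations into d_feat >> d_act sparse features."""

    def __init__(self, d_act, d_feat, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.standard_normal((d_act, d_feat)) / np.sqrt(d_act)
        self.W_dec = self.W_enc.T.copy()  # tied init; learned separately in practice
        self.b_enc = np.zeros(d_feat)

    def encode(self, x):
        # ReLU keeps feature coefficients non-negative; combined with the
        # L1 penalty during training, this drives most of them to zero
        return np.maximum(0.0, x @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Each learned feature is a direction (a row of W_dec) in activation
        # space; reconstruction sums the active feature directions
        return f @ self.W_dec

    def loss(self, x, l1_coeff=1e-3):
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = np.mean((x - x_hat) ** 2)   # reconstruction fidelity
        sparsity = l1_coeff * np.abs(f).mean()  # pressure toward few active features
        return recon + sparsity

sae = SparseAutoencoder(d_act=16, d_feat=128)
x = np.random.default_rng(2).standard_normal((4, 16))
features = sae.encode(x)
print(features.shape)  # (4, 128): an overcomplete, non-negative feature basis
```

After training, a single feature coefficient, rather than a single neuron, is the unit one attempts to label with a concept such as 'Golden Gate Bridge'.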
The concept has implications beyond its technical significance. Superposition demonstrates that the gap between what neural networks do and why they do it is not a matter of insufficient effort by researchers but a structural consequence of how distributed information processing works. The gap is not a problem that better software engineering will solve. It requires fundamentally different analytical approaches, and those approaches are expensive, technically demanding, and unlikely to produce near-term commercial returns. Amodei's emphasis on interpretability as the most important and most underfunded area of AI safety research reflects the recognition that superposition and related phenomena are not incidental obstacles but intrinsic features of neural computation, requiring sustained investment.
The superposition phenomenon was formalized in Anthropic's 2022 paper 'Toy Models of Superposition' by Nelson Elhage, Tristan Hume, and colleagues. The paper built on earlier observations about polysemanticity — the observation that individual neurons appeared to respond to multiple concepts — and provided a theoretical framework explaining why this pattern emerged from the training process.
Subsequent work, including the 2024 paper 'Scaling Monosemanticity,' demonstrated that feature-level analysis could successfully disentangle superposed representations at the scale of production language models, making interpretability research applicable to the models actually being deployed rather than only to toy examples.
Single neurons, multiple concepts. A neuron that responds to the Golden Gate Bridge may also respond to legal language and historical periods through overlapping activation patterns.
Structural, not accidental. Superposition emerges from the mathematical efficiency of packing more concepts than neurons into shared representational space.
Polysemanticity explained. The phenomenon explains why earlier interpretability approaches assigning single meanings to neurons consistently failed.
Features as alternative unit. Feature-level analysis identifies interpretable concepts as patterns across many neurons rather than in single neurons.
Interpretability gap is structural. Superposition demonstrates that the gap between behavior and explanation is not a matter of insufficient effort but a feature of distributed processing.
Current research debates concern whether superposition intensifies with model size (producing ever more compressed representations) or whether there are limits beyond which networks shift toward more interpretable encodings. A related debate concerns whether sparse autoencoders, currently the primary tool for extracting features from superposed representations, can scale to frontier models or whether new analytical approaches are required.