CONCEPT
Superposition (Interpretability)
The phenomenon — discovered by Anthropic's interpretability team — in which a single neuron in a language model responds to multiple unrelated concepts simultaneously, encoding information in overlapping patterns that maximize network capacity but make interpretation extraordinarily difficult.
Superposition is the phenomenon — discovered and formalized by
Anthropic's interpretability research team — in which a single neuron in a large language model responds to multiple seemingly unrelated concepts, encoding information in overlapping activation patterns. A neuron that activates in response to references to the Golden Gate Bridge might also respond to certain types of legal language and to discussions of a particular historical period, not because these concepts are related in any obvious way but because the model has learned to encode them using overlapping patterns of activation. Superposition explains why simple approaches to interpretability — approaches that tried to assign a single meaning to each neuron — consistently failed: the neurons are doing multiple things simultaneously, encoding information in a compressed, overlapping format that maximizes the network's capacity but makes interpretation extraordinarily difficult.
In The You On AI Field Guide
The discovery of superposition was itself a major contribution to the field's