Understanding what large language models are requires understanding what class of system they instantiate. Parallel Distributed Processing (PDP) is that class. The cycle that began with [YOU] on AI aims for clear sight of the machines; PDP is the framework that makes clear sight possible at the level of mechanism. It explains why these systems generalize from examples without being told how, why they fail in graded, content-sensitive ways rather than brittle binary ones, why their knowledge cannot be easily read out or explained, and why the emergent capabilities that have surprised the field were, in retrospect, of the right kind: exactly what a distributed learning system should do when scaled sufficiently.
PDP also explains the limits that the cycle most needs to name. A distributed network stores knowledge in connection weights shared across everything it has learned, so a single new encounter cannot be written in quickly without disturbing what is already stored; this is the mechanism behind the complementary learning systems problem. A network whose representations float free of any grounded world produces exactly the pattern of confident fluency without situational understanding that characterizes current large models. Knowing that these limits are structural consequences of the PDP architecture, not bugs to be patched, changes what one should expect of these systems.
The immediate precursor to PDP was the Interactive Activation model that McClelland and Rumelhart published in 1981, which showed that the word-superiority effect in visual perception—letters identified more accurately inside words than alone—could be explained by a network in which activation flows both upward (from features to letters to words) and downward (from words back to letters), so that knowing the word helps identify its letters. The model demonstrated the central PDP principle in miniature: structured, intelligent behavior emerging from the mutual constraint of many interacting units, with no rules written in.
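The two-way flow of activation can be sketched in a few lines. What follows is a toy illustration, not the published 1981 model: the three-word lexicon, the `settle` function, the `top_down` feedback strength, and the use of normalization in place of explicit inhibition are all assumptions made for brevity. The final letter of the input is degraded so that its features fit K and R equally well.

```python
# Toy sketch of interactive activation; not the published 1981 model.
WORDS = ["WORK", "WORD", "FORK"]  # hypothetical three-word lexicon


def settle(evidence, steps=10, top_down=0.5):
    """Return letter and word activations after `steps` rounds of mutual constraint.

    evidence: one dict per letter position, mapping letter -> bottom-up support.
    top_down: assumed strength of word-to-letter feedback (0 disables it).
    """
    letters = [dict(pos) for pos in evidence]
    words = {}
    for _ in range(steps):
        # Upward pass: a word is active to the degree its letters are supported.
        words = {
            w: sum(letters[i].get(ch, 0.0) for i, ch in enumerate(w)) / len(w)
            for w in WORDS
        }
        # Downward pass: each word sends activation back to its own letters.
        new_letters = []
        for i, pos in enumerate(evidence):
            raw = {
                ch: support + top_down * sum(a for w, a in words.items() if w[i] == ch)
                for ch, support in pos.items()
            }
            total = sum(raw.values())
            # Normalizing within a position stands in for letter-letter competition.
            new_letters.append({ch: v / total for ch, v in raw.items()})
        letters = new_letters
    return letters, words


# Final letter is degraded: the features fit K and R equally well.
ambiguous = [{"W": 1.0}, {"O": 1.0}, {"R": 1.0}, {"K": 0.5, "R": 0.5}]
with_context, _ = settle(ambiguous)
without_context, _ = settle(ambiguous, top_down=0.0)
print(with_context[3], without_context[3])
```

With feedback switched off, K and R stay tied. With feedback on, the words WORK and FORK both support K, and no word in the toy lexicon supports a final R, so K pulls ahead. No rule about spelling appears anywhere in the code.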
The 1986 volumes extended this to cognition broadly. Their central technical contribution was the popularization and systematic application of the backpropagation of error algorithm, presented by Rumelhart, Hinton, and Williams in their landmark chapter of the first volume, which gave networks with hidden layers a way to assign responsibility for errors backward through the layers and adjust every connection accordingly. With hidden layers and a training algorithm, the limitations Minsky and Papert had proven for single-layer networks, such as the inability to compute XOR and other functions that are not linearly separable, fell away. The 1986 past-tense model, showing that a single network could learn both regular and irregular English verb forms and reproduce the U-shaped developmental trajectory of children, without any rule for adding -ed, was the most provocative demonstration: the thing everyone knew required a symbolic rule turned out not to require one.
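A minimal sketch of the idea, assuming a small network with one hidden layer of sigmoid units, a squared-error loss, and the XOR task that a single-layer network cannot solve. The network size, learning rate, random seed, and function names are illustrative choices, not taken from the volumes. On most initializations a network like this learns XOR, though it can stall on a plateau, so the code reports the loss before and after training rather than promising a particular result.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init(n_hidden, rng):
    # Each hidden unit holds [w_x1, w_x2, bias]; the output unit holds
    # one weight per hidden unit followed by its own bias.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(w_hidden, w_out, x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(sum(wo * hi for wo, hi in zip(w_out, h)) + w_out[-1])
    return h, y


def loss(w_hidden, w_out):
    return sum(0.5 * (forward(w_hidden, w_out, x)[1] - t) ** 2 for x, t in XOR)


def train_epoch(w_hidden, w_out, lr):
    for x, t in XOR:
        h, y = forward(w_hidden, w_out, x)
        # Error signal at the output unit (sigmoid derivative is y * (1 - y)).
        delta_out = (y - t) * y * (1.0 - y)
        # Pass the error backward: each hidden unit's share of the blame is
        # scaled by the weight connecting it to the output.
        delta_h = [delta_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
        # Adjust every connection in proportion to its share.
        for j in range(len(h)):
            w_out[j] -= lr * delta_out * h[j]
        w_out[-1] -= lr * delta_out
        for j, w in enumerate(w_hidden):
            w[0] -= lr * delta_h[j] * x[0]
            w[1] -= lr * delta_h[j] * x[1]
            w[2] -= lr * delta_h[j]


rng = random.Random(0)  # fixed seed so the run is repeatable
w_hidden, w_out = init(4, rng)
loss_before = loss(w_hidden, w_out)
for _ in range(5000):
    train_epoch(w_hidden, w_out, lr=0.5)
loss_after = loss(w_hidden, w_out)
print(loss_before, loss_after)
```

Nothing in the code mentions XOR as a rule. The network is given only inputs, targets, and a procedure for distributing blame, and whatever solution it finds lives in the weights.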
The PDP volumes became one of the most cited works in the history of cognitive science. The line from them to the transformer architecture of modern AI is not metaphor but genealogy: the same principle and the same training algorithm, with vastly more layers, parameters, and data.
Distributed representation. In a PDP system, a concept is not a symbol at an address but a pattern of activation spread across many units. Two similar concepts activate overlapping patterns; their similarity is encoded in the geometry of the space. This is the direct ancestor of the embedding: the vector that locates an item in a high-dimensional learned space, whose structure encodes the relations among everything the system knows. The embedding is not an engineering choice; it is distributed representation, implemented at industrial scale.
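The claim that similarity lives in the geometry can be made concrete with a small sketch. The three activation patterns below are invented for illustration; real learned patterns have many more dimensions and graded values. Cosine similarity is one common way to measure overlap, assumed here for simplicity.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two activation patterns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical activation patterns over eight units; the values are made up.
patterns = {
    "robin":  [1, 1, 1, 0, 1, 0, 0, 0],
    "canary": [1, 1, 1, 0, 0, 1, 0, 0],
    "salmon": [1, 0, 0, 1, 0, 0, 1, 1],
}

print(cosine(patterns["robin"], patterns["canary"]))  # 0.75
print(cosine(patterns["robin"], patterns["salmon"]))  # 0.25
```

No unit stands for "bird". The two birds are similar because their patterns share active units, and the same comparison works on a learned embedding vector.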
Emergence without design. The central anti-symbolic claim: you do not have to build in the structure of the domain. Build the learning mechanism (units, connections, backpropagation), expose it to enough data, and the structure of the domain emerges in the weights. Grammar need not be programmed; concepts need not be defined; categories need not be specified. The regularities appear because they are in the data, and a distributed network will find them. The emergent capabilities of modern AI systems are this principle operating at a scale the PDP group could not have demonstrated, though the group predicted them in kind.
Graceful degradation. A symbolic system breaks when a rule or a stored symbol is lost; a PDP system degrades gracefully because knowledge is spread across many weights and the loss of a few affects all outputs mildly rather than destroying any particular one. This maps onto clinical observations: the pattern of cognitive decline in dementia, the specific errors of brain-damaged patients, the graded rather than all-or-nothing character of the impairment. It is also why large AI systems are robust in some ways and brittle in others—they degrade gracefully within their distribution but fail sharply at its edges.
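The contrast can be sketched with a small linear associative memory, assuming Hebbian outer-product storage of two orthogonal patterns. The patterns, the network size, and the number of lesioned connections are hypothetical values chosen to keep the example small; this is an illustration of the principle, not a model of any clinical data.

```python
import random

N_IN = 16
# Two orthogonal input patterns and their target outputs (all values hypothetical).
INPUTS = [[1] * N_IN, [1, -1] * (N_IN // 2)]
TARGETS = [[1, -1, 1, -1], [1, 1, -1, -1]]


def store(inputs, targets):
    """Hebbian outer-product storage: every pair is superimposed on the same weights."""
    n_out, n_in = len(targets[0]), len(inputs[0])
    return [
        [sum(t[i] * x[j] for x, t in zip(inputs, targets)) / n_in for j in range(n_in)]
        for i in range(n_out)
    ]


def recall(weights, x):
    """Each output unit sums its weighted inputs."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]


def lesion(weights, n_lost, rng):
    """Return a copy of the weights with n_lost randomly chosen connections zeroed."""
    damaged = [row[:] for row in weights]
    cells = [(i, j) for i in range(len(weights)) for j in range(len(weights[0]))]
    for i, j in rng.sample(cells, n_lost):
        damaged[i][j] = 0.0
    return damaged


weights = store(INPUTS, TARGETS)
damaged = lesion(weights, n_lost=6, rng=random.Random(0))
for x, t in zip(INPUTS, TARGETS):
    print(t, recall(weights, x), recall(damaged, x))

# Contrast: a symbolic lookup table that loses one entry loses that item outright.
table = {tuple(x): t for x, t in zip(INPUTS, TARGETS)}
del table[tuple(INPUTS[0])]
print(table.get(tuple(INPUTS[0])))  # None
```

In this toy setup the damaged network's outputs shrink toward zero but keep the correct sign, so both stored items remain recoverable. The lookup table, having lost one entry, has nothing left to say about that item.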
The interpretability problem is structural. Because knowledge in a PDP system is distributed across millions or billions of weights, there is no single place to look to find what the system knows about any particular thing. The knowledge is implicit, embedded in the geometry of the weight space, and cannot be extracted without losing the very distributed structure that makes it work. This is the origin of the interpretability problem in AI: not a temporary engineering gap but a structural consequence of distributed representation. McClelland named the problem in 1986; the field is still working on it.