The cycle insists on seeing the AI transition clearly, without the narcotic of hype or the paralysis of dismissal. Rumelhart is where that clarity begins, because he supplies the conceptual vocabulary the public conversation most conspicuously lacks. The loud debate about whether machines “really understand” is, at bottom, a debate about what Rumelhart called learned representations: whether the internal features a trained network uses to process its inputs (features no human designed or specified, found instead by gradient descent) constitute something that deserves to be called understanding. He did not resolve this question. He made it the right question, by showing how much of what looks like rule-following is better explained as the smooth behavior of a system that learned statistical regularities and never represented a rule at all.
His past-tense model is the single most instructive historical episode for thinking clearly about current AI. A small network trained to map present-tense verb sounds to past-tense sounds, with no rule and no list, reproduced one of the most famous signatures of human language acquisition: the U-shaped developmental curve, where a child first gets irregular pasts right (“went”), then over-regularizes them (“goed”), then sorts it out. The critical lesson is not that the network won the argument but that it dissolved an inference everyone had been making: from rule-like behavior to an underlying rule. A system can produce systematically rule-conforming output, including the characteristic errors of rule over-application, with nothing inside it that is a rule. That is a permanent result. And it is exactly the question that must be asked of large language models today: are they following rules, or are they doing, vastly amplified, exactly what the past-tense network did?
The deepest thing Rumelhart teaches the cycle is a discipline of honesty about what we have built and what we do not know about it. Because everything these systems can do was learned by the same blind procedure of error reduction, the system’s values, blind spots, and failure modes are all functions of what it was trained on and what error it was told to minimize, and nothing else. Backpropagation does not know bias from signal; it descends whatever error surface it is given. The opacity of neural networks is not a bug; it is the direct, unavoidable consequence of the fact that representations are learned rather than designed, which is the same fact that makes them powerful. Rumelhart paid for capability with comprehension, and he knew it. The hardest thing he leaves us, harder even than the question of machine consciousness, is the honest admission that we are responsible for systems whose internal workings we did not design and cannot fully read.
David Everett Rumelhart (1942–2011) was born in Mitchell, South Dakota, and trained as a mathematical psychologist—a field that applied formal methods to psychological questions—earning a BA in psychology and mathematics from the University of South Dakota and a PhD in mathematical psychology from Stanford. He spent two decades at the University of California, San Diego, before returning to Stanford in 1987. His colleagues remembered him as “the quietest, most unassuming powerhouse of intellect”—reserved to the point of near-invisibility at conferences, yet possessed of a mind that reorganized a field. He was honored with a MacArthur Fellowship and election to the National Academy of Sciences.
The 1986 paper he co-authored with Geoffrey Hinton and Ronald Williams, “Learning representations by back-propagating errors,” did not invent the mathematics of backpropagation; the underlying technique had been derived independently several times. It did something different and, for the history of AI, decisive: it applied the technique to networks of neuron-like units, showed empirically that it actually worked to discover useful internal representations, and embedded it in a theory of cognition that made people care. In the same year, the two Parallel Distributed Processing volumes appeared from MIT Press, presenting connectionism as a full-scale alternative to the symbolic paradigm. Roughly a decade later, by the mid-1990s, their author was showing signs of the disease that would end his career and, in 2011, his life. The Rumelhart Prize, cognitive science’s highest honor, bears his name.
His intellectual lineage ran against the mainstream of his era. When connectionism challenged the symbolic AI establishment in 1986, serious people called it a naive detour: the brain was a messy biological accident, and intelligent systems should be built from the abstract logic of intelligence in clean symbolic form. Rumelhart disagreed with a patience that was almost relentless in its consistency: he built the strongest possible version of the connectionist case, conceded the technical points his critics got right, and held firm on the conceptual core. He turned out to be right, not in every particular, but in the central bet. The brain’s style of computation, extracted at the right level of abstraction, proved to be the foundation for the most powerful AI ever built. He did not live to see it confirmed.
Backpropagation. The algorithm that trains virtually every modern deep network runs a forward pass to make a prediction, computes an error against the target, and then uses the chain rule of calculus to propagate the error backward through every layer, adjusting each weight by a small amount in the direction that reduces the error. Repeat millions of times over many examples. The result is a network that has learned to minimize error on its training task, with no human specifying what each weight should represent. Because everything is learned by this single blind procedure, the system’s capabilities and its failure modes are both functions of the data and objective it was given. The indifference of the gradient to the destination is both its power and its hazard.
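The forward pass, error, and backward pass described above can be made concrete in a few lines. What follows is a minimal sketch in plain Python, not the 1986 implementation: the network size (two inputs, three sigmoid hidden units, one sigmoid output), the squared-error loss, the learning rate, the seed, and every function name are assumptions chosen for illustration. The task is XOR, a classic problem that a network cannot solve without hidden units.

```python
import math
import random

# Illustrative data: the XOR truth table as (input, target) pairs.
XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    """Small random weights; sizes and seed are arbitrary choices."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Forward pass: hidden activations h, then the output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"])
    return h, y

def loss(p, data):
    """Summed squared error over the data set."""
    return sum(0.5 * (forward(p, x)[1] - t) ** 2 for x, t in data)

def backward(p, x, t):
    """Chain rule: push the output error back to every weight."""
    h, y = forward(p, x)
    delta_out = (y - t) * y * (1 - y)  # dE/d(output pre-activation)
    delta_hid = [delta_out * w * hj * (1 - hj) for w, hj in zip(p["W2"], h)]
    return {
        "W1": [[d * xi for xi in x] for d in delta_hid],
        "b1": delta_hid,
        "W2": [delta_out * hj for hj in h],
        "b2": delta_out,
    }

def apply_update(p, g, lr):
    """Move each weight a small step against its gradient."""
    for j in range(len(p["W2"])):
        for i in range(len(p["W1"][j])):
            p["W1"][j][i] -= lr * g["W1"][j][i]
        p["b1"][j] -= lr * g["b1"][j]
        p["W2"][j] -= lr * g["W2"][j]
    p["b2"] -= lr * g["b2"]

def train(p, data, lr=0.5, epochs=2000):
    """Full-batch gradient descent: all gradients are computed before any update."""
    for _ in range(epochs):
        grads = [backward(p, x, t) for x, t in data]
        for g in grads:
            apply_update(p, g, lr)
    return p

if __name__ == "__main__":
    params = init_params(2, 3)
    print("loss before training:", loss(params, XOR))
    train(params, XOR)
    print("loss after training: ", loss(params, XOR))
```

Nothing in `backward` knows what XOR is; it only follows the error surface defined by the data and the loss. Whether a given run reaches a clean XOR solution depends on the seed and learning rate, so the sketch should be read as an illustration of the procedure rather than a guarantee of convergence.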
Parallel Distributed Processing. The PDP framework argues that intelligence arises from many simple units operating in parallel, with knowledge encoded in the strengths of their connections rather than in symbols or rules, and with computation proceeding as a soft, statistical settling toward an answer rather than a deductive derivation. These principles are now the literal engineering assumptions of modern AI. The PDP volumes won as engineering far more completely than they won as a theory of the human mind: we have connectionist systems of staggering capability, and whether the human mind is, in its deep structure, a PDP machine remains genuinely open. But the paradigm that the symbol-processing tradition called a romantic detour became the only game in town.
Learned Representations. The title of the 1986 paper—“Learning representations by back-propagating errors”—names what Rumelhart considered the central achievement. The hidden units of a trained network come to represent important features of the task domain, but nobody decided what those features would be. The network invents its own vocabulary under the pressure of reducing error. This is the precise origin of what we now call end-to-end deep learning: the practice of letting a single network learn the entire mapping from raw input to desired output, discovering its own representations without human feature engineering. The power of this approach is vast and its consequences for interpretability are severe: the representations are not labeled, their content was never specified by anyone, and they must be reverse-engineered after the fact.
The Past-Tense Model and the Rules vs. Statistics War. Rumelhart’s 1986 past-tense model, built with James McClelland, showed that a single connectionist network could reproduce both regular and irregular English pasts, including the characteristic U-shaped developmental curve, without containing any rule. This dissolved the inference from rule-like behavior to underlying rule. Steven Pinker and Alan Prince contested the specific model successfully: its representation of word sounds was crude, its training regime artificial, its errors unlike any child’s. But the deeper point survived: you cannot infer the presence of a rule from behavior that conforms to one. That result applies to every language model today. Whether they reason or merely emit reasoning-shaped text generated from statistical patterns is the question Rumelhart’s verbs force, and it does not have a confident answer.
Graceful Degradation. A distributed system fails differently from a conventional computer. Damage some units, corrupt some weights, present unusual input, and the network does not halt or produce garbage; it gets a little worse, its answers blurrier and less confident but mostly appropriate. This fault tolerance is both the great virtue and the great hazard of connectionist systems. The virtue: the robustness that makes neural networks practical to deploy at scale, to prune, to quantize without collapse. The hazard: a system that degrades gracefully does not announce its failures. It produces fluent, confident, plausible output even as that output silently drifts away from correctness. The hallucination problem is not an incidental defect; it is the architectural shadow side of distributed representation, the cost of encoding knowledge as geometry rather than discrete fact.
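The contrast between halting and merely getting worse can be seen in a toy experiment. The sketch below is an assumption-laden illustration, not a model from the PDP volumes: it swaps in a small Hopfield-style associative memory with Hebbian weights, stores eight mutually orthogonal patterns across 64 units, then zeroes a growing fraction of the connections and measures how many units still recall the stored pattern after one update. The pattern construction, the sizes, and all function names are hypothetical choices made for brevity.

```python
import random

N = 64  # number of units (a power of two, so the patterns below are orthogonal)
M = 8   # number of stored patterns

def hadamard_row(i, n):
    """Row i of the n x n Sylvester-Hadamard matrix, with entries +1 or -1."""
    return [1 if bin(i & j).count("1") % 2 == 0 else -1 for j in range(n)]

# Rows 1..M are mutually orthogonal; row 0 (all +1) is skipped.
PATTERNS = [hadamard_row(i, N) for i in range(1, M + 1)]

def hebbian_weights(patterns):
    """Each pattern is spread across every connection; no unit stores it alone."""
    n = len(patterns[0])
    return [[0.0 if i == j else sum(p[i] * p[j] for p in patterns) / n
             for j in range(n)] for i in range(n)]

def damage(weights, fraction, seed=0):
    """Copy the weights, zeroing each connection independently with probability `fraction`."""
    rng = random.Random(seed)
    return [[0.0 if rng.random() < fraction else w for w in row] for row in weights]

def recall(weights, pattern):
    """One synchronous update starting from the stored pattern (ties go to +1)."""
    return [1 if sum(w * s for w, s in zip(row, pattern)) >= 0 else -1
            for row in weights]

def accuracy(weights, patterns):
    """Fraction of units, over all patterns, that keep their stored value."""
    correct = sum(o == s for p in patterns for o, s in zip(recall(weights, p), p))
    return correct / (len(patterns) * len(patterns[0]))

if __name__ == "__main__":
    W = hebbian_weights(PATTERNS)
    for fraction in (0.0, 0.25, 0.5, 0.75, 0.9, 1.0):
        print(f"{fraction:.2f} of connections removed -> "
              f"recall accuracy {accuracy(damage(W, fraction), PATTERNS):.3f}")
```

With no damage the recall is exact, and with every connection removed it falls to chance. In between, accuracy tends to slide rather than collapse, and nothing in the output flags which units are now wrong. That silence is the hazard the paragraph above describes.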