CONCEPT

Backpropagation

The universal training algorithm of modern artificial intelligence—the procedure that sends an error signal backward through every layer of a neural network and adjusts each connection by a hair in the direction that would have reduced the mistake, repeated until the network learns.

Backpropagation is the closest thing modern artificial intelligence has to a first principle. Every deployed large neural network—every language model, every image classifier, every speech system, every game-playing agent that learned its skill—was trained by computing an error and propagating it backward through the network to adjust the weights. The procedure was made practical for networks of neuron-like units by David Rumelhart, Geoffrey Hinton, and Ronald Williams in a 1986 paper titled “Learning representations by back-propagating errors.” Its mathematics were the chain rule of calculus, old by 1986; its decisive contribution was demonstrating that the procedure actually worked to discover useful internal representations in hidden layers that no human had specified. Because everything a network can do was learned by this single blind procedure of error reduction, the intelligence of the system is real in its effects and blind in its origin—it knows nothing but the gradient of an error it is told to reduce, and it will descend whatever error surface it is handed toward whatever minimum that surface contains, with no regard for whether the destination is one anyone wanted. Understanding backpropagation is understanding large language models at their root: they were not authored; they were trained, and the difference is the whole of the present moment in AI.

In the [YOU] on AI Field Guide

The cycle’s invitation to see the AI transition clearly begins with backpropagation, because it reorganizes the entire question of what these systems are. The public imagination fills the space where this fact should be with vaguer notions: the system was programmed, or fed rules, or assembled its intelligence from curated knowledge. It was not. The loop is: forward pass, error, backward pass, weight update. Repeat billions of times. The result is not designed; it is grown under selective pressure toward the reduction of a loss. This is why the makers of frontier models cannot fully say how their own systems work, and why they are not being coy when they admit it. They built the loss function and the training loop; they did not build, and could not have written by hand, the trillions of learned parameters that result.

Backpropagation also reorganizes the question of AI safety and alignment. The system’s values, blind spots, and failure modes are all functions of what it was trained on and what error it was told to minimize—and nothing else. The algorithm does not know bias from signal; it knows only the gradient. If the data encode human prejudice, the gradient learns the prejudice, because the gradient descends the error surface it is given. The entire discipline of AI alignment is, at bottom, the problem of choosing data and objectives such that the inevitable result of running backpropagation is a system we can live with. That problem is not solved, and backpropagation, by being indifferent to destinations, will not solve it for us.

The reason backpropagation matters to the cycle is not historical. It is because understanding it changes what one expects of these systems and how one holds them accountable. A system grown by error reduction is accountable to its training conditions—the data, the objective, the choices made by the people who set them up. Those choices are human choices, and they are where responsibility lives. Rumelhart’s algorithm does not relieve us of moral responsibility by being mechanical. It returns that responsibility to the humans who specified what counted as error.

Origin

The underlying mathematics—reverse-mode automatic differentiation applied to a chain of functions—was derived independently several times before 1986, by Seppo Linnainmaa in 1970 and Paul Werbos in his 1974 doctoral thesis, among others. What Rumelhart, Hinton, and Williams did was different and, for the history of AI, decisive: they applied it to networks of neuron-like units with nonlinear activation functions, demonstrated empirically that it worked to discover interpretable internal representations in layers no human had labeled, and framed the result in a theory of cognition that made cognitive scientists and AI researchers alike take it seriously.

The wall the algorithm broke was the hidden-unit problem. Single-layer perceptrons from the 1960s could be trained by a simple learning rule, but they were too weak to do interesting things; the moment you added hidden layers with the power to represent complex functions, you had no way to tell the hidden units what they should have done. Backpropagation solved this by computing, for each hidden unit, exactly how much it had contributed to the output error and adjusting it accordingly. The contribution is computed by the chain rule—multiplying derivatives layer by layer backward from the output—and the result is a gradient that tells every parameter, including those buried many layers from the output, which direction to move to reduce the mistake.

From the 1986 paper to a 2025 frontier model, the continuity is exact in structure and differs only in magnitude. A current large model has hundreds of billions of parameters and is trained on trillions of tokens. The loop is identical. The differences are engineering: better optimizers, initialization schemes, normalization techniques, and architectures like the transformer that allow gradients to flow more cleanly through very deep networks. None of these touch the core. Peel back every refinement and the thing doing the learning is backpropagation. The revolution was not a new idea about how to learn. It was Rumelhart’s idea, finally given enough data and computation to show what it could do.

Key Ideas

The Loop. Forward pass: compute the network’s output for a given input. Loss: measure how wrong the output is. Backward pass: use the chain rule to compute the gradient of the loss with respect to every weight in the network—a number for each weight saying which direction would have reduced the error. Update: adjust each weight by a small amount in that direction. Repeat. The simplicity of the loop is almost embarrassing given what it produces. A network of a few neurons trained on a few hundred examples and a network of hundreds of billions of parameters trained on trillions of tokens are doing the same thing, at different scales.

Learned Representations. The crucial consequence of backpropagation for networks with hidden layers is that it discovers internal representations no human specified. As the paper stated, the hidden units “come to represent important features of the task domain.” A vision network discovers edge detectors in its first layers because the problem demands them; a language network develops internal geometry that places semantically similar concepts near each other. These representations are the network’s internal language, invented under the pressure of reducing error, and they are opaque because they were never labeled. This is the root of the interpretability challenge: we are trying to read a language the machine taught itself.

The Gradient Does Not Care. Backpropagation optimizes whatever error surface it is given. If the objective rewards plausible-sounding text over true text, gradient descent optimizes for plausibility, and the model will confabulate fluently. If the training data encode human bias, the gradient learns the bias as efficiently as it learns anything else. The indifference of the algorithm to the content of the destination is both the source of its power—it does not need to be told what features to learn—and the source of every alignment problem. The objectives, the data, and the training choices are entirely on the humans.

Scaling and the Bitter Lesson. Because the learning algorithm has been essentially fixed since 1986, progress in AI has been driven overwhelmingly by scale: more parameters, more data, more compute applied to the same backpropagation. This explains the AI scaling laws—the empirical regularities showing that model capability improves smoothly and predictably as scale increases. The algorithm is a constant; the inputs grow; the capabilities track the inputs. Richard Sutton’s “bitter lesson”—that general methods given sufficient compute reliably beat systems built around human knowledge of the problem—is the empirical confirmation of backpropagation’s reach. Rumelhart supplied the engine in 1986. The last decade supplied the fuel.

In the [YOU] on AI Field Guide

Origin

Key Ideas

Related Entries

Further Reading