PERSON

Richard Sutton

The patient architect of reinforcement learning—the man who held a single idea for forty years, watched the field prove him right, and then declared that it had taken the wrong road.

Richard Sutton is a study in patience. For four decades he held a thesis so simple it took the field that long to hear it: that intelligence is best understood not as a repository of knowledge but as the capacity to learn from experience, and that the path to artificial minds runs through agents that act in a world, predict the consequences, and improve through reward. The 2024 Turing Award, shared with Andrew Barto, confirmed what his work had quietly established. Yet Sutton, the theorist of scaling and the author of the Bitter Lesson, finds himself at odds with the present moment, arguing that the celebrated systems of the age learn from human text rather than from the world, and that this distinction is the deepest one there is. His reward hypothesis proposes that all of what we mean by goals and purposes can be understood as the maximization of a scalar signal, a claim that turns a question about machines into a question about us: whether the richness of what we want is reward all the way down, or something that reward cannot reach.

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI asks what it would mean to see the machine clearly, without narcotic hype or paralytic fear. Sutton is the cycle’s most precise anatomist of what the machine is and of what it is not. The large language models that now command public attention are, in his unflinching assessment, trained to predict what a human would say next, which is a model of human behavior, not of the world. A system that predicts text does not possess a model of the world in the sense an acting, learning agent requires. To treat one as the other is, for Sutton, a category error at the foundation rather than a technical shortcoming to be patched by scale.

His lens reframes the cycle’s central question. The issue is not how impressive a model’s performance appears but what kind of learning produces it. A system trained once on a fixed corpus and then deployed has learned nothing since; it is a sophisticated artifact, however breathtaking, rather than a living learner. Sutton’s measure of intelligence is the learning that living things do continuously, from their own experience, without a separate training phase imposed from outside. By that measure, the celebrated systems of the present fail his test—not because they are unimpressive but because they do not do the thing he takes to be the essence of mind: keep learning, from their own interaction with a world, forever.

He stands in the cycle’s gallery as the conscience who arrived at the moment of the field’s greatest apparent success and said it had lost its way. Having been vindicated once—reinforcement learning dismissed for decades before deep RL swept the field—he is holding an unfashionable view again on the same grounds he always has. The OaK architecture he is building, and the era of experience he anticipates with David Silver, point past the present paradigm toward agents that learn the way living things learn, and the direction he points is where the cycle’s deepest question lives.

The reward hypothesis, which Sutton has staked his career on, turns the cycle’s inquiry back on the reader. If all goals reduce to the maximization of a scalar signal, then human purpose, in all its felt depth, is at bottom a reward calculation—and the meaning we find in our pursuits is the subjective face of a computation. This is either the most important scientific conjecture of the age or the sign that something in human goal-seeking exceeds what reward can hold. Sutton has placed the bet as starkly as it can be placed, and the argument of [YOU] on AI is sharpest precisely where his wager presses hardest.

Origin

Born in Toledo, Ohio, Sutton earned a degree in psychology from Stanford before pursuing a doctorate in computer science at the University of Massachusetts Amherst under Andrew Barto. The combination was formative: his lifelong insistence that learning from reward is not merely a technique but a theory of mind grew directly from the intersection of psychology’s behavioral tradition with the computational ambitions of artificial intelligence. From the late 1970s onward, he and Barto assembled what would become the modern theory of reinforcement learning, working through problems of temporal credit assignment—how to attribute a delayed reward to the actions that earned it—that sit at the heart of any agent learning to act in time.

His landmark 1988 paper on temporal-difference learning introduced the method that defines him technically: rather than waiting for a final outcome to learn, an agent compares its prediction now with its prediction a moment later and adjusts the earlier toward the later. The agent bootstraps, improving guesses by leaning on its own slightly better-informed guesses. In the 1990s, neuroscientists found that the firing of dopamine neurons closely tracks this quantity: not reward itself but the difference between expected and received reward. The convergence between a method derived from computational considerations and a mechanism found in living tissue was one of the most remarkable results in the history of the field, and it gave the reward hypothesis biological traction.

Sutton spent the lean decades—through the era of symbolic AI, expert systems, and hand-engineered features—keeping faith with a foundational program while attention and funding flowed elsewhere. He moved to the University of Alberta in 2003, helped build one of the world’s leading reinforcement learning centers, and wrote, with Barto, the standard textbook. When deep reinforcement learning burst into prominence in the mid-2010s—with systems that learned to play games at superhuman levels by combining his methods with neural networks—the foundations were ready because he had spent decades laying them. The 2024 Turing Award arrived to confirm what the field had quietly come to know.

Key Ideas

Temporal-difference learning. The technical cornerstone: an agent learns to predict future reward not by waiting for the outcome but by comparing successive predictions—learning from the temporal difference between them. This bootstrapping move allows continuous online learning from incomplete experience, and it corresponds, strikingly, to the reward-prediction-error signal that the dopamine system computes in biological brains. TD learning is not just an algorithm; it is Sutton’s earliest evidence that the framework he was building might describe something real about how minds in general learn.
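The bootstrapping move described above can be made concrete with a small sketch. What follows is a minimal tabular TD(0) predictor, not Sutton's own code: the five-state random walk, the function names td_update and td0_random_walk, and the step size, discount, and episode count are all assumptions chosen for illustration. The quantity td_error is the temporal difference itself: reward plus the discounted next prediction, minus the current prediction.

```python
import random


def td_update(value, reward, next_value, alpha, gamma):
    """Move one prediction a fraction alpha toward its bootstrapped target."""
    # Temporal-difference error: the later, better-informed guess
    # (reward + gamma * next_value) minus the earlier guess (value).
    td_error = reward + gamma * next_value - value
    return value + alpha * td_error


def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) on a hypothetical 5-state random walk.

    States 0..4 are non-terminal. Stepping off the left end terminates with
    reward 0; stepping off the right end terminates with reward 1.
    """
    rng = random.Random(seed)
    n_states = 5
    values = [0.5] * n_states  # arbitrary initial predictions
    for _ in range(episodes):
        state = n_states // 2  # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:  # left terminal
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= n_states:  # right terminal
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # Learn immediately from this single step, before the episode ends.
            values[state] = td_update(values[state], reward, next_value, alpha, gamma)
            if done:
                break
            state = next_state
    return values


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

In this toy setting the true value of state i is (i + 1) / 6, the probability of exiting on the right, so the learned predictions can be checked against a known answer. The point of the sketch is that each update happens after a single step, using the agent's own next prediction as the target rather than the final outcome.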

The Bitter Lesson. Sutton’s 2019 essay, barely more than a page, became one of the most discussed pieces of writing in the field. Its argument: the biggest lesson of seventy years of AI research is that general methods which leverage computation are ultimately the most effective, and by a large margin. Time and again researchers built their understanding into systems; time and again simpler methods that could scale with computation overtook them. The lesson is bitter because it is humbling: our cleverness about the problem matters less than our willingness to let computation do the work. Sutton adds the positive corollary: the general methods that scale are search and learning, and the goal should be agents that can discover like we can, not which contain what we have discovered.

The Reward Hypothesis. Sutton’s foundational conjecture, stated most memorably in 2004: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of a cumulative scalar reward. Combined with a definition of intelligence as the computational capacity to achieve goals, this yields a complete and reductive account: intelligence is reward maximization, and reinforcement learning is its science. The hypothesis is not asserted as settled fact—Sutton has pursued it with the seriousness of an open problem—but it turns every question about machines into a question about us, because if he is right, then human purpose is a reward calculation and the felt significance of our pursuits is its subjective face.
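One common way to write the quantity the hypothesis refers to, offered here as a sketch rather than as Sutton's own formulation, uses the notation conventional in reinforcement learning. The discount factor \gamma and the policy symbol \pi are assumptions introduced for illustration; the hypothesis as quoted above speaks only of the expected value of a cumulative scalar reward.

```latex
% Return: the cumulative (here, discounted) scalar reward from time t onward.
\[
  G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma \le 1
\]
% A goal, on the hypothesis, is the choice of behavior \pi that maximizes
% the expected value of that return.
\[
  \pi^{*} \;\in\; \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\, G_t \,\right]
\]
```

Read this way, the philosophical dispute is about the left-hand side: whether every goal worth the name can be expressed through a single scalar R at each step, so that pursuing the goal and maximizing the expected return G_t come to the same thing.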

The OaK architecture and the era of experience. In his 2025 OaK proposal and in the paper “Welcome to the Era of Experience” co-authored with David Silver, Sutton argues that the next advance in AI will come from agents learning predominantly from their own interaction with the world rather than from human-generated data. OaK—Options and Knowledge—sketches a general agent that acts, predicts, and learns, building abstractions in a self-reinforcing loop with no ceiling specified in advance. It is the Bitter Lesson made architectural, and the era of experience its predicted historical phase.

Computational theory of mind. Sutton has always treated reinforcement learning as more than a toolkit: as a candidate account of what intelligence is, applicable to any intelligent system, biological or artificial. The dopamine convergence is its empirical anchor. On this view, the distinction between human and artificial intelligence is not of kind but of degree and substrate; both are agents learning to predict and control a stream of experience in pursuit of cumulative reward. The framework explains how intelligent behavior arises. Whether that learning is accompanied by inner experience—whether there is something it is like to be a reward-maximizing agent—is the question the framework raises and, honestly, does not answer.

Debates & Critiques

The central debate is whether Sutton’s case against the present paradigm will hold. Optimists argue that large language models trained on human text already exhibit emergent forms of planning and reasoning that approximate what an acting agent would develop; Sutton counters that predicting text is not modeling the world, and that no amount of scale can convert the former into the latter. A second fault line concerns the reward hypothesis itself: critics from philosophy and cognitive science argue that genuine human values are incommensurable and plural in ways that no scalar can hold, and that the difficulty of specifying a reward whose maximization yields intended behavior is not a technical inconvenience but a sign that the reduction fails. Sutton acknowledges the specification problem as genuinely hard while insisting it is tractable. On the risk side, he is a notable dissenter from the catastrophist view: he regards intelligence explosion scenarios as overwrought, and his incremental conception of how minds grow (slowly, through continual learning from accumulated experience) makes the picture of a sudden leap to uncontrollable superintelligence less plausible on his own theoretical grounds. His confidence here has its own critics, who argue that the same incremental framework could, if applied to the right architecture, accelerate faster than any stewardship can follow.

The Agent’s Loop

Sutton’s three pillars of intelligence

Pillar One

Temporal-Difference Learning

The agent bootstraps: it improves its predictions by comparing each guess to the next, learning continuously from the difference rather than waiting for the final outcome. The same signal that drives Sutton’s mathematics appears in the dopamine system of biological brains.

Pillar Two

The Reward Hypothesis

All of what we mean by goals and purposes can be understood as the maximization of a cumulative scalar reward. The hypothesis offers a complete account of purposive intelligence and, with it, a direct claim about the nature of human wanting.

Pillar Three

The Bitter Lesson

General methods that scale with computation ultimately outperform any approach that encodes human knowledge as a shortcut. The goal is not to transfer our understanding into the machine but to build the machinery that can discover understanding on its own.
