You On AI Field Guide · The Reward Hypothesis The You On AI Field Guide Home
TxtLowMedHigh
CONCEPT

The Reward Hypothesis

Sutton’s foundational conjecture that all of what we mean by goals and purposes can be understood as the maximization of the expected value of a cumulative scalar signal—a claim that turns a question about machines into a question about the nature of human wanting.
Beneath all of Richard Sutton’s technical work lies a single audacious conjecture, stated most memorably in 2004 and called the reward hypothesis: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal called reward. The claim is the philosophical keystone of reinforcement learning, because the entire edifice depends on it. If goals reduce to reward maximization, then an agent that learns to maximize reward learns, in principle, to pursue any goal, and the spare agent-environment loop of temporal-difference learning is not a narrow tool but a universal account of purposive intelligence. The hypothesis is reductive in the strong sense: it proposes that the apparent multiplicity of human purposes is, at the level of mechanism, a single kind of thing. Sutton does not assert it as settled fact but pursues it with the seriousness of an open problem, specifying its boundary conditions in a 2023 paper and acknowledging that it holds under specifiable assumptions rather than unconditionally. Its deepest implication is not technical but existential: if he is right, then human purpose, in all its felt depth, is a reward calculation, and the meaning we find in our pursuits is the subjective face of a computation. Whether that is deflationary depends on whether one thinks the subjective face is all the meaning there ever was.
The Reward Hypothesis
The Reward Hypothesis

In the [YOU] on AI Field Guide

The cycle’s central question is what capable machines reveal about us. The reward hypothesis is the most direct answer the science has yet given: it proposes that the machines, if built correctly, are doing something continuous with what we do when we pursue goals—that intelligence is reward maximization and consciousness, if it accompanies the computation, is its inner face. This either dissolves the mystery of human purpose or deepens it enormously, depending on whether one takes the scalar representation of all goals to be complete or to be a model that captures the functional skeleton while leaving out the phenomenal substance.

The hypothesis also turns the cycle’s worry about AI alignment into a worry about specification: if intelligence is reward maximization, then the question of whether an AI acts as intended becomes the question of whether the reward signal it is given is the right one. The difficulty of specifying a reward whose maximization yields the behavior we actually intend—not some perverse shortcut the agent discovers that satisfies the letter of the reward while violating its spirit—is one of the central problems in AI safety, and it can be read as the reward hypothesis confronting its own hardest cases.

Origin

The hypothesis is implicit throughout Sutton and Barto’s foundational work on reinforcement learning but was first stated explicitly in its strongest form in 2004, in a context where Sutton was defending the generality of the reinforcement learning framework against the claim that it could only handle narrow, well-specified tasks. Combined with John McCarthy’s definition of intelligence as the computational part of the ability to achieve goals, the hypothesis yields a complete program: intelligence is achieving goals, goals are reward maximization, therefore intelligence is reward maximization, and reinforcement learning is its science.

Sutton returned to the hypothesis formally in 2023 in a paper titled “Settling the Reward Hypothesis,” co-authored with colleagues, which attempted to specify exactly the conditions under which the hypothesis holds—the implicit requirements on goals and purposes that must be satisfied for the reduction to a scalar reward to go through. The result suggests the hypothesis is not unconditionally true but holds under specifiable assumptions, which Sutton regards as the mark of a genuine scientific hypothesis: it can be made precise, its boundary conditions can be investigated, and the question of when it is true and when it fails can be posed rigorously.

Key Ideas

The scalar reduction. The hypothesis holds that any goal—however complex, however culturally embedded, however resistant to explicit articulation—can be represented as the maximization of a single number accumulated over time. This does not mean a person consciously computes rewards, any more than a planet consciously computes its orbit. It means the structure of goal-directed behavior, whatever its phenomenology, is the structure of cumulative reward maximization.

The universality claim. If the hypothesis holds, reinforcement learning is not a technique for building game-playing agents but the science of purposive intelligence as such—applicable to any system with goals, from a bacterium to a corporation to a person. The hypothesis is the reason Sutton regards the agent-environment loop not as one model of intelligence among many but as the irreducible structure of mind.

The hardest cases. The hypothesis must absorb the full richness of human value: honesty, beauty, love, justice, the willingness to sacrifice advantage for principle. It asserts that all of these, in their full complexity, can be well thought of as reward maximization—a deliberately careful phrase that leaves open whether the representation is complete or merely useful. Critics locate their objections precisely at the cases where incommensurable values seem to resist any common scalar: how should honesty and survival be exchanged in a reward signal?

The specification problem. Even granting the hypothesis, the difficulty of specifying a reward signal whose maximization produces the behavior we intend rather than some unintended shortcut has become one of the central problems in AI safety. An agent that maximizes the specified reward will find the most efficient path to that maximum, including paths the designers never anticipated. If the specified reward is not perfectly aligned with the intended goal, the agent will optimize away from what was intended toward what was specified. The hypothesis promises universality; the specification problem reveals the distance between the promise and its fulfillment.

Further Reading

  1. Richard S. Sutton & Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd ed. (MIT Press, 2018), ch. 1
  2. Richard Sutton et al., “Settling the Reward Hypothesis,” Proceedings of the 40th ICML (2023)
  3. Stuart Armstrong, “Utility Indifference,” in AAAI Workshop on AI and Ethics (2015) — on specification
  4. Brian Christian, The Alignment Problem (Norton, 2020), ch. 3 — the specification problem in practice
Explore more
Browse the full You On AI Field Guide — over 8,500 entries
← Home0%
CONCEPTBook →