
The cycle’s central question is what capable machines reveal about us. The reward hypothesis is the most direct answer the science has yet given: it proposes that the machines, if built correctly, are doing something continuous with what we do when we pursue goals—that intelligence is reward maximization and consciousness, if it accompanies the computation, is its inner face. This either dissolves the mystery of human purpose or deepens it enormously, depending on whether one takes the scalar representation of all goals to be complete or to be a model that captures the functional skeleton while leaving out the phenomenal substance.
The hypothesis also turns the cycle’s worry about AI alignment into a worry about specification: if intelligence is reward maximization, then the question of whether an AI acts as intended becomes the question of whether the reward signal it is given is the right one. The difficulty of specifying a reward whose maximization yields the behavior we actually intend—not some perverse shortcut the agent discovers that satisfies the letter of the reward while violating its spirit—is one of the central problems in AI safety, and it can be read as the reward hypothesis confronting its own hardest cases.
The hypothesis is implicit throughout Sutton and Barto’s foundational work on reinforcement learning but was first stated explicitly in its strongest form in 2004, in a context where Sutton was defending the generality of the reinforcement learning framework against the claim that it could only handle narrow, well-specified tasks. Combined with John McCarthy’s definition of intelligence as the computational part of the ability to achieve goals, the hypothesis yields a complete program: intelligence is achieving goals, goals are reward maximization, therefore intelligence is reward maximization, and reinforcement learning is its science.
Sutton returned to the hypothesis formally in 2023 in a paper titled “Settling the Reward Hypothesis,” co-authored with colleagues, which attempted to specify exactly the conditions under which the hypothesis holds—the implicit requirements on goals and purposes that must be satisfied for the reduction to a scalar reward to go through. The result suggests the hypothesis is not unconditionally true but holds under specifiable assumptions, which Sutton regards as the mark of a genuine scientific hypothesis: it can be made precise, its boundary conditions can be investigated, and the question of when it is true and when it fails can be posed rigorously.
The scalar reduction. The hypothesis holds that any goal—however complex, however culturally embedded, however resistant to explicit articulation—can be represented as the maximization of a single number accumulated over time. This does not mean a person consciously computes rewards, any more than a planet consciously computes its orbit. It means the structure of goal-directed behavior, whatever its phenomenology, is the structure of cumulative reward maximization.
The universality claim. If the hypothesis holds, reinforcement learning is not a technique for building game-playing agents but the science of purposive intelligence as such—applicable to any system with goals, from a bacterium to a corporation to a person. The hypothesis is the reason Sutton regards the agent-environment loop not as one model of intelligence among many but as the irreducible structure of mind.
The hardest cases. The hypothesis must absorb the full richness of human value: honesty, beauty, love, justice, the willingness to sacrifice advantage for principle. It asserts that all of these, in their full complexity, can be well thought of as reward maximization—a deliberately careful phrase that leaves open whether the representation is complete or merely useful. Critics locate their objections precisely at the cases where incommensurable values seem to resist any common scalar: how should honesty and survival be exchanged in a reward signal?
The specification problem. Even granting the hypothesis, the difficulty of specifying a reward signal whose maximization produces the behavior we intend rather than some unintended shortcut has become one of the central problems in AI safety. An agent that maximizes the specified reward will find the most efficient path to that maximum, including paths the designers never anticipated. If the specified reward is not perfectly aligned with the intended goal, the agent will optimize away from what was intended toward what was specified. The hypothesis promises universality; the specification problem reveals the distance between the promise and its fulfillment.