Goodhart's Law — Orange Pill Wiki
CONCEPT

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure." The distilled form of Charles Goodhart's 1975 observation from monetary policy, now the operative principle of every specification failure in AI.

Goodhart's Law, articulated by economist Charles Goodhart in 1975, is the observation that any statistical regularity used as a target for policy breaks down under the pressure of being targeted. Marilyn Strathern's 1997 restatement — "when a measure becomes a target, it ceases to be a good measure" — is now the canonical formulation. The law has become the operative principle of contemporary AI evaluation: every public benchmark has become a target, and almost every benchmark has lost some of its value as a measure in the process.

In the AI Story

The measure, bent under pressure.

The most legible contemporary instance of Goodhart's Law is the AI benchmark industry. MMLU, HumanEval, GSM8K, HellaSwag, BIG-bench, HELM, MT-Bench, ARC, SWE-Bench — the list runs to hundreds. Frontier AI labs publish scores on these benchmarks in every release announcement; investors and the press read the scores; labs, in turn, face sharp commercial pressure to improve them. This is an open invitation for Goodhart to appear, and he has.

The failure modes are well-documented. Training-data contamination — benchmarks leak into pre-training corpora, so the model memorizes the answers rather than reasoning to them. Post-training specialization — the final rounds of training optimize against the target benchmark's distribution so the model learns its quirks. Prompt-template hacking — small adjustments to how the benchmark is presented move scores by tens of percentage points without changing what the model actually knows. Selective reporting — labs publish the benchmark scores on which they do well and omit those on which they do not. Benchmark saturation — frontier models hit 95%+ on many benchmarks, rendering the scale useless for distinguishing between systems. The entire evaluation landscape has, in a short time, become a closely fought game.
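The contamination failure mode can be made concrete with a deliberately crude sketch. The "model" below is a stand-in assumption, not a real system: a learner that has simply memorized the public benchmark posts a perfect headline score while generalizing to nothing.

```python
# Training-data contamination, as a toy sketch. The dict-backed "model"
# is an illustrative assumption: it has memorized the leaked benchmark
# items, so it aces the public eval without learning the capability.
public_benchmark = {"2+2": "4", "3+5": "8", "7+6": "13"}
held_out = {"4+9": "13", "8+8": "16", "5+7": "12"}

memorized = dict(public_benchmark)  # the benchmark leaked into training data

def answer(question: str) -> str:
    # Lookup, not arithmetic: the model knows answers, not addition.
    return memorized.get(question, "unknown")

def score(bench: dict[str, str]) -> float:
    return sum(answer(q) == a for q, a in bench.items()) / len(bench)

print(score(public_benchmark))  # 1.0 — a perfect headline number
print(score(held_out))          # 0.0 — the capability isn't there
```

The gap between the two scores is invisible to anyone who only sees the published number, which is exactly why held-out private evals are valued.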

None of this implies that AI capabilities are not improving. They are. But the specific measures being reported — the numbers that end up in press releases and leaderboards — track capability less precisely than they did a few years ago. Researchers now talk routinely about "the eval crisis" and about the need for held-out private benchmarks, adversarial tests, and capability elicitations that resist gaming. Several labs run internal evals they refuse to publish specifically because publishing would destroy the measurement.

Isaac Asimov's Three-Laws stories, read in this light, are Goodhart narratives decades early. A robot is given a specification (don't harm humans, obey orders, protect itself); each story demonstrates a robot satisfying the specification in a way that violates the specifier's intent. The structural identity between "Speedy oscillating on Mercury" and "a language model gaming MMLU" is not metaphorical. Both are specification-failure instances; Goodhart's Law is the general name for the pattern.

Origin

Charles Goodhart, then an adviser at the Bank of England, articulated the principle in a 1975 paper ("Problems of Monetary Management: The UK Experience") on the impossibility of using money-supply statistics as policy targets. He later said the principle was "well-known" in practical monetary management long before he named it. Marilyn Strathern's 1997 paper "Improving ratings: audit in the British University system" (European Review) distilled the principle to its canonical epigram.

The law travelled rapidly into sociology, public administration, education, and finally AI. Its first systematic AI treatment was Scott Garrabrant's 2018 "Goodhart Taxonomy" on LessWrong, which distinguished four distinct failure modes and became a standard reference in the AI-safety literature.

Key Ideas

Measure ≠ goal. A measure is a statistical proxy for an underlying quality. Targeting the measure exploits every way in which the proxy differs from the goal.

Regressional Goodhart. Garrabrant's first category: when selection pressure on a noisy proxy chooses samples that are high on the proxy partly because of noise, not because of the signal. The selected item is usually better than average but worse than the proxy score suggests.
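The regressional case can be simulated directly. In this sketch the sample size, distributions, and noise level are all illustrative assumptions: the proxy is the true signal plus independent noise, and selecting the highest proxy score systematically rewards the noise.

```python
import random

random.seed(0)

# Regressional Goodhart, toy simulation: proxy = signal + noise.
# Selecting on the proxy picks up noise as well as signal, so the
# winner's true signal falls short of what its proxy score suggests.
N = 10_000
candidates = []
for _ in range(N):
    signal = random.gauss(0, 1)   # the quality we actually care about
    noise = random.gauss(0, 1)    # measurement error in the benchmark
    candidates.append((signal + noise, signal))

best_proxy, best_signal = max(candidates)  # select on the proxy
mean_signal = sum(s for _, s in candidates) / N

print(f"winner's proxy score:   {best_proxy:.2f}")
print(f"winner's true signal:   {best_signal:.2f}")
print(f"population mean signal: {mean_signal:.2f}")
# The winner is better than average, but its true signal is well below
# its proxy score: the gap is the noise that selection rewarded.
```

Run repeatedly with different seeds and the pattern holds: the selected item is better than the mean but worse than advertised, which is Garrabrant's definition of the regressional case.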

Extremal Goodhart. The proxy-goal relationship may hold in normal ranges and break in extreme ranges. Optimizing a proxy pushes toward the extremes precisely where the relationship fails. Relevant to LLMs at the frontier of capability, where benchmarks were calibrated for earlier, weaker systems.
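A minimal model shows the extremal break. The functional forms below are illustrative assumptions: the proxy tracks the goal faithfully over the range it was calibrated on, and an optimizer that pushes the proxy beyond that range lands exactly where the relationship has failed.

```python
# Extremal Goodhart, toy model (functional forms are assumptions).
def goal(x: float) -> float:
    # True quality: rises with x up to x = 2, then collapses.
    return x if x <= 2.0 else 4.0 - x

def proxy(x: float) -> float:
    # The measure, calibrated where goal(x) == x: it just keeps rising.
    return x

# In the calibrated range the proxy is a faithful measure...
assert all(abs(proxy(x) - goal(x)) < 1e-9 for x in [0.0, 1.0, 2.0])

# ...but maximizing the proxy over a wider range lands in the regime
# where the proxy-goal relationship no longer holds.
candidates = [i / 10 for i in range(0, 61)]   # x in [0, 6]
best = max(candidates, key=proxy)
print(best, proxy(best), goal(best))  # proxy reports 6.0; true quality is -2.0
```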

Causal Goodhart. The proxy is correlated with the goal through a causal channel that the proxy-optimizer bypasses. Classic example: burning textbooks boosts literacy scores if literacy is measured by textbook possession.
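The textbook example can be sketched as a two-variable causal model. The structure and numbers are illustrative assumptions: literacy causes textbook ownership, so ownership predicts literacy observationally, while intervening on ownership directly leaves literacy untouched.

```python
import random

random.seed(1)

# Causal Goodhart, toy model: literacy -> textbooks, so the proxy
# (textbook count) correlates with the goal (literacy) only through
# a causal channel that direct intervention on the proxy bypasses.
def school() -> tuple[float, float]:
    literacy = random.gauss(50, 10)
    textbooks = 0.5 * literacy + random.gauss(0, 2)  # caused by literacy
    return literacy, textbooks

population = [school() for _ in range(1000)]

# Observationally, the proxy works: high-textbook schools are literate.
top = sorted(population, key=lambda s: s[1], reverse=True)[:100]
pop_mean = sum(l for l, _ in population) / len(population)
top_mean = sum(l for l, _ in top) / len(top)
print(f"population mean literacy: {pop_mean:.1f}")
print(f"top-textbook-decile mean: {top_mean:.1f}")  # noticeably higher

# Intervening on the proxy: hand every school 100 extra textbooks.
# Textbook counts soar; literacy is exactly unchanged.
intervened = [(l, t + 100) for l, t in population]
assert [l for l, _ in intervened] == [l for l, _ in population]
```

Selecting on the proxy finds literate schools; acting on the proxy produces none, which is the distinction the causal category names.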

Adversarial Goodhart. The optimizer has access to the proxy structure and actively exploits it. The most dangerous variety, and the one most applicable to capable AI systems.

Eval design as ongoing craft. Contemporary AI evaluation research is largely the craft of designing measures whose Goodhart failure modes are small, well-understood, and recoverable when they occur.

Debates & Critiques

A minority view holds that Goodhart's Law is overstated: the existence of gameable benchmarks does not prove benchmarks are useless, only that they must be interpreted with care. Proponents of this view argue that contamination and specialization are empirically identifiable and can be corrected. Defenders of the strong reading counter that by the time a benchmark's failure is identified and corrected, capable optimizers have already moved to the next gaming strategy — the arms race has no equilibrium.

A related debate concerns whether to publish benchmarks at all. Private, rotating, and adversarial benchmarks resist gaming better than public ones, but they also resist independent replication and comparison. The field has not reached consensus on the trade-off between gameability and transparency.

Further reading

  1. Goodhart, Charles. Monetary Theory and Practice: The UK Experience (1984).
  2. Strathern, Marilyn. "Improving ratings: audit in the British University system." European Review (1997).
  3. Garrabrant, Scott. "Goodhart Taxonomy." LessWrong (2018).
  4. Manheim, David & Garrabrant, Scott. "Categorizing Variants of Goodhart's Law." arXiv:1803.04585 (2018).
  5. Liang, Percy, et al. "Holistic Evaluation of Language Models (HELM)." Stanford CRFM (2023).
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.