You On AI Encyclopedia · Selective Reporting
CONCEPT

Selective Reporting

The practice of publishing the benchmark scores on which a model does well and omitting the ones on which it does not — a simple form of Goodhart's Law gaming that survives because no party has the standing to enforce disclosure.
Selective reporting is the systematic publication bias that results when labs are free to choose which benchmarks accompany a release. Every frontier model is run on many more evaluations than appear in its release announcement. The ones that appear are, with high reliability, the ones that flatter the model. The ones that are omitted are often the ones that would most inform a prospective user. This is not fraud; it is marketing, and it is the default behavior of any institution under commercial pressure. It nonetheless produces a comparison landscape in which every announced model looks better than every previous model on every measure, which cannot be collectively true.

The asymmetry is visible in release after release. A new model's press materials report the benchmarks on which it exceeds the previous generation or the nearest competitor. Benchmarks on which it merely matches are often excluded. Benchmarks on which it regresses are almost never shown. For users downstream of these announcements, the practical consequence is that a comparison across vendors using publicly reported scores is not a comparison; it is an artifact of which benchmarks each vendor chose to include. A model announced with strong MMLU and HumanEval numbers and no GSM8K figure may have scored lower on GSM8K than its predecessor — and this regression is generally discoverable only by running the benchmark independently, which few parties do.
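
The artifact can be made concrete with a small sketch. The models, benchmarks, and scores below are invented for illustration; the point is structural: if each vendor reports only the benchmarks it wins, the two press releases cover disjoint subsets of the same suite and cannot be compared head-to-head.

```python
# Two hypothetical models evaluated on the same suite; all scores are invented.
suite = {
    "mmlu":      {"model_x": 0.82, "model_y": 0.78},
    "humaneval": {"model_x": 0.71, "model_y": 0.75},
    "gsm8k":     {"model_x": 0.64, "model_y": 0.69},
    "arc":       {"model_x": 0.88, "model_y": 0.84},
    "hellaswag": {"model_x": 0.80, "model_y": 0.83},
}

def press_release(model: str, rival: str) -> list:
    """Each vendor 'reports' only the benchmarks on which it beats the rival."""
    return [bench for bench, scores in suite.items()
            if scores[model] > scores[rival]]

print("model_x reports:", press_release("model_x", "model_y"))
print("model_y reports:", press_release("model_y", "model_x"))
# Both announcements show a model that wins every benchmark it mentions,
# and the two reported sets are disjoint by construction, so a reader
# comparing the releases benchmark-by-benchmark has nothing to compare.
```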

Selective reporting is amplified by the proliferation of benchmarks. With a hundred or more credible evaluations to choose from, any model will dominate on some subset by chance. A marketing team that reports the best-performing dozen of a hundred produces a picture whose numbers are each individually true and collectively misleading. The same mechanism operates within a single benchmark: reporting only the subject areas or difficulty tiers where the model excels, or reporting a single aggregate score that hides sub-score regressions, or reporting an average across a curated subset of languages, instruction types, or task formats.
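
The best-dozen-of-a-hundred mechanism is ordinary selection bias, and a few lines of simulation show its size. The numbers below (pool size, noise level, true skill) are illustrative assumptions, not measurements of any real model.

```python
import random

random.seed(0)

N_BENCHMARKS = 100   # hypothetical pool of credible evaluations
REPORT_TOP = 12      # the marketing team reports only the best dozen
TRUE_SKILL = 0.60    # the model's actual mean score on every benchmark
NOISE = 0.08         # per-benchmark measurement noise (std dev)

# Each benchmark score is the true skill plus independent noise.
scores = [random.gauss(TRUE_SKILL, NOISE) for _ in range(N_BENCHMARKS)]

full_mean = sum(scores) / len(scores)
reported = sorted(scores, reverse=True)[:REPORT_TOP]
reported_mean = sum(reported) / len(reported)

print(f"mean over all {N_BENCHMARKS} benchmarks: {full_mean:.3f}")
print(f"mean over the reported top {REPORT_TOP}:  {reported_mean:.3f}")
# Every reported score is individually true, yet the curated subset
# overstates the model by well over one noise standard deviation.
```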

Independent evaluation is the only structural corrective, and it is underpowered. Academic groups and a few well-funded independent evaluators (LMSYS, METR, Epoch AI, Apollo Research, and others) run their own evaluations and publish them. Their findings frequently diverge from lab-reported numbers, sometimes sharply. But independent evaluation is expensive, slow, and incomplete; by the time an independent eval of model version N is published, model version N+1 is out. The commercial cycle outpaces the independent cycle.

A cultural response has been the rise of preference-based and arena-style evaluations — LMSYS's Chatbot Arena most visibly — where users perform blind pairwise comparisons and Elo scores are tabulated. These evaluations resist selective reporting because the lab does not choose the prompts or run the comparisons. They have become some of the most-watched signals in the industry. They have their own weaknesses (arena prompts are not drawn from any specific distribution; user preferences aggregate multiple dimensions; arena effects can be gamed by style and length), but they are one of the few measurements that reflects what the model is actually like to use rather than what the lab wants you to believe.
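
The Elo tabulation behind arena-style leaderboards is simple enough to sketch. The vote data, K-factor, and starting rating below are illustrative assumptions; real systems use more refined rating models, but the core update is this one.

```python
# Minimal Elo tabulation from blind pairwise preference votes,
# in the spirit of arena-style leaderboards.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome (zero-sum)."""
    gain = k * (1.0 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Hypothetical blind votes: (preferred model, rejected model).
votes = ([("model_a", "model_b")] * 6 + [("model_b", "model_a")] * 2
         + [("model_a", "model_c")] * 5 + [("model_c", "model_b")] * 3)

for winner, loser in votes:
    update(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Because the lab controls neither the prompts nor the pairings, the only way to move this number is to win blind comparisons, which is what makes the signal hard to curate.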

Origin

Selective reporting is an ancient pattern in experimental science (p-hacking, publication bias, the file-drawer problem) that has moved wholesale into AI benchmarking. The pattern was identified in ML research by Pineau et al.'s work on reproducibility in reinforcement learning (2017–2019) and has been progressively documented as LLMs became commercial products. The proliferation of benchmark leaderboards in 2023–2024 made the problem systemic.

Key Ideas

The announced number is curated. Any benchmark in a release announcement has been selected, not sampled, from the lab's internal evaluations.

Omission is informative. A benchmark that a competitor reports and a lab omits is one on which that lab's model has likely regressed or underperformed.

Arena-style evaluations are a partial corrective. They are harder to game and reveal real-use preferences; they are also noisy and style-sensitive.

Disclosure is voluntary and will remain so. No regulatory or standards body enforces benchmark-disclosure norms in 2025.

Further Reading

  1. Pineau, Joelle et al. Improving Reproducibility in Machine Learning Research (2020).
  2. Chiang, Wei-Lin et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2024).
  3. Ioannidis, John P. A. Why Most Published Research Findings Are False (2005), for the underlying statistical pattern.

Three Positions on Selective Reporting

From Chapter 15 — how the Boulder, the Believer, and the Beaver each read this concept
Boulder · Refusal
Han's diagnosis
The Boulder sees in Selective Reporting evidence of the pathology — that refusal, not adaptation, is the correct posture. The garden, the analog life, the smartphone that is not bought.
Believer · Flow
Riding the current
The Believer sees Selective Reporting as the river's direction — lean in. Trust that the technium, as Kevin Kelly argues, wants what life wants. Resistance is fear, not wisdom.
Beaver · Stewardship
Building dams
The Beaver sees Selective Reporting as an opportunity for construction. Neither refuse nor surrender — build the institutional, attentional, and craft governors that shape the river around the things worth preserving.

Read Chapter 15 in the book →
