CONCEPT

Selective Reporting

The practice of publishing the benchmark scores on which a model does well and omitting the ones on which it does not — a simple form of Goodhart's Law gaming that survives because no party has the standing to enforce disclosure.

Selective reporting is the systematic publication bias that results when labs are free to choose which benchmarks accompany a release. Every frontier model is run on many more evaluations than appear in its release announcement. The ones that appear are, with high reliability, the ones that flatter the model. The ones that are omitted are often the ones that would most inform a prospective user. This is not fraud; it is marketing, and it is the default behavior of any institution under commercial pressure. It nonetheless produces a comparison landscape in which every announced model looks better than every previous model on every measure, which cannot be collectively true.

In the AI Story

Selective reporting
What makes it into the release.

The asymmetry is visible in release after release. A new model's press materials report the benchmarks on which it exceeds the previous generation or the nearest competitor. Benchmarks on which it merely matches are often excluded. Benchmarks on which it regresses are almost never shown. For users downstream of these announcements, the practical consequence is that a comparison across vendors built from publicly reported scores is not a comparison; it is an artifact of which benchmarks each vendor chose to include. A model announced with strong MMLU and HumanEval numbers and no GSM8K figure may have scored lower on GSM8K than its predecessor — and that regression is generally discoverable only by running the benchmark independently, which few parties do.
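A toy illustration of the point (all vendor names and scores below are invented): assemble a comparison table from two hypothetical announcements, and the only benchmarks you can actually compare are the ones both vendors happened to include.

```python
# Hypothetical press-release numbers (invented for illustration): each vendor
# announces only the benchmarks it chose to include, so the cross-vendor
# "comparison" reduces to whatever overlap happens to exist.
reported = {
    "vendor_a": {"MMLU": 88.1, "HumanEval": 90.4},   # GSM8K not reported
    "vendor_b": {"MMLU": 87.5, "GSM8K": 95.2},       # HumanEval not reported
}

mentioned = set().union(*reported.values())          # every benchmark anyone cites
comparable = [b for b in sorted(mentioned)
              if all(b in scores for scores in reported.values())]

print("Benchmarks mentioned in either announcement:", sorted(mentioned))
print("Benchmarks you can actually compare:", comparable)  # only MMLU
# Everything outside the overlap is unknown, not necessarily bad -- but the
# announcements read as if the omitted numbers simply do not matter.
```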

Selective reporting is amplified by the proliferation of benchmarks. With a hundred or more credible evaluations to choose from, any model will dominate on some subset by chance. A marketing team that reports the best-performing dozen of a hundred produces a set of numbers that are each individually true and collectively misleading. The same mechanism operates within a single benchmark: reporting only the subject areas or difficulty tiers where the model excels, or reporting a single aggregate score that hides sub-score regressions, or reporting an average across a curated subset of languages, instruction types, or task formats.
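The selection effect is easy to simulate. The sketch below is an illustration with invented settings, not a measurement of any real model: it assumes a new model that is exactly as good as its predecessor on a hundred benchmarks, with per-benchmark evaluation noise, and shows that reporting the best dozen margins still manufactures an apparent improvement.

```python
# Simulation of a pure selection effect (illustrative settings, no real models).
# Assume the new model's true improvement over its predecessor is zero on every
# benchmark; observed per-benchmark margins are nothing but evaluation noise.
import random

random.seed(0)

N_BENCHMARKS = 100   # credible evaluations available to choose from
REPORTED = 12        # the "best-performing dozen" the marketing team publishes
NOISE_SD = 2.0       # per-benchmark noise in the observed margin, in points

margins = [random.gauss(0.0, NOISE_SD) for _ in range(N_BENCHMARKS)]

overall_mean = sum(margins) / len(margins)
best = sorted(margins, reverse=True)[:REPORTED]
reported_mean = sum(best) / len(best)

print(f"Mean margin across all {N_BENCHMARKS} benchmarks: {overall_mean:+.2f} points")
print(f"Mean margin across the reported {REPORTED}:       {reported_mean:+.2f} points")
# With these settings the reported subset typically shows roughly +3 points of
# "improvement" even though the true improvement is zero: each reported number
# is individually true, and the selection is what misleads.
```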

Independent evaluation is the only structural corrective, and it is underpowered. Academic groups and a few well-funded independent evaluators (LMSYS, METR, Epoch AI, Apollo Research, and others) run their own evaluations and publish them. Their findings frequently diverge from lab-reported numbers, sometimes sharply. But independent evaluation is expensive, slow, and incomplete; by the time an independent eval of model version N is published, model version N+1 is out. The commercial cycle outpaces the independent cycle.

A cultural response has been the rise of preference-based and arena-style evaluations — LMSYS's Chatbot Arena most visibly — where users perform blind pairwise comparisons and Elo scores are tabulated. These evaluations resist selective reporting because the lab does not choose the prompts or run the comparisons. They have become some of the most-watched signals in the industry. They have their own weaknesses (arena prompts are not drawn from any specific distribution; user preferences aggregate multiple dimensions; arena rankings can be gamed by style and length), but they are one of the few measurements that reflect what the model is actually like to use rather than what the lab wants you to believe.
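For concreteness, the sketch below shows a minimal online Elo update driven by blind pairwise votes. It illustrates the tabulation idea only: the model names and votes are invented, and the live Chatbot Arena leaderboard fits ratings statistically rather than updating them one vote at a time.

```python
# Minimal online Elo tabulation over blind pairwise votes (illustration only;
# model names and votes are invented, and this is not LMSYS's implementation).
from collections import defaultdict

K = 32  # update step size; larger K reacts faster but is noisier

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-model probability that the model rated r_a is preferred over r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings, model_a, model_b, a_won: bool):
    """Update both ratings from a single blind pairwise preference vote."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    e_b = 1.0 - e_a                       # expectations of a pair sum to 1
    s_a, s_b = (1.0, 0.0) if a_won else (0.0, 1.0)
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * (s_b - e_b)

# Hypothetical vote log: (model shown as A, model shown as B, was A preferred?)
votes = [
    ("model-x", "model-y", True),
    ("model-x", "model-z", True),
    ("model-y", "model-z", True),
    ("model-z", "model-x", False),
]

ratings = defaultdict(lambda: 1000.0)     # every model starts at the same rating
for a, b, a_won in votes:
    record_vote(ratings, a, b, a_won)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```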

Origin

Selective reporting is an ancient pattern in experimental science (p-hacking, publication bias, the file-drawer problem) that has moved wholesale into AI benchmarking. The pattern was identified in ML research by Pineau et al.'s work on reproducibility in reinforcement learning (2017–2019) and has been progressively documented as LLMs became commercial products. The proliferation of benchmark leaderboards in 2023–2024 made the problem systemic.

Key Ideas

The announced number is curated. Any benchmark in a release announcement has been selected, not sampled, from the lab's internal evaluations.

Omission is informative. When a lab omits a benchmark that competitors routinely report, the model's score on it has likely stagnated or regressed.

Arena-style evaluations are a partial corrective. They are harder to game and reveal real-use preferences; they are also noisy and style-sensitive.

Disclosure is voluntary and will remain so. No regulatory or standards body enforces benchmark-disclosure norms as of 2025.

Appears in the Orange Pill Cycle

Further reading

  1. Pineau, Joelle et al. Improving Reproducibility in Machine Learning Research (2020).
  2. Chiang, Wei-Lin et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2024).
  3. Ioannidis, John P. A. Why Most Published Research Findings Are False (2005), for the underlying statistical pattern.
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.