The scale of the problem is easy to underestimate. A frontier pre-training corpus is around fifteen trillion tokens. It contains much of the public web, a large share of arXiv, most of GitHub, multiple book corpora, and — inevitably — every benchmark dataset that has ever been uploaded, discussed on Stack Overflow, pasted into a blog post, or referenced in a research paper. MMLU, HumanEval, GSM8K, MATH, BIG-bench, ARC — all of them are present. So are their solutions, their error analyses, and their leaderboards. A model trained on this corpus has seen each benchmark question many times, often alongside the correct answer, often with commentary explaining why the answer is correct.
The empirical evidence for contamination is overwhelming for older benchmarks and strongly suggestive for newer ones. When researchers test models on "held-out" versions of benchmarks — versions whose questions were written after the model's training cutoff and have never appeared on the web — scores drop, often by large margins. Studies of Chinchilla, GPT-4, Claude, Gemini, and Llama variants have found that models pass canary-detection tests (probes designed to check whether a specific item was in the training data) at rates well above chance. Papers by Sainz et al., Oren et al., and Dodge et al. have documented both the prevalence of contamination and the performance inflation it produces.
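One common family of probes prompts the model with the first portion of a benchmark item and checks whether it reproduces the held-out remainder verbatim; a contaminated model completes at rates well above chance. The sketch below illustrates the logic only: the `complete` callable stands in for a real model API, and the toy stub and example items are invented for demonstration.

```python
# Sketch of a verbatim-completion contamination probe. The `complete`
# callable is a stand-in for any model API; the toy stub below simulates
# a model that has memorized one of the two items.

def completion_probe(items, complete, prefix_frac=0.5):
    """Fraction of items whose held-out suffix the model reproduces
    verbatim when prompted with only the prefix."""
    hits = 0
    for text in items:
        words = text.split()
        cut = max(1, int(len(words) * prefix_frac))
        prefix = " ".join(words[:cut])
        suffix = " ".join(words[cut:])
        if complete(prefix).strip().startswith(suffix):
            hits += 1
    return hits / len(items)

# Toy stub: behaves as if exactly one item were in its training data.
MEMORIZED = "The capital of France is Paris and its river is the Seine"

def toy_model(prompt):
    if MEMORIZED.startswith(prompt):
        return MEMORIZED[len(prompt):]
    return "I do not recall this text."

items = [MEMORIZED,
         "The capital of Peru is Lima and its currency is the sol"]
print(completion_probe(items, toy_model))  # 0.5: one of two items recalled
```

A real probe would drive an actual model endpoint and use many items, but the decision rule — prefix in, exact suffix out — is the same.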
Contamination takes several forms. Direct inclusion: the exact benchmark item is in the training corpus, verbatim. Paraphrased inclusion: a Stack Exchange discussion or blog post contains the problem restated in different words with the answer attached. Solution-pattern inclusion: the benchmark itself is not present but problems with very similar structure are, so the model learns the benchmark's stylistic conventions and answer formats. Meta-contamination: the benchmark is not in the training data but a paper analyzing its failure modes, with representative examples, is.
Mitigations exist, and none is fully satisfactory. Held-out private benchmarks (LiveCodeBench, the private subset of SWE-Bench, newly authored questions written after the cutoff) work until they are used often enough that researchers infer their structure. Canary strings embedded in benchmark releases let authors detect contamination after the fact but do not prevent it. Decontamination pipelines that remove n-gram matches from training data reduce verbatim contamination but miss paraphrases. Dynamic benchmarks — evaluations whose content changes on each run — are promising but more expensive to build. The most honest labs run multiple held-out evaluations whose content they refuse to publish, explicitly because publication would destroy the measurement.
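An n-gram decontamination pass can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: the window size, the threshold of "any shared n-gram," and the toy data are all assumptions for demonstration.

```python
# Minimal n-gram decontamination sketch: drop any training document that
# shares a word n-gram with a benchmark item. Window size and toy data
# are illustrative assumptions, not a specific production pipeline.

def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs, benchmark_items, n=8):
    """Return training docs with no word n-gram overlap with any item."""
    contaminated = set()
    for item in benchmark_items:
        contaminated |= ngrams(item, n)
    return [doc for doc in train_docs
            if not (ngrams(doc, n) & contaminated)]

bench = ["What is the smallest prime greater than 100? Answer: 101"]
docs = [
    "Quiz: What is the smallest prime greater than 100? Answer: 101.",
    "Find the least prime above one hundred; the answer turns out to be 101.",
]
clean = decontaminate(docs, bench, n=5)
print(len(clean))  # 1: the verbatim copy is removed
```

Note what survives: the paraphrased restatement of the same problem passes the filter untouched, which is exactly the weakness described above.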
The concept of train-test contamination is as old as machine learning itself — it is the first warning in every introductory course. What is new is the scale at which it matters. Before 2020, the train set and the test set for a given benchmark were separate files maintained by the benchmark's authors; keeping them apart was a matter of discipline. In the pre-training era, the "train set" became "approximately the entire public internet," and the test set became a file that, having been released publicly, is also somewhere in the train set. The first high-visibility contamination analysis of a frontier model appeared in Brown et al. (2020), the GPT-3 paper, which openly reported its contamination study and acknowledged that several benchmarks were likely affected. The issue has only grown in salience since.
Public + large = contaminated. Any benchmark released on the open web is, with near-certainty, in the training corpus of any frontier model trained after its release.
Verbatim detection is easy; paraphrase detection is hard. N-gram filters catch direct copies and miss restated versions, which contribute most of the inflation.
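The asymmetry is easy to see with a Jaccard overlap on word trigrams, a common similarity proxy; the trigram size and the example sentences here are illustrative assumptions. The verbatim copy scores a perfect match, while a faithful restatement of the same problem scores zero.

```python
# Jaccard overlap of word trigrams: a toy proxy for an n-gram
# contamination filter. Trigram size and examples are illustrative.

def trigram_jaccard(a, b):
    def grams(t):
        w = t.lower().split()
        return {tuple(w[i:i + 3]) for i in range(len(w) - 2)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

item = "A train travels 60 miles in 90 minutes. What is its speed in mph?"
verbatim = "A train travels 60 miles in 90 minutes. What is its speed in mph?"
paraphrase = ("If a locomotive covers sixty miles over an hour and a half, "
              "how fast is it going?")

print(trigram_jaccard(item, verbatim))    # 1.0: flagged by the filter
print(trigram_jaccard(item, paraphrase))  # 0.0: sails straight through
```

The paraphrase poses the identical question, yet shares not a single trigram with the original, so any filter built on surface n-gram overlap is blind to it.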
Contamination inflates scores asymmetrically. Easy items the model would have gotten anyway are unaffected; hard items shift from "failed reasoning" to "recalled answer," which is where the inflation lives.
The eval crisis is structural, not malicious. No lab sets out to contaminate; contamination is the default outcome of training on the open web.