The scale of the problem is easy to underestimate. A frontier pre-training corpus is around fifteen trillion tokens. It contains much of the public web, a large share of arXiv, most of GitHub, multiple book corpora, and — inevitably — every benchmark dataset that has ever been uploaded, discussed on Stack Overflow, pasted into a blog post, or referenced in a research paper. MMLU, HumanEval, GSM8K, MATH, BIG-bench, ARC — all of them are present. So are their solutions, their error analyses, and their leaderboards. A model trained on this corpus has seen each benchmark question many times, often alongside the correct answer, often with commentary explaining why the answer is correct.
The empirical evidence for contamination is overwhelming for older benchmarks and strongly suggestive for newer ones. When researchers test models on "held-out" versions of benchmarks — versions whose questions were written after the model's training cutoff and have never appeared on the web — scores drop, often by large margins. Studies of Chinchilla, GPT-4, Claude, Gemini, and Llama variants have found that models pass canary-detection tests (probes designed to check whether a specific item was in the training data) at rates well above chance. Papers by Sainz et al., Oren et al., and Dodge et al. have documented both the prevalence of contamination and the performance inflation it produces.
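One common family of probes prompts the model with the first portion of a benchmark item and checks whether it reproduces the held-out remainder verbatim; a contaminated model completes at rates well above chance. The sketch below illustrates the logic only: the `complete` callable stands in for a real model API, and the toy stub and example items are invented for demonstration.

```python
# Sketch of a verbatim-completion contamination probe. The `complete`
# callable is a stand-in for any model API; the toy stub below simulates
# a model that has memorized one of the two items.

def completion_probe(items, complete, prefix_frac=0.5):
    """Fraction of items whose held-out suffix the model reproduces
    verbatim when prompted with only the prefix."""
    hits = 0
    for text in items:
        words = text.split()
        cut = max(1, int(len(words) * prefix_frac))
        prefix = " ".join(words[:cut])
        suffix = " ".join(words[cut:])
        if complete(prefix).strip().startswith(suffix):
            hits += 1
    return hits / len(items)

# Toy stub: behaves as if exactly one item were in its training data.
MEMORIZED = "The capital of France is Paris and its river is the Seine"

def toy_model(prompt):
    if MEMORIZED.startswith(prompt):
        return MEMORIZED[len(prompt):]
    return "I do not recall this text."

items = [MEMORIZED,
         "The capital of Peru is Lima and its currency is the sol"]
print(completion_probe(items, toy_model))  # 0.5: one of two items recalled
```

A real probe would drive an actual model endpoint and use many items, but the decision rule — prefix in, exact suffix out — is the same.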
Contamination takes several forms. Direct inclusion: the exact benchmark item is in the training corpus, verbatim. Paraphrased inclusion: a Stack Exchange discussion or blog post contains the problem restated in different words with the answer attached. Solution-pattern inclusion: the benchmark itself is not present but problems with very similar structure are, so the model learns the benchmark's stylistic conventions and answer formats. Meta-contamination: the benchmark is not in the training data but a paper analyzing its failure modes, with representative examples, is.
Mitigations exist, and none is fully satisfactory. Held-out private benchmarks (LiveCodeBench, the private subset of SWE-Bench, newly authored questions written after the cutoff) work until they are used often enough that researchers infer their structure. Canary strings embedded in benchmark releases let authors detect contamination after the fact but do not prevent it. Decontamination pipelines that remove n-gram matches from training data reduce verbatim contamination but miss paraphrases. Dynamic benchmarks — evaluations whose content changes on each run — are promising but more expensive to build. The most honest labs run multiple held-out evaluations whose content they refuse to publish, explicitly because publication would destroy the measurement.
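An n-gram decontamination pass can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: the window size, the threshold of "any shared n-gram," and the toy data are all assumptions for demonstration.

```python
# Minimal n-gram decontamination sketch: drop any training document that
# shares a word n-gram with a benchmark item. Window size and toy data
# are illustrative assumptions, not a specific production pipeline.

def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs, benchmark_items, n=8):
    """Return training docs with no word n-gram overlap with any item."""
    contaminated = set()
    for item in benchmark_items:
        contaminated |= ngrams(item, n)
    return [doc for doc in train_docs
            if not (ngrams(doc, n) & contaminated)]

bench = ["What is the smallest prime greater than 100? Answer: 101"]
docs = [
    "Quiz: What is the smallest prime greater than 100? Answer: 101.",
    "Find the least prime above one hundred; the answer turns out to be 101.",
]
clean = decontaminate(docs, bench, n=5)
print(len(clean))  # 1: the verbatim copy is removed
```

Note what survives: the paraphrased restatement of the same problem passes the filter untouched, which is exactly the weakness described above.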
The concept of train-test contamination is as old as machine learning itself — it is the first warning in every introductory course. What is new is the scale at which it matters. Before 2020, the train set and the test set for a given benchmark were separate files maintained by the benchmark's authors; keeping them apart was a matter of discipline. In the pre-training era, the "train set" became "approximately the entire public internet," and the test set became a file that, having been released publicly, is also somewhere in the train set. The first high-visibility contamination analysis of a frontier model appeared in Brown et al. (2020), the GPT-3 paper, which openly reported its contamination study and acknowledged that several benchmarks were likely affected. The issue has only grown in salience since.
Public + large = contaminated. Any benchmark released on the open web is, with near-certainty, in the training corpus of any frontier model trained after its release.
Verbatim detection is easy; paraphrase detection is hard. N-gram filters catch direct copies and miss restated versions, which contribute most of the inflation.
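The asymmetry is easy to see with a Jaccard overlap on word trigrams, a common similarity proxy; the trigram size and the example sentences here are illustrative assumptions. The verbatim copy scores a perfect match, while a faithful restatement of the same problem scores zero.

```python
# Jaccard overlap of word trigrams: a toy proxy for an n-gram
# contamination filter. Trigram size and examples are illustrative.

def trigram_jaccard(a, b):
    def grams(t):
        w = t.lower().split()
        return {tuple(w[i:i + 3]) for i in range(len(w) - 2)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

item = "A train travels 60 miles in 90 minutes. What is its speed in mph?"
verbatim = "A train travels 60 miles in 90 minutes. What is its speed in mph?"
paraphrase = ("If a locomotive covers sixty miles over an hour and a half, "
              "how fast is it going?")

print(trigram_jaccard(item, verbatim))    # 1.0: flagged by the filter
print(trigram_jaccard(item, paraphrase))  # 0.0: sails straight through
```

The paraphrase poses the identical question, yet shares not a single trigram with the original, so any filter built on surface n-gram overlap is blind to it.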
Contamination inflates scores asymmetrically. Easy items the model would have gotten anyway are unaffected; hard items shift from "failed reasoning" to "recalled answer," which is where the inflation lives.
The eval crisis is structural, not malicious. No lab sets out to contaminate; contamination is the default outcome of training on the open web.