Large language models are trained on text scraped from the internet, books, code repositories, and the accumulated written output of human civilization. This corpus is large but finite. Estimates suggest that high-quality English-language text available for training amounts to roughly ten to twenty trillion tokens. Current frontier models are trained on a significant fraction of this total. The next doubling of training data cannot come from the same source, because the source is approaching exhaustion.
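To make the arithmetic concrete, the sketch below works through the doubling logic with assumed numbers: a 10 to 20 trillion token stock of high-quality text and a frontier run already consuming several trillion tokens. The specific figures are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope sketch of the data-wall arithmetic.
# All figures below are illustrative assumptions, not measurements.

CORPUS_LOW, CORPUS_HIGH = 10e12, 20e12   # assumed stock of high-quality text tokens
current_run = 8e12                        # assumed tokens consumed by a current frontier run

for doubling in (1, 2, 3):
    demand = current_run * 2 ** doubling
    print(f"doubling {doubling}: needs {demand / 1e12:.0f}T tokens; "
          f"fits 10T estimate: {demand <= CORPUS_LOW}; "
          f"fits 20T estimate: {demand <= CORPUS_HIGH}")

# Under these assumptions the first doubling already overshoots the low estimate,
# and the second overshoots the high one: fresh human text alone cannot supply it.
```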
The data wall is the AI equivalent of a semiconductor fabrication limit: a physical reality that the self-reinforcing economic cycle must either accommodate or be broken by. Moore's framework suggests the accommodation will happen — that the industry will rotate, as the semiconductor industry did, to find new dimensions of growth when old dimensions saturate. The responses already visible include synthetic data generation (where AI models produce training data for other AI models), multimodal training (where text is supplemented with images, video, and audio), and efficiency improvements that extract more capability from less data.
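One way to see the rotation is as an accounting change: capability-relevant training data becomes a weighted sum over sources rather than a count of human text tokens. The sketch below is a minimal illustration under assumed discount factors; there is no established exchange rate between synthetic or multimodal tokens and human text.

```python
# Minimal sketch of an "effective token" budget across data sources.
# The discount factors are assumptions chosen for illustration only.

def effective_tokens(human_text, synthetic, multimodal,
                     synthetic_discount=0.5, multimodal_discount=0.3):
    """Weighted token budget under assumed per-source discounts."""
    return (human_text
            + synthetic_discount * synthetic
            + multimodal_discount * multimodal)

# Hypothetical budget: 12T human text, 10T synthetic, 20T multimodal tokens.
budget = effective_tokens(12e12, 10e12, 20e12)
print(f"effective budget: {budget / 1e12:.0f}T tokens")   # 23T under these assumptions
```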
Each response carries its own risks. Synthetic data generation raises the concern of model collapse, the degradation in quality that occurs when models are trained on the outputs of other models rather than on fresh human creative work. Multimodal training expands the available corpus, but it remains unclear whether video and image tokens contribute as much to capability as text tokens do. Efficiency improvements face theoretical limits: there is only so much information extractable from a given dataset.
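Model collapse can be cartooned in a few lines. The toy below assumes each "model" is simply a Gaussian re-fit to curated samples from the previous generation; curation clips the tails, and the fitted spread narrows generation after generation. It is a sketch of the mechanism, not a claim about any real training pipeline.

```python
# Toy illustration of model collapse: each generation fits a Gaussian to
# curated samples drawn from the previous generation's fit. Dropping the most
# extreme outputs clips the tails, so the spread shrinks round after round.

import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0                          # generation 0: the "human data" distribution
for gen in range(1, 11):
    samples = sorted(random.gauss(mu, sigma) for _ in range(1000))
    kept = samples[25:-25]                    # curation: discard the extreme 5% of outputs
    mu, sigma = statistics.mean(kept), statistics.stdev(kept)
    print(f"generation {gen}: mean={mu:+.3f}, std={sigma:.3f}")

# The std falls steadily (roughly 13% per generation here): diversity collapses
# even though each individual fit still looks plausible.
```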
Unlike semiconductor walls, which physics can characterize precisely, the data wall has no precise location. How much text is enough depends on the efficiency of the training algorithm, the architecture of the model, the definition of 'high-quality,' and the degree to which synthetic data can substitute for human-generated data. None of these variables is as well understood as photolithography physics. The wall is real, but its position is uncertain, and that uncertainty is itself a form of risk that investment strategies must accommodate.
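The sensitivity of the wall's location to these variables can be illustrated with a small projection. The sketch below assumes training-data demand doubles each year while data efficiency improves at some annual rate; the parameters and the corpus size are assumptions for illustration, and shifting any of them moves the projected exhaustion year.

```python
# Rough sensitivity sketch: in what year does effective data demand exceed the
# fixed stock of high-quality text? All parameters are assumptions.

def exhaustion_year(corpus_tokens, demand_tokens, demand_growth, efficiency_gain,
                    start_year=2025, horizon=15):
    """First year effective demand exceeds the corpus, or None within the horizon."""
    for year in range(start_year, start_year + horizon):
        t = year - start_year
        effective_demand = demand_tokens * (demand_growth ** t) / (efficiency_gain ** t)
        if effective_demand > corpus_tokens:
            return year
    return None

for corpus in (10e12, 20e12):
    for eff in (1.0, 1.3):                    # 1.3 ~ assumed 30%/year data-efficiency gain
        year = exhaustion_year(corpus, demand_tokens=8e12, demand_growth=2.0,
                               efficiency_gain=eff)
        print(f"corpus {corpus / 1e12:.0f}T, efficiency gain {eff:.1f}: wall at {year}")
```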
The data wall concept emerged in AI research around 2022, when researchers at Epoch AI (Villalobos et al.) and others published estimates of high-quality token availability and projected exhaustion timelines. The framing of the data wall as an analog to semiconductor fabrication limits draws on applications of Moore's framework to AI scaling by Jensen Huang (who coined 'Moore's Law squared'), Jaime Sevilla, and others tracking empirical compute trends.
High-quality text is finite. The estimated ten to twenty trillion tokens of high-quality English text represent a physical ceiling on conventional training data.
Frontier models are approaching saturation. Current training runs consume a significant fraction of the available corpus, leaving little room for the next doubling.
Synthetic data is a partial solution. AI-generated training data can extend the corpus but risks model collapse if used exclusively.
Multimodal expansion is a dimensional rotation. Video, image, and audio tokens represent new dimensions of scaling when text saturates.
The wall's location is uncertain. Unlike semiconductor limits, the data wall cannot be characterized precisely, complicating investment and research planning.