The finite supply of high-quality human-generated training text, estimated at ten to twenty trillion tokens, is approaching exhaustion as frontier AI models consume a significant fraction of the available corpus. This threatens the continued validity of the scaling laws unless the industry rotates onto new data dimensions.
Large language models are trained on text scraped from the internet, books, code repositories, and the accumulated written output of human civilization. This corpus is large but finite. Estimates suggest that high-quality English-language text available for training amounts to roughly ten to twenty trillion tokens. Current frontier models are trained on a significant fraction of this total. The next doubling of training data cannot come from the same source, because the source is approaching exhaustion.
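The doubling argument above can be made concrete with a little arithmetic. The sketch below is only an illustration: the ten to twenty trillion token range comes from the estimate quoted here, while the fraction already consumed (`ASSUMED_FRACTION_USED`) and the helper name `doublings_remaining` are hypothetical, chosen to show how quickly the headroom disappears.

```python
import math


def doublings_remaining(stock_tokens: float, used_tokens: float) -> float:
    """Return how many times a training set of `used_tokens` can double
    before it exceeds a fixed stock of `stock_tokens` (never below zero)."""
    if stock_tokens <= 0 or used_tokens <= 0:
        raise ValueError("token counts must be positive")
    return max(0.0, math.log2(stock_tokens / used_tokens))


# Stock range quoted in the text: ten to twenty trillion tokens.
LOW_STOCK = 10e12
HIGH_STOCK = 20e12

# Hypothetical assumption, NOT from the source: the text says only
# "a significant fraction", so one half is used purely for illustration.
ASSUMED_FRACTION_USED = 0.5

for stock in (LOW_STOCK, HIGH_STOCK):
    used = ASSUMED_FRACTION_USED * stock
    left = doublings_remaining(stock, used)
    print(f"stock={stock:.0e} tokens, used={used:.0e} tokens: "
          f"{left:.1f} doublings left")
```

Under the assumed fraction of one half, exactly one doubling remains regardless of which end of the stock estimate is used. Because headroom is logarithmic in the ratio of stock to tokens consumed, even a stock twice as large as estimated would buy only one more doubling.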
The Data Wall
In The You On AI Field Guide
The data wall is the AI equivalent of a semiconductor fabrication limit: a physical reality that the self-reinforcing economic cycle must either accommodate or be broken by. Moore's framework suggests the accommodation will happen — that the industry will rotate, as the semiconductor industry did, to find new dimensions of growth when old dimensions saturate. The responses already visible include synthetic data generation (where