Model collapse is the technical name for a phenomenon that emerged as large language models began producing a substantial fraction of the text circulating on the internet: when later generations of models train on data that includes their predecessors' output, quality degrades across generations. The distribution of generated text narrows. Rare phenomena — the long tail of human linguistic creativity — disappear first. The output becomes more predictable, more generic, more confidently wrong about the edges of knowledge. The biological analog is inbreeding depression. The ecological analog is soil depletion. In each case, the system loses vitality because the refresh mechanism has broken.
The phenomenon was documented formally by Shumailov et al., first in a 2023 preprint titled 'The Curse of Recursion' and then in a 2024 Nature paper, 'AI models collapse when trained on recursively generated data.' Training a series of language models on the output of previous generations produced rapid degradation: by the ninth generation, the models were producing incoherent output. Similar results have been found for image generation models and for recommender systems. The mechanism is well understood: synthetic data, even when individually indistinguishable from human data, collectively represents a narrower distribution, and each training cycle narrows it further.
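A toy simulation makes the narrowing visible. What follows is a minimal caricature, not the paper's experiment: stand in for a "model" with a Gaussian fitted to a finite sample, and let each generation train only on the previous generation's output. Every refit is slightly imperfect because of sampling noise, the next generation inherits that error, and nothing in the loop corrects it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # a small corpus per generation, so estimation error is visible

data = rng.normal(0.0, 1.0, n)  # generation 0: "human" data
for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()   # fit the "model" to the current corpus
    data = rng.normal(mu, sigma, n)       # next generation trains only on model output
    if gen % 50 == 0:
        print(f"generation {gen}: fitted std = {sigma:.3f}")
```

The fitted spread typically drifts toward zero over the run: each generation's losses are never replenished, so the errors compound in one direction.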
The parallel to agricultural soil depletion is precise enough to be analytical rather than metaphorical. The farmer who takes crop after crop without returning organic matter to the soil mines the biological capital accumulated over millennia. Yields climb for a time, then collapse. The AI ecosystem that ingests its own output without infusion of genuine human creative work mines the cultural capital accumulated over millennia of human expression. Output quality climbs for a time, then collapses.
The antidote is the same in both cases: infusion of fresh material from outside the closed loop. In agriculture, compost, cover crops, fallow rotation, the return of organic matter. In the intelligence ecosystem, original human creative work — writing that reflects the writer's own perception, code that expresses the developer's own understanding, analysis that emerges from genuine engagement with material rather than from AI regeneration of AI-generated text.
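The same toy model shows what an infusion does. The sketch below extends the loop above with an illustrative fresh_frac parameter (my naming, not anything from the literature), replacing a fraction of each generation's corpus with samples drawn from the original distribution.

```python
import numpy as np

def final_spread(fresh_frac, n=50, generations=200, seed=0):
    """Recursive fitting as above, refreshing a fraction of each corpus."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, n)
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()
        n_fresh = int(fresh_frac * n)
        synthetic = rng.normal(mu, sigma, n - n_fresh)  # model output
        fresh = rng.normal(0.0, 1.0, n_fresh)           # work from outside the loop
        data = np.concatenate([synthetic, fresh])
    return data.std()

print(f"closed loop: {final_spread(0.0):.3f}")  # typically drifts well below 1
print(f"20% fresh:   {final_spread(0.2):.3f}")  # typically stays near 1
```

The fresh samples anchor the loop to the original distribution: the synthetic majority can drift, but it is pulled back each generation.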
This creates a coordination problem the land ethic frames directly. Each individual practitioner is incentivized to use AI tools to the maximum extent, because the benefits accrue to the individual. The costs, in the degradation of the commons that feeds the next generation of models, are distributed across all participants. This is the tragedy of the commons in textual form. The solution, as Leopold saw for the agricultural commons, is not primarily regulatory but ethical: practitioners perceiving themselves as members of a community whose welfare depends on individual contribution.
The model collapse phenomenon was named and formalized by Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal in 'The Curse of Recursion: Training on Generated Data Makes Models Forget' (arXiv, 2023), published in Nature in 2024 as 'AI models collapse when trained on recursively generated data.' Related earlier work includes Hataya et al. on synthetic image data and Alemohammad et al. on generative image models.
Quality declines across training generations. Not linearly — accelerating as the distribution narrows.
Rare phenomena disappear first. The long tail of creativity, nuance, and accurate treatment of edge cases is the earliest casualty; the sketch after these notes shows the dynamic in miniature.
The loss is invisible to individual users. Each output looks plausible. The degradation appears only at the distribution level, aggregated across many outputs.
The remedy is fresh human work. No purely technical fix reverses the depletion. The input must come from outside the loop.
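To see why the tail goes first, one more sketch (illustrative, not drawn from the paper): draw a corpus from a Zipf-like vocabulary, let the next generation's "model" be the empirical distribution of that corpus, and repeat. A word type that fails to appear in any one generation's sample gets probability zero and can never return, so the count of distinct types only falls, and it is the rare types that fall out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 10_000
probs = 1.0 / np.arange(1, vocab + 1)  # Zipf-like long tail of word types
probs /= probs.sum()

n = 5_000  # corpus size per generation: far too small to cover the whole tail
for gen in range(8):
    corpus = rng.choice(vocab, size=n, p=probs)
    counts = np.bincount(corpus, minlength=vocab)
    print(f"generation {gen}: distinct word types = {(counts > 0).sum()}")
    probs = counts / n  # the next "model" is this corpus's empirical distribution
```

Each generation's vocabulary is a subset of the last, yet any single output still looks like plausible text: the loss shows up only in the aggregate, exactly as the notes above say.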