The training data on which large language models depend represents what may be the most consequential conversion of a public good into private value in economic history. The corpus comprises trillions of tokens drawn from books, scientific papers, encyclopedias, government publications, legal documents, and academic theses: virtually every form of written expression that has been digitized. This text was not produced by the companies that train their models on it. It was produced by billions of people over centuries, working within institutions substantially funded by public investment: publicly funded schools and universities, government-funded research programs, public libraries and archives, and the publicly built internet. Mazzucato's framework names what is happening: value extraction at civilizational scale. The creation happened across centuries of human intellectual production. The extraction happens when that production is ingested, processed through neural network architectures, and converted into a proprietary model whose capabilities reflect the accumulated knowledge of the training data but whose ownership resides entirely with the company that performed the conversion.
The training data is not a natural resource discovered by private companies. It is a cultural resource produced by human civilization over millennia, sustained by public investment in education, research, libraries, archives, and communication infrastructure. In any meaningful analytical sense, it is a public good. The books were written by authors educated in publicly funded schools and published through an industry sustained by public literacy programs. The scientific papers were written by researchers funded by government grants. Wikipedia was produced by volunteer labor building on publicly funded education.
The legal framework governing this conversion is inadequate. Copyright law was designed to protect individual works of authorship, not to address the systematic extraction of value from the entire corpus of human knowledge. The concept of fair use was not designed for the ingestion of trillions of tokens for commercial purposes. Courts in multiple jurisdictions are adjudicating whether AI training constitutes fair use, but the legal question, regardless of resolution, does not address the underlying distributional question.
The analogy to natural resource extraction is instructive but insufficient. When a mining company extracts minerals from public land, it pays a royalty for the right to extract. AI companies extract value from the training data — a publicly produced resource — and pay nothing for the right to do so. The fact that data extraction is non-rivalrous does not resolve the distributional question. The value captured by the AI company from the training data is real and substantial regardless of whether the data is depleted.
Mazzucato and Fausto Gernone addressed the creative labor dimension in their July 2025 analysis arguing that generative AI models are trained on publicly accessible creative content yet offer little to the artists, journalists, coders, and others who produce it. They proposed a levy on AI firms' revenues to fund the creative production on which AI depends — treating creative knowledge explicitly as a public good that requires collective funding.
Several mechanisms have been proposed for addressing this extraction. A training data dividend, modeled on the Alaska Permanent Fund, could distribute a share of AI company revenue to the public. A public licensing framework could require AI companies to pay fees for publicly produced training data, with fees directed to public research and education funds. A data commons trust could manage the public's interest in the training data collectively. Mazzucato's broader proposal for a digital windfall tax addresses the same asymmetry from a different angle.
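As a purely illustrative sketch of the dividend mechanism's arithmetic, consider a flat levy on industry revenue distributed per capita, in the spirit of the Alaska Permanent Fund. All figures and parameter names here are hypothetical assumptions for exposition, not estimates drawn from Mazzucato's or anyone else's proposal:

```python
# Illustrative arithmetic for a training data dividend, loosely modeled on
# the Alaska Permanent Fund's revenue-sharing logic. Every number below is
# a hypothetical assumption, not a figure from any actual proposal.

def training_data_dividend(annual_revenue: float,
                           levy_rate: float,
                           population: int) -> float:
    """Return the per-person annual dividend from a levy on AI revenue."""
    if not 0.0 <= levy_rate <= 1.0:
        raise ValueError("levy_rate must be a fraction between 0 and 1")
    fund = annual_revenue * levy_rate   # total collected by the levy
    return fund / population            # distributed equally per capita

# Hypothetical example: $100B in industry revenue, a 5% levy,
# shared equally across 300 million people.
per_person = training_data_dividend(100e9, 0.05, 300_000_000)
# -> about $16.67 per person per year
```

The point of the sketch is only that the distributional logic is simple; the hard questions, which the prose above identifies, are the levy rate, the revenue base, and who counts as the public.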
The public good argument for training data emerged as courts and regulators began grappling with ongoing lawsuits over AI training and copyright — the Authors Guild letter, New York Times v. OpenAI, Andersen v. Stability AI, and dozens of similar cases filed between 2022 and 2025. The framing has roots in earlier debates over genetic commons, cultural commons, and the governance of shared resources that Elinor Ostrom's work on commons had already illuminated.
Mazzucato's specific application emerged through her collaboration with Fausto Gernone in 2024–2025 and her sustained critique of AI governance frameworks that treated training data as a raw material available for private appropriation.
Civilizational-scale extraction. The training corpus represents humanity's accumulated written output — a resource no private company produced.
Public institutional substrate. The schools, universities, libraries, and publishing infrastructure that produced the data were substantially publicly funded.
Non-rivalrous does not mean non-valuable. Extraction does not deplete the resource but does capture value that flows nowhere back to its producers.
Environmental dimension. The computational cost of training on public data is borne collectively through energy and water consumption while returns flow privately.
Imperfect mechanism superior to no mechanism. The absence of a perfect framework does not justify having none — the mineral rights precedent shows that imperfect return-sharing is better than no return-sharing at all.
Technology industry advocates argue that training data extraction falls under fair use and that any compensation framework would damage AI development. Mazzucato's response is that fair use was designed for individual scholarly quotation, not trillion-token commercial ingestion, and that every previous industry built on public resources (mining, oil, broadcast spectrum) has eventually accepted some form of public return — to the benefit of both industry stability and public welfare.