CONCEPT
Training Data as Public Good
The conversion of humanity's accumulated written output — produced over centuries, sustained by public education and research — into private proprietary value, without compensation flowing back to the public that produced the resource.
The training data on which large language models depend represents what may be the most consequential conversion of a public good into private value in economic history. That data comprises trillions of tokens drawn from books, scientific papers, encyclopedias, government publications, legal documents, and academic theses: virtually every form of written expression that has been digitized. This text was not produced by the companies that train their models on it. It was produced by billions of people over centuries, working within institutions substantially funded by public investment: publicly funded schools and universities, government-funded research programs, public libraries and archives, and the publicly built internet. Mazzucato's framework names what is happening: value extraction at civilizational scale.
The creation happened across centuries of human intellectual production. The extraction happens when that production is ingested, processed through neural network architectures, and converted into a proprietary model whose capabilities reflect the accumulated knowledge in the training data but whose ownership resides entirely with the company that performed the conversion.