CONCEPT

Training Data Expropriation

The AI industry's unauthorized use of collective knowledge as training material for proprietary models. Noble's framework reveals this not as a technical necessity but as a specific political choice: to treat the intelligence commons as an extraction opportunity.

Training data expropriation is the specific mechanism by which AI companies assemble their training corpora: aggregation of publicly accessible text, code, and media without consent from, or compensation to, the producers of the underlying content. The practice has been normalized through an expansive reading of the fair use doctrine and through the implicit argument that the content was already public. Noble's framework reveals this normalization as a political operation: what makes the practice possible is not the technology but the institutional arrangement, namely the absence of data ownership frameworks, the weakness of collective bargaining structures for knowledge workers, and the regulatory gap that allowed assembly to occur before norms could be established.
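
A minimal sketch of the assembly step can make the consent gap concrete. The crawler below is hypothetical (the fetch_if_allowed helper and the "example-crawler" agent string are invented for illustration), but the structural point holds for real pipelines: robots.txt is the only machine-readable consent signal the process consults, while authorship, licensing, and compensation are invisible at this layer.

    import urllib.robotparser
    import urllib.request
    from urllib.parse import urlparse

    def fetch_if_allowed(url: str, agent: str = "example-crawler") -> str | None:
        """Fetch a page for a training corpus, gated only by robots.txt."""
        parts = urlparse(url)
        robots = urllib.robotparser.RobotFileParser()
        # robots.txt lives at the site root, e.g. https://example.com/robots.txt
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()
        if not robots.can_fetch(agent, url):
            return None  # the sole consent check in the whole pipeline
        # Terms of service, copyright status, and author consent are
        # invisible here; the page is just bytes to be tokenized.
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")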

In the AI Story

[Hedcut illustration: Training Data Expropriation]

The empirical scale is vast. Current frontier models are trained on corpora estimated at ten to twenty trillion tokens, a substantial fraction of all publicly accessible text ever written. The corpora include books (acquired through various means, including acknowledged use of pirated libraries such as Library Genesis), newspapers and magazines (scraped without license from publishers), academic papers (often paywalled but accessible through various aggregation services), open-source code (licensed under terms the training use may or may not comply with), and social media posts, blogs, and forum discussions (contributed by individual users under terms of service that did not contemplate AI training).
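
A back-of-envelope calculation shows what the token estimate implies. The per-book and per-token figures below are rough rules of thumb, not measurements.

    # Rough assumptions: ~75k-word book at ~1.3 tokens per word,
    # and ~4 characters per token for English text.
    TOKENS_LOW, TOKENS_HIGH = 10e12, 20e12   # estimated frontier corpus size
    TOKENS_PER_BOOK = 100_000
    CHARS_PER_TOKEN = 4

    for tokens in (TOKENS_LOW, TOKENS_HIGH):
        book_equivalents = tokens / TOKENS_PER_BOOK
        terabytes = tokens * CHARS_PER_TOKEN / 1e12
        print(f"{tokens:.0e} tokens ~ {book_equivalents / 1e6:.0f} million "
              f"book-equivalents, ~ {terabytes:.0f} TB of raw text")
    # 1e+13 tokens ~ 100 million book-equivalents, ~ 40 TB of raw text
    # 2e+13 tokens ~ 200 million book-equivalents, ~ 80 TB of raw text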

The legal position of the major AI companies is that this use constitutes fair use under U.S. copyright law: training on copyrighted works to produce a model that does not directly reproduce them is, on this account, a transformative use analogous to reading. The position is contested in multiple active lawsuits, including Authors Guild v. OpenAI, The New York Times v. Microsoft and OpenAI, and Getty Images v. Stability AI. The legal outcome will matter, but the structural question Noble's framework raises is prior to the legal one: who gets to decide how collective knowledge is used, and through what institutional processes?

The current answer is that AI companies decided unilaterally, assembled corpora without consultation, deployed models trained on those corpora, and are now using their deployed capabilities and the financial resources derived from them to defend the original decision in court. The procedural parallel is to the enclosure movement, which similarly established facts on the ground (fences erected, tenants evicted, commons converted) before legal and political responses could catch up.

The suppressed alternative — a consent-and-compensation framework for training data, with contributing communities governing the use of their collective output — is not speculative. Multiple proposals exist, from Lanier and Weyl's data dignity framework to various open-source licensing innovations to proposed regulatory frameworks in the EU AI Act. Their underdevelopment is not technical but political: the actors who would benefit from them lack the institutional power to establish them against the resistance of the actors who benefit from the current arrangement.
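
As one illustration of what such a framework might require at the data layer, the sketch below outlines a consent-and-compensation record. The field names and the Consent states are invented for illustration; they are not drawn from the Lanier-Weyl proposal, the EU AI Act, or any deployed system.

    from dataclasses import dataclass
    from enum import Enum

    class Consent(Enum):
        GRANTED = "granted"    # explicit opt-in to training use
        DENIED = "denied"      # explicit opt-out
        UNASKED = "unasked"    # the status of nearly all current training data

    @dataclass
    class TrainingDataRecord:
        source_url: str
        rights_holder: str                 # individual, publisher, or community
        consent: Consent = Consent.UNASKED
        compensation_rate: float = 0.0     # e.g. payment per thousand tokens
        governing_body: str | None = None  # collective steward, if any

        def usable_for_training(self) -> bool:
            # Under a consent framework, UNASKED is not usable; under the
            # current arrangement, UNASKED is treated as if it were GRANTED.
            return self.consent is Consent.GRANTED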

Origin

The framework develops across multiple contemporary analyses — Kate Crawford's Atlas of AI, Matteo Pasquinelli's work on AI's labor origins, the emerging legal scholarship on AI training and copyright, and the ongoing court cases that will shape the legal landscape. Noble's framework provides the analytical vocabulary for naming the structural dynamics that the legal discussion tends to obscure.

Key Ideas

Unilateral assembly. Training corpora were assembled without consultation with the producers of the underlying content; consent and compensation were not part of the process.

Legal defense after the fact. The fair use arguments defending the assembly came after the assembly had occurred, with the assembled capabilities providing the resources for the defense.

Structural enclosure. The practice converts collective knowledge into proprietary training material, reproducing the enclosure dynamics documented in agricultural and other commons conversions.

Alternative frameworks exist. Consent-and-compensation systems are not speculative; they are underdeveloped for political rather than technical reasons.

Debates & Critiques

AI company defenders argue that requiring consent and compensation for training data would make current-capability AI economically infeasible. The critical response concedes that it would change the economics but insists that this is precisely the point: the current economics rest on an extraction that was never negotiated. A renegotiation that redistributed value to the producers of the underlying knowledge would produce a different technology with different distributional properties, and that distribution is the structural question, not a side issue.

Further reading

  1. Kate Crawford, Atlas of AI (Yale University Press, 2021)
  2. Jaron Lanier and E. Glen Weyl, "A Blueprint for a Better Digital Society" (Harvard Business Review, 2018)
  3. Matteo Pasquinelli, The Eye of the Master (Verso, 2023)
  4. Authors Guild et al. v. OpenAI, complaint filings (2023–)
  5. Mary Gray and Siddharth Suri, Ghost Work (Houghton Mifflin Harcourt, 2019)