You On AI Field Guide · Training Data Expropriation
CONCEPT

Training Data Expropriation

The AI industry's unauthorized use of collective knowledge as training material for proprietary models — Noble's framework reveals this not as technical necessity but as a specific political choice to treat the intelligence commons as an extraction opportunity.

Training data expropriation is the mechanism by which AI companies assemble their training corpora: the aggregation of publicly accessible text, code, and media without consent from or compensation to the producers of the underlying content. The practice has been normalized through legal interpretations of the fair use doctrine and through the implicit argument that the content was already public. Noble's framework reveals this normalization as a political operation: what makes the practice possible is not the technology but the institutional arrangement, namely the absence of data ownership frameworks, the weakness of collective bargaining structures for knowledge workers, and the regulatory gap that allowed the corpora to be assembled before norms could be established.

In The You On AI Field Guide

The empirical scale is substantial. Current frontier models are trained on corpora estimated at ten to twenty trillion tokens — a substantial fraction of all publicly accessible text ever written. The corpora include books (acquired through various means,