Training data expropriation is the specific mechanism by which AI companies assemble their training corpora: aggregation of publicly accessible text, code, and media without consent from or compensation to the producers of the underlying content. The practice has been normalized through legal interpretation of fair use doctrine and through the implicit argument that the content was already public. Noble's framework reveals this normalization as a political operation: what makes the practice possible is not the technology but the institutional arrangement — the absence of data ownership frameworks, the weakness of collective bargaining structures for knowledge workers, the regulatory gap that allowed assembly to occur before norms could be established.
The empirical scale is substantial. Current frontier models are trained on corpora estimated at ten to twenty trillion tokens — a substantial fraction of all publicly accessible text ever written. The corpora include books (acquired through various means,