CONCEPT

Training Data as Pirated Labor

The Le Guin volume's framing of AI training corpora as the unauthorized consumption of millions of creators' life's work — not merely a copyright dispute but a structural replication of colonial extraction, treating human creative output as free raw material for corporate capture.
Large language models are trained on text scraped from books, websites, forums, and repositories of human expression — hundreds of billions of words written by millions of people across decades. The creators were not asked permission. They were not compensated. Their labor — the years of practice that produced the writing, the struggle with language that built their capacity — was consumed as training data, processed into statistical patterns, and now powers AI systems that generate text in response to prompts.

Le Guin's own works, including "The Ones Who Walk Away from Omelas," were found in pirated datasets (Books3, others) used to train the models. The irony is almost too neat: the woman who warned that "the profit motive is often in conflict with the aims of art" had her art taken without permission to train the profit engines.

The Le Guin volume treats this not as copyright infringement (a legal category that may or may not apply, depending on courts) but as a structural relationship: the AI industry relates to creative labor the way colonial industries related to indigenous resources — as inputs available for extraction, whose producers' consent is not required, whose value is determined by the extractor rather than the producer.

In The You On AI Encyclopedia

The legal question is unsettled. Recent rulings (2024–2025) have held that using copyrighted books to train AI models constitutes fair use under U.S. law, a determination that permits the practice without addressing its justice. Fair use is a legal category answering "is this allowed?" not "is this right?" Le Guin's framework insists on the second question. The books were written by people who spent years, often decades, building the capacity to write them. That capacity is not a natural endowment; it is a developmental achievement, built through the practice Le Guin described throughout her career. The practice is labor — cognitive, imaginative, emotionally costly labor that produces the sentences, paragraphs, and narrative structures that training data consists of. Treating this labor as free raw material is not a copyright question. It is a labor question, a commons question, a question about who captures the value created by accumulated human effort.

The colonial analogy is precise, not rhetorical. Colonial extraction operated by treating indigenous resources (land, minerals, labor) as available for European appropriation. The legal architecture (property law recognizing only European-style titles, labor law permitting indenture) made the extraction legal. The extraction was still theft — not because the law said so (the law said the opposite) but because the producers were not compensated, were not asked, and often were destroyed by the taking. AI training data extraction follows the same structure: the legal architecture (copyright law written before anyone imagined this use case) may permit the taking. The taking is still consumption of labor without compensation, and the fact that the labor is cognitive rather than physical does not make the relationship less extractive. Le Guin would have recognized the pattern instantly: the powerful take what they need from the less powerful, the legal system legitimizes the taking, and the taken-from are told they should be grateful for the opportunity to contribute to progress.

The Books3 dataset (used to train LLaMA and other models) contained over 190,000 books, most pirated from Library Genesis and other shadow libraries. Authors — including Le Guin, whose works appeared in multiple training corpora — were not notified. Internal communications at AI companies (released through discovery in ongoing lawsuits) show researchers knew the material was pirated and proceeded anyway, justified by the logic of competition: if we don't use it, our competitors will, and we cannot build competitive models on legally licensed text alone (the corpus is too small, too biased, too expensive). The justification is rational within the weapon narrative (competitive necessity, move fast, acceptable risk). It is theft within the carrier bag narrative (whose labor filled the bag? were they compensated? did anyone ask?). Le Guin's framework makes visible that these are not two descriptions of the same act but two moral realities determined by framework choice.

The deeper violation is not the unauthorized copying (legal systems can and will resolve that) but the conversion of practice into commodity. The books in the training corpus are the residue of practice — Le Guin's sixty years of learning to write, learning to see, learning to think in the specific ways that her novels demonstrate. The practice is inseparable from the person who practiced it. The commodity (the text) is what the AI consumed, and the consumption erases the practice, erases the person, treats the accumulated lifetime of disciplined attention as a data source whose origin is irrelevant. Le Guin argued throughout her career that this erasure is the defining violence of market logic: it sees the object, not the maker; the output, not the practice; the commodity, not the art. The training data extraction is market logic achieving its purest form: the total separation of product from producer, made possible by a technology that can replicate the surface of creative work without the need for the creative worker.

Origin

The issue became visible in 2023 when datasets used to train major models (LLaMA and others) were publicly documented. Researchers found Books3 (a pirated collection), WebText, and other corpora containing copyrighted material used without permission. Authors including Margaret Atwood, Jhumpa Lahiri, Jonathan Franzen, and Le Guin's estate protested. Class-action lawsuits followed (Authors Guild, individual authors, visual artists). The legal question is whether training constitutes fair use (courts are split; the question remains unsettled as of 2026). The moral question is what the Le Guin volume addresses: whether the unauthorized consumption of creative labor to train commercial systems is theft, regardless of what copyright law concludes. Le Guin died in 2018, before the issue became public, but her essays on commodification and her entire corpus's insistence on seeing the labor behind the product provide the framework that makes the consumption visible as consumption rather than as "progress."

Key Ideas

Training data is labor, not resource. The books in Books3 represent millions of hours of practice by people who spent careers learning to write — treating this as free raw material is consuming labor without compensation, structurally identical to colonial resource extraction.

Legal permissibility is not moral justification. Courts may rule training is fair use; the ruling does not address whether the practice is just — law answers "is this allowed," ethics answers "is this right," and the two questions are not equivalent.

Le Guin's work was pirated to train the system. The irony (her warnings about commodification were commodified, her art was consumed to power the engines she opposed) is not accidental but diagnostic — the market logic she identified operates without regard for the content of what it consumes.

The erasure of the maker. AI training separates product (text) from producer (writer's decades of practice) completely — the consumption treats the accumulated lifetime of disciplined attention as a data source whose origin is irrelevant, which is the defining operation of commodity logic.

The colonial structure. Powerful actors take what they need from less powerful producers, the legal system legitimizes the taking, the taken-from are told they benefit from contributing to progress — the structure is extraction, regardless of what the law permits or the extractors intend.

The creditor class. Creators whose work trained the models are owed something — not as charity but as debt, because the AI industry's gains were purchased by consuming their labor, and the purchase was not a transaction (which requires consent) but an appropriation.
