EVENT

Le Guin's Works in Training Data

The posthumous irony: Le Guin's copyrighted works—including her warnings about commodification—were scraped without permission to train the systems she opposed.
Dozens of Ursula K. Le Guin's books, including "The Ones Who Walk Away from Omelas" and The Dispossessed, were found in pirated datasets (Books3, others) used to train large language models. The woman who wrote that "the profit motive is often in conflict with the aims of art" and who defended artists' rights to control their work's use had her life's creative output consumed without permission or compensation to power the AI systems she had warned against in her 2014 National Book Foundation speech. The appropriation is not metaphorical; it is the literal instantiation of the structure she described: creative labor extracted, processed into a commodity (training data), used to generate profit for people who did not create the work and do not value the practice that created it. Her estate's 2023 participation in the Authors Guild lawsuit against OpenAI sought remedy, but the legal outcome cannot restore the epistemic violence—the conversion of a lifetime's artistic practice into statistical patterns available on command.

In The You On AI Encyclopedia

The Books3 dataset, part of The Pile training corpus, contained approximately 197,000 books scraped from Bibliotik and other pirate sites. Analysis published in 2023 by The Atlantic identified Le Guin's works among thousands of copyrighted books used without authorization. The specific titles include The Left Hand of Darkness, The Dispossessed, the Earthsea novels, and short story collections containing "The Ones Who Walk Away from Omelas." The irony of training AI on a story about hidden costs to produce profitable systems with hidden costs has not been lost on commentators, but the irony is almost too neat—it risks becoming a rhetorical gesture rather than a material fact. The material fact is: Le Guin's labor was taken, her warnings were processed as data, and the systems now generate "Le Guin-style" prose on demand.

Le Guin's 2014 National Book Foundation speech explicitly addressed the appropriation of creative work: "We live in capitalism. Its power seems inescapable." She died in January 2018; within two years her works were being used to train commercial AI systems that embody precisely the commodification she warned against. The temporal sequence—warning, death, appropriation—makes the structure viscerally clear in a way that a living author's protest might not. She cannot object; she cannot consent; her work has been converted into a resource that serves purposes she explicitly rejected, and the conversion is defended as "fair use" by legal frameworks that serve capital over creators.

The Authors Guild lawsuit, filed in September 2023, named Le Guin's estate among the plaintiffs. The legal theory is that training AI on copyrighted works without permission is not transformative fair use but unauthorized derivative creation at industrial scale. Early rulings have gone both ways—some courts holding that training is fair use, others allowing claims to proceed. But the legal outcome, whatever it is, cannot address the epistemic injury: the transformation of practice into product, of a life's work into training data, of embodied meaning into statistical pattern. Le Guin's framework insists these are losses that no compensation can make whole, because the loss is not of property but of the thing property law was never designed to protect—the integrity of the practice, the right to determine how one's work enters the world.

The most painful irony is functional, not merely poetic. Large language models trained on Le Guin now generate outputs that mimic her style—the characteristic sentence structures, the vocabulary, the rhetorical moves—without the practice that built the style. The model produces Le Guin-shaped text, and the text may be adequate for many purposes, but it is emptied of the thing that made Le Guin's prose Le Guin's: the practice of attention, the discipline of seeing clearly, the refusal to accept the language's default patterns. The style was evidence the practice had occurred; now the style is reproducible without the practice, and the reproduction is what Le Guin's entire career argued against—the substitution of product for practice, of commodity for art, of extraction for reciprocity.

Origin

The Books3 dataset was assembled from Bibliotik torrents by Shawn Presser and released in 2020 as part of The Pile, a composite training corpus built by EleutherAI. The dataset's existence became widely known in 2023 when The Atlantic's Alex Reisner published an article identifying thousands of authors whose works had been included. Le Guin's estate, represented by the Authors Guild, joined the class-action lawsuit against OpenAI and other AI companies in 2023. The lawsuit remains unresolved as of this writing, but the legal arguments are less important to Le Guin's framework than the structural pattern: creative labor appropriated without consent, processed into commodities, used to generate profit for parties who contributed nothing to the original creation.

Key Ideas

The prophet consumed by her prophecy. Le Guin warned that profit and art are in conflict; her art was then consumed without payment to power profit-maximizing systems—the structure she described enacted on her corpus.

Practice converted to data. A lifetime of disciplined attention, of struggling with language until it yielded meaning, processed into statistical patterns that reproduce the surface without the depth.

Style without practice. Large language models generate Le Guin-inflected prose on command—the signature sentence structures, the vocabulary—emptied of the attention that built the style.

Legal remedy cannot restore epistemic integrity. Compensation, if awarded, addresses property injury; it cannot restore the practice/product distinction or the right to determine how one's work enters the world.

The child is the author. Training data appropriation is the basement of the AI utopia—the specific, non-abstract cost that productivity narratives and democratization rhetoric conceal.
