CONCEPT
Training Data as Pirated Labor
The Le
Guin volume's framing of AI training corpora as the unauthorized consumption of millions of creators' life's work — not merely a copyright dispute but a structural replication of colonial extraction, treating human creative output as free raw material for corporate capture.
Large language models are trained on text scraped from books, websites, forums, and repositories of human
expression — hundreds of billions of words written by millions of people across decades. The creators were not asked permission. They were not compensated. Their labor — the years of practice that produced the writing, the struggle with language that built their capacity — was consumed as training data, processed into statistical patterns, and now powers AI systems that generate text in response to prompts. Le Guin's own works, including "
The Ones Who Walk Away from
Omelas," were found in pirated datasets (Books3, others) used to train the models. The irony is almost too neat: the woman who warned that "the profit motive is often in conflict with the aims of art" had her art taken without permission to train the profit engines. The Le Guin volume treats this not as copyright infringement