The Training Corpus Question — Orange Pill Wiki
CONCEPT

The Training Corpus Question

The central unresolved legal and ethical problem of AI production — whether the use of copyrighted works as training data constitutes infringement, a new form of enclosure, or something the existing copyright framework cannot adequately classify.

The most consequential legal question of the AI era is not whether machines can be authors. It is who owns the tradition. Every large language model is trained on a corpus — the accumulated deposit of human textual production, layer upon layer, from which the model extracts statistical patterns that enable it to generate new text. The legal question is whether the use of copyrighted works as training data constitutes infringement — whether the extraction of patterns from a copyrighted text, without reproducing the text itself, violates the author's exclusive rights. The question has no settled answer. Courts in multiple jurisdictions are considering it. And the question cannot be adequately analyzed within the Romantic framework that currently governs copyright, because the framework was designed for a world in which the primary economic threat was unauthorized reproduction.

In the AI Story


The training corpus is not a database in the conventional sense — not a structured collection of discrete records that can be individually retrieved. It is something more like a landscape: layer upon layer of human writing, from which the model extracts statistical patterns. The model does not memorize individual texts. It learns the patterns that connect them — the regularities of syntax, argument, narrative, and expression that constitute the deep structure of human written communication.

The existing idea-expression framework cannot sort the training use cleanly. Is the model using the ideas in a copyrighted text (permitted) or the expression (not permitted)? The statistical learning process operates on both simultaneously, extracting patterns that are neither purely idea nor purely expression.

The enclosure analogy illuminates the distributional dimension. Just as English common lands were progressively privatized through the enclosure acts, the textual commons is being enclosed through AI training: the accumulated intellectual labor of millions of creators is extracted without consent or compensation, and the value accrues primarily to the companies that build and deploy the models.

Current litigation — notably The New York Times v. OpenAI — presses the tension into the legal system. The outcomes will shape the economics of AI for decades and may require the development of entirely new legal categories that the existing framework cannot supply. Collective licensing schemes, training data royalties, and statutory compulsory licenses are among the proposed responses.

Origin

The question emerged as a major public issue in 2022–2023 with the release of widely used large language models and the first lawsuits alleging infringement through training use. Prior instances — smaller-scale text corpus uses for academic research, for example — had not raised the same distributional concerns, because the extraction was not commercial or because the scale was limited.

Woodmansee anticipated the structural issue in her 1992 observation that electronic communication was assaulting the distinction between mine and thine that the authorship construct was designed to enforce. The AI training corpus intensifies that assault: it does not merely blur the distinction between one writer's text and another's; it dissolves the boundary by making every text a potential ingredient in a statistical model whose output bears no individual fingerprint.

Key Ideas

Pattern extraction versus reproduction. The training process does not reproduce individual texts; it extracts statistical patterns from the aggregate. The existing copyright framework regulates reproduction, not pattern extraction, and the gap between what it regulates and what the technology does is the space in which the question operates.

Aggregate value, individual rights. The training corpus has value precisely because of its aggregation. No individual text is sufficient to train a capable model. The value is a property of the collection, not of any element within it — a structural feature the individual-rights copyright framework cannot capture.

Enclosure without compensation. The economic structure of the AI industry captures the value of the textual commons without compensating the creators whose work constitutes it. The structural injustice is real and is not addressed by extending existing individual-rights frameworks.

Collective mechanisms required. Adequate responses — collective licensing, training royalties, statutory compulsory licenses, data trusts — all operate at the level of the commons rather than at the level of the individual work. The existing legal apparatus is not well-suited to collective mechanisms, and building them requires institutional innovation.

The Woodmansee-Jaszi warning. In 1994 the authors of The Construction of Authorship wrote that copyright may be inapposite to the realities of cultural production. The training corpus question is the starkest confirmation of that warning.

Debates & Critiques

Positions range widely. AI companies argue that training use is fair use — transformative, non-reproductive, productive of new capability. Creator organizations argue that training use is infringement at scale, extracting commercial value from protected works without license. Scholars propose various middle paths: compulsory licensing regimes, data trusts, collective bargaining mechanisms, or statutory modification of fair-use doctrine. The legal resolution remains uncertain; the philosophical resolution requires the reconstruction of copyright's foundations that Woodmansee's framework makes possible.


Further reading

  1. Pamela Samuelson, Generative AI Meets Copyright (Science, 2023)
  2. James Grimmelmann et al., Talkin' 'bout AI Generation: Copyright and the Generative-AI Supply Chain (2023)
  3. Matthew Sag, Copyright Safety for Generative AI (Houston Law Review, 2023)
  4. Jessica Litman, Digital Copyright (Prometheus Books, 2001)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.