CONCEPT
The Training Corpus Question
The central unresolved legal and ethical problem of AI production: whether the use of copyrighted works as training data constitutes infringement, a new form of enclosure, or something the existing copyright framework cannot adequately classify.
The most consequential legal question of the AI era is not whether machines can be authors. It is who owns the tradition. Every large language model is trained on a corpus: the accumulated deposit of human textual production, layer upon layer, from which the model extracts the statistical patterns that enable it to generate new text. The legal question is whether using copyrighted works as training data constitutes infringement, that is, whether extracting patterns from a copyrighted text without reproducing the text itself violates the author's exclusive rights. The question has no settled answer, and courts in multiple jurisdictions are considering it. Nor can it be adequately analyzed within the Romantic framework that currently governs copyright, because that framework was designed for a world in which the primary economic threat was unauthorized reproduction.
In The You On AI Field Guide
The training corpus is not a database in the conventional sense