CONCEPT

The Training Corpus Question

The central unresolved legal and ethical problem of AI production — whether the use of copyrighted works as training data constitutes infringement, a new form of enclosure, or something the existing copyright framework cannot adequately classify.

The most consequential legal question of the AI era is not whether machines can be authors. It is who owns the tradition. Every large language model is trained on a corpus: the accumulated deposit of human textual production, layer upon layer, from which the model extracts statistical patterns that enable it to generate new text. The legal question is whether using copyrighted works as training data constitutes infringement, that is, whether extracting patterns from a copyrighted text without reproducing the text itself violates the author's exclusive rights. The question has no settled answer; courts in multiple jurisdictions are considering it. Nor can it be adequately analyzed within the Romantic framework that currently governs copyright, because that framework was designed for a world in which the primary economic threat was unauthorized reproduction, not the extraction of patterns from a work.

In The You On AI Field Guide

The training corpus is not a database in the conventional sense
