The Training Data Question — Orange Pill Wiki
CONCEPT

The Training Data Question

The governance regime change in which the accumulated textual, visual, and computational output of millions of individuals was appropriated for AI training under terms its original contributors did not contemplate — the paradigmatic case of commons appropriation without community participation.

The training data from which large language models learn constitutes, in institutional-economic terms, a commons: the accumulated textual, visual, and computational output of millions of individuals, contributed without explicit governance arrangements for this purpose to a shared pool from which value is now extracted by a small number of firms. The governance arrangements under which the data was originally contributed — the norms of the open internet, the terms of service of social platforms, the licensing frameworks of academic publishing — were designed for a world in which the data's primary use was human consumption. The appropriation of that data for AI training represents what Ostrom's framework identifies as a regime change in the commons.

In the AI Story


The appropriation was undertaken without the participation of the community whose contributions constitute the resource. The herders did not consent to having the pasture enclosed. The fishers did not agree to the sale of commercial licenses. The contributors to the training-data commons did not participate in the decision to use their contributions for purposes that the original governance arrangements did not address.

Max Fang's February 2025 Stanford working paper, "The Tragedy of the AI Data Commons," employs law-and-economics methodologies alongside Ostrom's design principles to frame precisely this dynamic. The conventional response follows Hardin's dichotomy. One camp argues for privatization: clear property rights over data, licensing regimes, compensation mechanisms. The other argues for state regulation: government-mandated data governance, algorithmic auditing, transparency requirements.

Both responses have merit and limitations. Privatization encounters the practical difficulty that the data was not produced as property — it was produced as communication, expression, participation in a shared informational environment — and retroactively imposing property frameworks creates distortions that may exceed the problem they address. State regulation encounters the enforcement challenges of regulating global digital systems through national legal frameworks.

Ostrom's framework suggests a third approach: governance arrangements developed by the communities whose contributions constitute the resource. The Mozilla Foundation, collaborating with the Ostrom Workshop at Indiana University, has developed a practical framework for applying the design principles to data commons governance. The practical challenges are significant — contributors number in the millions across every jurisdiction on earth, with no pre-existing organizational structure — but not unprecedented in Ostrom's empirical record.

Origin

The question crystallized around 2022–2023 as the commercial value of large language models became unambiguous and the provenance of their training data became litigation-relevant. The framing as a commons-appropriation question, rather than as a property-rights or regulatory question, emerged from the application of Ostrom's framework by scholars at the Ostrom Workshop and related research programs.

Key Ideas

Regime change in the commons. Data contributed under one set of governance assumptions was appropriated under another, without the participation of the contributing community.

False binary. Privatization and state regulation both have merit, but neither resolves the core question of who makes the governance arrangements.

Third option exists. Ostrom's framework supports community-based governance arrangements developed by the contributors themselves.

Practical challenges real. Scale, jurisdictional diversity, and absent organizational structure make this harder than any commons Ostrom studied — but not impossible.

Further reading

  1. Max Fang, "The Tragedy of the AI Data Commons" (Stanford working paper, 2025)
  2. Mozilla Foundation and Ostrom Workshop, data commons governance framework
  3. Charlotte Hess and Elinor Ostrom (eds.), Understanding Knowledge as a Commons (MIT Press, 2007)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.