The You On AI Encyclopedia
CONCEPT

The Training Data Question

The governance regime change in which the accumulated textual, visual, and computational output of millions of individuals was appropriated for AI training under terms their original contributions did not contemplate — the paradigmatic case of commons appropriation without community participation.
The training data from which large language models learn constitutes, in institutional-economic terms, a commons: the accumulated textual, visual, and computational output of millions of individuals, contributed without explicit governance arrangements for this purpose to a shared pool from which value is now extracted by a small number of firms. The governance arrangements under which the data was originally contributed — the norms of the open internet, the terms of service of social platforms, the licensing frameworks of academic publishing — were designed for a world in which the data's primary use was human consumption. The appropriation of that data for AI training represents what Ostrom's framework identifies as a regime change in the commons.

The appropriation was undertaken without the participation of the community whose contributions constitute the resource. The herders did not consent to having the pasture enclosed. The fishers did not agree to the sale of commercial licenses. The contributors to the training-data commons did not participate in the decision to use their contributions for purposes that the original governance arrangements did not address.

Max Fang's February 2025 Stanford working paper, "The Tragedy of the AI Data Commons," employs law-and-economics methodologies alongside Ostrom's design principles to frame precisely this dynamic. The conventional response follows Hardin's dichotomy. One camp argues for privatization: clear property rights over data, licensing regimes, compensation mechanisms. The other argues for state regulation: government-mandated data governance, algorithmic auditing, transparency requirements.

Knowledge Commons

Both responses have merit and limitations. Privatization encounters the practical difficulty that the data was not produced as property — it was produced as communication, expression, participation in a shared informational environment — and retroactively imposing property frameworks creates distortions that may exceed the problem they address. State regulation encounters the enforcement challenges of regulating global digital systems through national legal frameworks.

Ostrom's framework suggests a third approach: governance arrangements developed by the communities whose contributions constitute the resource. The Mozilla Foundation, collaborating with the Ostrom Workshop at Indiana University, has developed a practical framework for applying the design principles to data commons governance. The practical challenges are significant — contributors number in the millions across every jurisdiction on earth, with no pre-existing organizational structure — but not unprecedented in Ostrom's empirical record.

Origin

The question crystallized around 2022–2023 as the commercial value of large language models became unambiguous and the provenance of their training data became litigation-relevant. The framing as a commons-appropriation question, rather than as a property-rights or regulatory question, emerged from the application of Ostrom's framework by scholars at the Ostrom Workshop and related research programs.

Key Ideas

Regime change in the commons. Data contributed under one set of governance assumptions was appropriated under another, without the participation of the contributing community.

Tragedy of the AI Data Commons

False binary. Privatization and state regulation both have merit, but neither resolves the core question of who makes the governance arrangements.

Third option exists. Ostrom's framework supports community-based governance arrangements developed by the contributors themselves.

The practical challenges are real. Scale, jurisdictional diversity, and absent organizational structure make this harder than any commons Ostrom studied — but not impossible.

Further Reading

  1. Max Fang, "The Tragedy of the AI Data Commons" (Stanford working paper, 2025)
  2. Mozilla Foundation and Ostrom Workshop, data commons governance framework
  3. Charlotte Hess and Elinor Ostrom, eds., Understanding Knowledge as a Commons: From Theory to Practice (MIT Press, 2007)

Three Positions on The Training Data Question

From Chapter 15 — how the Boulder, the Believer, and the Beaver each read this concept
Boulder · Refusal
Han's diagnosis
The Boulder sees in The Training Data Question evidence of the pathology — that refusal, not adaptation, is the correct posture. The garden, the analog life, the smartphone that is not bought.
Believer · Flow
Riding the current
The Believer sees The Training Data Question as the river's direction — lean in. Trust that the technium, as Kevin Kelly argues, wants what life wants. Resistance is fear, not wisdom.
Beaver · Stewardship
Building dams
The Beaver sees The Training Data Question as an opportunity for construction. Neither refuse nor surrender — build the institutional, attentional, and craft governors that shape the river around the things worth preserving.

Read Chapter 15 in the book →
