Opaque Provenance — Orange Pill Wiki
CONCEPT

Opaque Provenance

The structural property of large language model outputs by which assertions cannot be traced to specific sources — producing a form of epistemic fragility that inverts the preservative powers of print.

Opaque provenance names the structural feature of AI-generated content by which an assertion cannot be traced to the specific sources that produced it. A printed book preserves a text: the text can be read, cited, verified, corrected, and argued about by anyone who holds a copy, and the provenance of every claim can be traced to a specific author, edition, and page. A large language model preserves something different: a statistical compression of millions of texts. The model's 'knowledge' is not the texts themselves but a lossy abstraction from them, one that retains patterns while discarding the specific evidence from which those patterns were derived. When an AI system asserts something, the assertion cannot be traced to a source the way a claim in a printed book can be traced to a citation. The user cannot inspect the evidence, evaluate the reliability of the sources, or distinguish an assertion derived from peer-reviewed research from one derived from a forum post.
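The loss described above can be made concrete with a toy model. The sketch below (entirely illustrative, not drawn from the source text) trains a bigram language model on two tiny 'corpora' of very different reliability. Once the counts are merged, the model can still generate fluent sequences, but nothing in its internal state records which source contributed which pattern:

```python
import random
from collections import defaultdict

# Two "sources" of unequal reliability (hypothetical example text).
peer_reviewed = "the drug reduced symptoms in controlled trials".split()
forum_post = "the drug cured everything in two days".split()

# Merge both into one table of bigram counts: only the statistical
# pattern survives; the claim-to-source mapping is discarded.
counts = defaultdict(lambda: defaultdict(int))
for corpus in (peer_reviewed, forum_post):
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1

# Generation samples from the merged counts; provenance is gone.
random.seed(0)
word, output = "the", ["the"]
for _ in range(5):
    nxt = counts.get(word)
    if not nxt:
        break
    word = random.choices(list(nxt), weights=list(nxt.values()))[0]
    output.append(word)

print(" ".join(output))
# Whether "drug" is followed by "reduced" or "cured" depends on merged
# counts, and nothing in `counts` can say which document put them there.
```

The point of the toy is structural, not scale-dependent: the same erasure of claim-to-source mappings occurs in a model with billions of parameters, only far less legibly.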

In the AI Story

[Hedcut illustration for Opaque Provenance]

The contrast with print is instructive because it reveals what cumulative knowledge actually requires. Eisenstein argued that the preservative powers of print — the redundancy created by distributed identical copies — made knowledge effectively indestructible for the first time in history. But preservation alone is not enough. The knowledge has to be traceable for cumulative building to occur. Each generation's ability to evaluate, correct, and extend the previous generation's work depends on access to the evidence on which claims are based. Evaluation requires access; correction requires the ability to identify errors and trace them to sources; extension requires confidence that foundations are sound.

Opaque provenance compromises all three operations. When the evidence is encoded in statistical patterns whose relationship to sources is not preserved, the user cannot evaluate the claim's foundation, cannot identify which sources contributed to an error, and cannot extend the work with confidence. The problem is not that the knowledge is wrong — often it is approximately correct — but that its epistemic status is unclear in a way that makes cumulative building difficult.

Segal's 'Deleuze error' from The Orange Pill is the paradigm case. Claude generated a passage attributing a concept to Gilles Deleuze that 'worked rhetorically,' 'sounded right,' and 'felt like insight' — but upon examination misrepresented Deleuze's actual position. The error was invisible because the surface quality was so polished: what Segal calls 'confident wrongness dressed in good prose.' The assertion could not be traced to a source because the model does not preserve source mappings. The polish of the output was precisely what made the error hard to catch.

The opacity is compounded by concentration. Training corpora are controlled by a small number of corporations. Decisions about what to include, exclude, and how to weight different sources are made by corporate teams whose deliberations are not subject to public scrutiny. The user who relies on an AI's output is relying on decisions she did not participate in, cannot inspect, and has no mechanism to contest. This is a departure from print in exactly the direction Eisenstein's framework would identify as consequential: print distributed control across thousands of independent printers; AI concentrates it in a handful of corporations.

Origin

The concept of opaque provenance is not original to the Eisenstein volume; it has emerged from multiple disciplines converging on the problem. Scholars of digital humanities and information science have discussed 'source attribution' and 'training data transparency' since the early 2020s. Bruno Latour's actor-network theory provides the conceptual vocabulary for thinking about how networks of evidence produce knowledge claims.

The specific formulation in the Eisenstein volume draws the connection between opaque provenance and Eisenstein's framework of preservation, fixity, and cumulative knowledge. The argument is that the AI transition compromises cumulative inquiry not because the underlying texts are at risk of loss — they are not — but because the model's relationship to those texts is structurally opaque.

Key Ideas

Preservation without traceability. AI preserves statistical patterns but loses the mappings between claims and sources.

Evaluation requires provenance. Cumulative knowledge-building depends on the ability to inspect the evidence behind assertions.

The Deleuze error as paradigm. Confident, fluent, plausible output can misrepresent sources in ways that are invisible precisely because the surface is polished.

Opacity compounds concentration. Centralized training data and proprietary models mean users cannot inspect, contest, or correct the knowledge base.

Inverts print's preservative powers. Where print made knowledge both preserved and traceable, AI makes knowledge abundantly preserved but opaquely sourced.

Debates & Critiques

Whether opaque provenance is an intrinsic property of large language models or an addressable technical problem is contested. Some researchers argue that retrieval-augmented generation, citation mechanisms, and explicit source tracking can in principle restore provenance to AI outputs. Others argue that the statistical nature of the compression makes genuine source attribution structurally impossible, because the model's 'knowledge' is not decomposable into source contributions. The institutional question — whether mechanisms for provenance will be built into AI systems or required by regulation — is more consequential than the technical one.
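The retrieval-augmented approach that the first camp points to can be sketched in a few lines. This is a minimal, hypothetical illustration (the source names, scoring function, and return format are all invented here): the key property is that every answer carries explicit identifiers for the documents it was grounded in, so provenance travels with the claim instead of being compressed away:

```python
# A toy document store; IDs and contents are illustrative only.
SOURCES = {
    "eisenstein-1979-ch2": "print created preservation through distributed identical copies",
    "forum-2023-17": "print is obsolete because models remember everything",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank source IDs by naive word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(
        SOURCES,
        key=lambda sid: len(words & set(SOURCES[sid].split())),
        reverse=True,
    )
    return scored[:k]

def answer_with_citations(query: str) -> dict:
    cited = retrieve(query)
    # A real system would pass the retrieved passages to a generator;
    # here we simply return them, with provenance the reader can inspect.
    return {"answer": " ".join(SOURCES[s] for s in cited), "sources": cited}

result = answer_with_citations("how did print achieve preservation")
print(result["sources"])  # the claim arrives with its source IDs attached
```

Note what the sketch does and does not show: it restores provenance for the retrieved passages, which is the first camp's claim, but it says nothing about the model's parametric knowledge, which is exactly where the second camp locates the structural impossibility.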

Appears in the Orange Pill Cycle

Further reading

  1. Elizabeth Eisenstein, The Printing Press as an Agent of Change, vol. 1, ch. 2 (Cambridge University Press, 1979)
  2. Emily M. Bender et al., 'On the Dangers of Stochastic Parrots,' Proceedings of FAccT '21 (2021)
  3. Timnit Gebru et al., 'Datasheets for Datasets,' Communications of the ACM 64, no. 12 (2021)
  4. Shannon Mattern, A City Is Not a Computer (Princeton University Press, 2021)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.