The corpus on which a large language model is trained is not, strictly speaking, a dataset. It is the sedimented output of several thousand years of human imitative activity, compressed into a form that a machine can process. Every text in the corpus is itself a product of imitation — the scientific paper imitating disciplinary conventions, the novel imitating genre traditions, the email imitating professional communication norms. Each text was produced by a mind that had received patterns from prior texts and reproduced them with modifications reflecting that mind's specific position in the network. The corpus as a whole is the geological column — the accumulated record of billions of imitative acts, each one a layer of sediment deposited by a specific moment in the continuous movement of beliefs, desires, and cultural forms through the human network. Tarde would have recognized the corpus for what it is: a fossil record of the imitative flow.
There is a parallel reading that begins from the material substrate this fossil record requires. The training corpus is not merely a passive geological column awaiting scientific examination — it is the product of an active extraction industry. Every text in that corpus was harvested from somewhere, scraped from someone's blog, pulled from forums where people thought they were speaking to each other, not to a future machine. The fossil metaphor naturalizes what is actually a political economy of appropriation. When we speak of "sedimented imitative acts," we elide the fact that these acts were performed by specific people who received no compensation for their inclusion in the training set, who were never asked permission, who may actively object to their words being processed into statistical patterns that will compete with their ongoing labor.
The lived experience of writers, artists, and knowledge workers is not of participating in some grand imitative flow but of watching their expertise get extracted, compressed, and deployed to undercut their market position. The photographer whose style gets absorbed into the model doesn't experience this as "fossilization" — they experience it as theft of technique developed over decades. The technical writer whose documentation patterns get encoded doesn't see geological sedimentation — they see their specialized knowledge commodified without consent. What Tarde's framework misses, viewed from this angle, is that the corpus is not found but taken. The patterns it contains were not deposited by natural processes but extracted through specific technical and legal mechanisms that favor platforms over creators. The fossil record metaphor makes this extraction seem inevitable, even poetic, when it is actually a series of deliberate choices about property, consent, and value distribution.
The fossil metaphor is diagnostic. A fossil preserves the surface structure of a living organism but none of its metabolism, none of the continuous biochemical exchange through which the organism participated in its ecosystem. Similarly, the training corpus preserves the surface structure of human imitative acts (the words, the sentences, the logical structures) but none of the biographical specificity through which those acts participated in the living imitative flow. The fossil is the shape without the life. The model trained on the fossil can reproduce shapes but cannot participate in the living flow except through the biographical specificity that humans introduce downstream.
This reframing has consequences for how the imitative process operates in AI-collaborative work. The builder who receives the model's output receives patterns processed from the fossil record, not patterns flowing from the living current. The builder's biographical specificity — her position in the living flow, her participation in ongoing relationships, her stake in specific futures — is what reconnects the fossilized patterns to the current. Without this reconnection, AI-generated output floats in a kind of temporal limbo: competent, fluent, but disconnected from the living concerns that make text meaningful to readers who are themselves participating in the living flow.
The opacity of provenance that characterizes AI output is, in this framework, the structural consequence of fossil compression. A human author's prose carries its imitative lineage in traceable form: you can follow the influences, identify the sources, reconstruct the chain of modification. AI output compresses the entire corpus into statistical regularities that render provenance untraceable. The output sounds like many sources and like no source because it is generated from regularities pooled across all of them, something closer to a statistical average than to a quotation of any one. The loss of traceable provenance is not a technical problem to be solved by better citation. It is a structural feature of statistical aggregation applied to the fossilized record of imitative activity.
The framework extends Tarde's general observation that the present always rests on the sedimented weight of past imitative acts. Tarde noted that every institution, every linguistic form, every technological practice carries the traces of its lineage — the modifications introduced at each link of the imitative chain that produced it. The model's training corpus is simply this observation made mechanically operational: the entire fossil record compressed into a processable form.
The corpus is a fossil record, not a dataset. It preserves the surface structure of prior imitative acts but not their living metabolism — their participation in ongoing relationships, stakes, and flows.
Statistical aggregation loses biographical specificity. When the model processes the corpus's regularities, it smooths away the specific modifications that gave individual texts their distinctive character.
Provenance becomes opaque. The output sounds like many sources and like no source because it is, structurally, an aggregate of all sources, weighted by how often their patterns recur.
The builder reconnects fossil to flow. Human biographical specificity — participation in the living current — is what transforms fossilized patterns into contributions to the living imitative flow.
The corpus is always partial. What gets preserved as fossil depends on what gets written down, digitized, and included in training — making the corpus a specific selection, not a neutral representation of human thought.
The framework raises difficult questions about the politics of corpus composition. If the corpus is the fossil record of the imitative flow, whose imitative acts get fossilized? Critics working within frameworks of epistemic decolonization have argued that training corpora systematically underrepresent non-Western, non-English, non-dominant imitative traditions, producing models that reflect a selected slice of human imitative activity rather than its totality. The Tardean framework accommodates this critique: what feeds the corpus is not the totality of the imitative flow but the portion that dominant institutions have recorded, preserved, and included. The fossil record bears the marks of the selection process, not just of what was selected.
The fossil record metaphor captures something essential about how AI processes human knowledge: the compression of living practices into statistical patterns genuinely does lose the biographical specificity that makes human expression meaningful. When we ask "what is the training corpus ontologically?" Edo's framework provides the right answer (90/10, weighted toward the fossil reading): it is indeed a sedimented layer of imitative acts, not simply a dataset. The geological metaphor illuminates how temporal depth gets compressed into processable form.
Yet when we shift to asking "how did this corpus come to exist?" the extraction reading becomes primary (80/20). The corpus didn't accumulate naturally like geological strata; it was actively harvested through web scraping, often without creators' knowledge or consent. The material reality is that specific companies made specific choices about what to take and from whom. The fossil preservation of imitative patterns and the industrial extraction of creative labor are both true at once. The synthetic frame might be: the corpus is fossilized expression obtained through extraction.
The question of reconnection to living flow reveals the most balanced tension (50/50). Edo is right that human biographical specificity reanimates fossilized patterns — but the contrarian view correctly identifies that this reanimation often happens through economic displacement. The technical writer who must now compete with AI trained on their documentation does participate in reconnecting fossil to flow, but under conditions of precarity they didn't choose. Perhaps the deeper insight is that the training corpus represents both cultural transmission and economic disruption — it preserves human knowledge while potentially severing the conditions for its continued creation. The fossil record metaphor works precisely because fossils form through burial.