Genealogy of Data — Orange Pill Wiki
CONCEPT

Genealogy of Data

Daston's historical reconstruction of the concept of 'data' — revealing that what we now treat as things found in the world was originally things given as starting points for argument, and that the transformation required the construction of elaborate institutional systems.

The word 'data' comes from the Latin dare, meaning 'to give.' When it entered scholarly usage in the seventeenth century, data referred not to things found but to things given — the premises of an argument, the starting points that were granted rather than discovered. A mathematician might stipulate: given these axioms, derive these conclusions. 'Given' marked bestowal, not observation. The transformation of data from philosophical premises into empirical observations — from things given to things found — occurred over centuries and required the construction of elaborate systems of collection, classification, and standardization. Daston's genealogical work shows that the contemporary concept of data as raw material gathered from the world is the endpoint of this historical process, not its starting point, and that it carries the institutional assumptions of the systems that produced it.

In the AI Story


The genealogy matters because it reveals something systematically concealed in contemporary usage: data is never simply found. What counts as data, what is collected, how it is categorized, in what formats it is preserved — all of these reflect historically specific decisions by institutions with particular purposes. The census categories that seemed natural in one decade become visible as contingent constructions in the next. The medical records that recorded certain diagnoses while omitting others reflect the taxonomies of a specific moment in the history of medicine. The digitization priorities of one decade determine what is machine-readable in the next.

Daston's research on the history of statistics, particularly in Classical Probability in the Enlightenment (1988), documented the institutional work that transformed scattered observations into what we now call data. Bureaus of statistics had to be established. Categories had to be negotiated. Standardized forms had to be designed and distributed. The entire infrastructure of modern quantitative knowledge — the tables, the ratios, the cross-tabulations — depends on prior institutional decisions about what to count and how, and these decisions are not visible in the resulting numbers.

The AI training corpus inherits all of these contingencies, amplified by an additional layer. Training data is shaped by which institutions digitized their holdings, which languages dominated the internet at the time of collection, which formats were machine-readable, which economic structures determined what was worth preserving in digital form. When an AI generates a response, it is not drawing on the sum of human knowledge. It is drawing on a historically specific, institutionally mediated, materially constrained subset of that knowledge, formatted for machine consumption. The subset's biases are not random noise; they are patterned reflections of the decisions that produced the corpus.

Understanding AI errors as genealogically produced — as products of a specific history rather than random failures of a general system — changes the calibration challenge. The question is no longer simply 'when is AI reliable?' but 'what specific historical, institutional, and material conditions produced the data on which this system was trained, and what systematic biases do those conditions introduce?' Answering that question requires not only technical competence but historical and institutional literacy — a combination of competencies that no existing educational program systematically provides.

Origin

The genealogical approach to data is most fully developed in Daston's contributions to the 2017 volume Science in the Archives (edited with Elaine Leong) and in her earlier Classical Probability in the Enlightenment (1988). The method draws on Foucault's genealogical approach while applying it with characteristic precision to the specific history of scientific concepts rather than to broader discursive formations.

The approach has proven particularly productive in analyzing the social and political conditions that produce quantitative knowledge. Theodore Porter's Trust in Numbers (1995), Mary Poovey's A History of the Modern Fact (1998), and Lorraine Daston's own work constitute a linked tradition that has transformed how historians and social scientists understand the origins of the quantitative frameworks within which contemporary knowledge production operates.

Key Ideas

Data originally meant 'things given.' The seventeenth-century usage, carried over from the Latin, referred to premises stipulated at the start of an argument, not to observations gathered from the world.

The transformation to 'things found' required institutions. Bureaus of statistics, standardized forms, agreed-upon categories, and distributed collection infrastructure — none of which appear in the resulting numbers.

Category choices are invisible in outputs. Once data exists, the contingent decisions that produced it become difficult to see, and the resulting numbers appear as natural features of the world.

AI training data inherits institutional biases. What was digitized, in what language, in what format, reflects specific institutional priorities that shape what the system can and cannot know.

Understanding errors requires genealogy. AI's systematic biases are products of specific historical conditions, not random failures of a general system — and correcting them requires understanding the conditions that produced them.

Debates & Critiques

A methodological debate concerns whether genealogical analysis of AI training data can be conducted with adequate rigor, given the scale and opacity of contemporary corpora. Critics argue that the specific decisions that shaped GPT's or Claude's training data are in many cases not publicly known, making precise genealogical work impossible. Defenders respond that genealogical analysis does not require complete documentation of every decision; it requires recognizing the kinds of decisions that shape any training corpus and the kinds of biases those decisions typically introduce. The productive approach is to combine available documentation with structural reasoning about the institutional conditions of training data production.

Further reading

  1. Lorraine Daston and Elaine Leong (eds.), Science in the Archives (University of Chicago Press, 2017)
  2. Lorraine Daston, Classical Probability in the Enlightenment (Princeton University Press, 1988)
  3. Theodore Porter, Trust in Numbers (Princeton University Press, 1995)
  4. Mary Poovey, A History of the Modern Fact (University of Chicago Press, 1998)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.