CONCEPT

Datafication

Viktor Mayer-Schönberger and Kenneth Cukier's term for the transformation of a phenomenon into a quantified, computable format—distinct from mere digitisation—whereby aspects of the world that previously existed only as lived experience become data substrates for analysis, correlation, and machine learning.

Datafication is not the same as digitisation. To digitise a book is to convert its existing text into bits; the text existed before, and the bits merely carry it into a new medium. To datafy a phenomenon is to render an aspect of the world, one that was previously experienced but not recorded in computable form, into a quantified signal that did not exist before and can now be tabulated, correlated, and acted upon at scale. Location had always existed; the smartphone datafied it, converting every person's movement through space into a continuous stream of coordinates. Friendship had always existed; the social network datafied it, converting the soft, ambiguous bonds between people into a structured graph. The concept, coined by Viktor Mayer-Schönberger and Kenneth Cukier in Big Data (2013), identifies the hidden epistemology of large language models and every other AI system: these systems do not learn about the world. They learn about the datafied trace of the world, a projection shaped by whoever designed the measurement, and that projection substitutes for the reality in every downstream decision. Whatever was never datafied is invisible to them. Whatever was datafied badly (with bias, with missing dimensions, with the wrong categories) is learned faithfully as though it were truth. The frontier of AI capability is, quite literally, the frontier of datafication: the leading edge of the process by which aspects of human life that previously resisted quantification are captured and fed to the machines.

In the [YOU] on AI Field Guide

The cycle's account of the AI amplifier gains its sharpest limitation from the concept of datafication. The amplifier carries the signal further; but which signal? Only the signal that has already been datafied—converted into the form the machine can process. Everything else is absent. When [YOU] on AI argues that the machine amplifies whatever you bring to it, datafication supplies the correction: the machine amplifies whatever about you has already been rendered as data. The rest of you—the embodied, felt, contextual, morally weighted dimensions of your experience—is outside the model's field of vision, not because the model is inadequate but because those dimensions were never datafied in the first place.

The professional identity disruption the cycle traces across knowledge-work communities is, in datafication's terms, a story about which dimensions of a practitioner's competence happen to have been datafied and which have not. The outputs—code, text, designs, briefs—were datafied, and the machine learned to produce them. The process by which those outputs were produced—the embodied tacit knowledge, the feel for when something is wrong before you can explain why, the judgment built through years of formative friction—was never datafied. The machine cannot replicate it because it was never in the training data. The elegists the cycle describes, who mourn what is being lost precisely because they lived it, are describing the gap between what was datafied and what was real.

Datafication also explains why the bias problems of AI systems are structural rather than incidental. A hiring model trained on a company's historical decisions does not learn who is a good employee. It learns who the company hired before, with all the bias and blindness that record encodes—encoded not by any single actor's malice but by the measurement choices that were invisible at the time of collection. Every act of datafication is an act of selection, and the dimensions left out of the measurement do not return in the model.
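The hiring example can be made concrete with a toy sketch. Everything here is an assumption for illustration: the two groups, the counts, and the "model", which is nothing more than a frequency table of past hire rates. It is not a description of any real hiring system, only a small instance of learning from a record of decisions rather than from the quality those decisions were meant to track.

```python
from collections import defaultdict


def make_records(group, skilled, n_hired, n_total):
    """Build n_total hypothetical past applicants, the first n_hired of them hired."""
    return [(group, skilled, i < n_hired) for i in range(n_total)]


# Hypothetical history: skill is spread identically across both groups,
# but past decisions favoured group "A" at every skill level.
history = (
    make_records("A", True, 45, 50)
    + make_records("A", False, 15, 50)
    + make_records("B", True, 20, 50)
    + make_records("B", False, 5, 50)
)


def fit_hire_rates(records):
    """'Train' by estimating P(hired | group, skilled) from the record."""
    counts = defaultdict(lambda: [0, 0])  # (group, skilled) -> [hired, total]
    for group, skilled, hired in records:
        counts[(group, skilled)][0] += int(hired)
        counts[(group, skilled)][1] += 1
    return {key: hired / total for key, (hired, total) in counts.items()}


rates = fit_hire_rates(history)

for (group, skilled), rate in sorted(rates.items()):
    print(f"group={group} skilled={skilled}: predicted hire rate {rate:.2f}")
```

In this invented record a skilled applicant from group B is scored at 0.40 and an equally skilled applicant from group A at 0.90. The table has learned who was hired, faithfully, and nothing in the data lets it learn anything else.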

Origin

The concept was introduced by Viktor Mayer-Schönberger and Kenneth Cukier in Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013), alongside two other foundational shifts: the move from sampling to n-equals-all, and the privileging of correlation over causation. Of the three, datafication is the most fundamental, because it names the precondition for the other two. N-equals-all presupposes that there is an 'all' to capture; the expansion of what counts as 'all' is the expansion of datafication. Correlation without causation presupposes that the correlated variables have already been datafied; the range of possible correlations is bounded by the range of what has been measured.

The timing of the concept's introduction (2013, a decade before the large language model became a mass phenomenon) is part of what makes Mayer-Schönberger and Cukier's work so useful for understanding the present. They were not describing AI. They were describing the data logic that would eventually make AI possible, and the description turns out to be more precise than most writing produced after the fact. The large language model is the ultimate expression of n-equals-all datafication: trained, as nearly as its builders can manage, on all human text ever produced, with the aspiration toward total capture that the datafication concept named before the ambition could be pursued at scale.

Key Ideas

Datafication as selective projection. To convert a phenomenon into data is to make a series of choices about what to measure and what to ignore. Friendship datafied as a graph of connections loses the strength, texture, and meaning of those connections—everything that makes friendship friendship rather than adjacency. Employee worth datafied through productivity metrics loses the dimensions of contribution that resist measurement and, over time, loses them from the institution's understanding of value itself. The projection substitutes for the reality, and the substitution is invisible in the resulting dataset.
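The friendship example can be sketched as a data structure. The `Relationship` fields and the `datafy_as_graph` function below are hypothetical names chosen for illustration, and the richer record is of course itself already a datafication; the point is only that projecting onto a graph of connections is many-to-one, so very different social worlds can yield the same dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A hypothetical, still very thin, description of a bond between two people."""
    a: str
    b: str
    years_known: int
    would_call_in_crisis: bool


def datafy_as_graph(relationships):
    """Project relationships onto an undirected edge set, dropping every other field."""
    return {frozenset((r.a, r.b)) for r in relationships}


# Two invented worlds: one of long, close friendships, one of bare acquaintance.
close_world = [
    Relationship("ana", "ben", years_known=20, would_call_in_crisis=True),
    Relationship("ana", "cam", years_known=15, would_call_in_crisis=True),
]
thin_world = [
    Relationship("ana", "ben", years_known=0, would_call_in_crisis=False),
    Relationship("ana", "cam", years_known=0, would_call_in_crisis=False),
]

# After projection the two worlds are the same graph.
same_graph = datafy_as_graph(close_world) == datafy_as_graph(thin_world)
print(same_graph)
```

Anything trained only on the edge set has no way to tell the two worlds apart, which is the sense in which the substitution is invisible in the resulting dataset.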

The datafication frontier is the AI frontier. Every new capability of AI systems rests on some domain of human life having been datafied at sufficient scale. Image generation became possible when visual experience was datafied through billions of captioned photographs. Conversational AI became possible when human dialogue was datafied through the vast accumulation of text. The question 'What will the next generation of models be trained on?' is always the same question: 'What remains to be datafied?' The answer is almost always whatever is still lived but not yet recorded.

Datafication and permanent memory. The incentive logic of datafication pushes toward total capture and permanent retention: why delete anything when any fragment might prove valuable to some future correlation? This logic runs directly into the argument Mayer-Schönberger made four years before Big Data, in Delete: that forgetting was never a flaw to be engineered away but a feature that made second chances, forgiveness, and the release of the past possible. The solutionist instinct reaches for a technical fix—machine unlearning algorithms, differential privacy—while the structural insight is that the problem is datafication's default logic of accumulation without end.

The datafied trace is not the world. The deepest implication of datafication is epistemological. AI systems are trained on datafied traces, not on the world. They reproduce the selection biases of the measurement process, treat the frozen record as a permanent template, and are constitutively blind to whatever was never measured. A hiring model trained on historical decisions learns the biases of those decisions as faithfully as it learns anything else. A medical model trained on clinical records from populations that were underserved by the healthcare system learns underservice as a baseline. The map is not the territory; the datafied trace is not the world; and a civilisation that forgets the difference has handed the production of its future to the biases of its past.

Debates & Critiques

The concept of datafication sits in tension with the optimism of the AI industry's democratisation narrative. If datafication is always selective projection, then the models built on datafied traces always carry the selection biases of the measurement process, and the more the model is used, the more those biases become the ambient logic of the decisions it shapes. Critics of this framing argue that scale and diversity of data can wash out individual biases, producing a more representative picture than any single measurement could; Mayer-Schönberger's counter is that scale amplifies systematic biases rather than correcting them, because systematic biases are consistent across the data and therefore treated by the model as signal rather than noise.

A second debate concerns the right to be forgotten and the technical problem of machine unlearning: if personal data is dissolved into model weights rather than stored as discrete records, the legal framework built around the right to erasure becomes technically incoherent, and the aspiration encoded in European data protection law, that individuals retain some control over their datafied traces, requires not just legal rules but architectural choices about how systems are built. Whether those choices will be made in the direction of forgetting, or whether the logic of datafication will continue to push toward permanent capture, is one of the defining political questions of the AI age.
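The disagreement over whether scale washes out bias turns on a statistical distinction that a toy simulation can show. The sketch below is an illustration under stated assumptions, not an argument taken from the source: every measurement is modelled as a true value plus a fixed systematic offset plus independent random noise, and all the parameter values are invented.

```python
import random
from statistics import fmean


def mean_of_measurements(n, true_value=10.0, systematic_bias=2.0,
                         noise_sd=5.0, seed=0):
    """Average n simulated measurements of one quantity.

    Each measurement = true_value + systematic_bias + Gaussian noise.
    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    return fmean(
        true_value + systematic_bias + rng.gauss(0.0, noise_sd)
        for _ in range(n)
    )


for n in (10, 1_000, 100_000):
    estimate = mean_of_measurements(n)
    print(f"n={n:>7}: estimate={estimate:.3f}, "
          f"error vs true value={estimate - 10.0:+.3f}")
```

As n grows, the random noise averages away and the estimate settles near 12, not 10. More data makes the estimate more stable around the biased value, which is the sense in which a consistent bias is indistinguishable from signal.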
