CONCEPT

Raw Data Is an Oxymoron

Gitelman's signature insight that data is never raw — always shaped by the instruments that collect it, the institutions that commission it, and the categories that determine what counts as data in the first place.

The word data is the plural of the Latin datum, "something given," from dare, to give. The etymology implies that data is given by the world, found rather than made, a natural resource waiting to be harvested. Gitelman argued that this etymology is a lie embedded in the language. Data is not given; it is taken: extracted from the world through specific instruments, according to specific protocols, for specific purposes, by specific institutions with specific interests. What looks like a neutral description of the world is always already an interpretation, and the interpretive framework is usually invisible because it is embedded in the instruments and institutions that produce the data rather than in the data itself. The argument has acquired a second life in the age of AI, where the phrase "AI-generated content" performs the same sleight of hand as "raw data": both suggest a product that arrives without mediation.

In The You On AI Field Guide

Gitelman edited the 2013 collection "Raw Data" Is an Oxymoron with contributions from historians of science, media theorists, and critical data studies scholars. The volume's central move was to apply the tools of media archaeology to the object the technology industry had treated as pre-cultural and pre-institutional: data itself. The result was a foundational text in critical data studies.

Applied to AI, the argument has teeth. AI-generated content is not generated from nothing. It is generated from training data, which was generated from documents, which were generated within institutional contexts governed by specific protocols of formatting, selection, and preservation. The training corpus is a specific collection shaped by what was digitized, what was publicly available, what was written in English, what survived the filters of platform terms of service, copyright law, and web architecture.

The output inherits these biases not as bugs to be fixed but as structural features of the data from which it was produced. When Claude draws a connection between Csikszentmihalyi's flow psychology and Han's critique of the achievement society, the connection is drawn not from the entirety of human knowledge but from a specific subset — the subset that was written in certain languages, published through certain channels, digitized by certain institutions, and made available under certain licensing regimes.

The political consequence of the oxymoron is that it refuses the move by which AI companies present their training corpora as everything — the entire history of human thought, the sum of human knowledge. The corpora are specific. They are shaped. They are cooked. And the output carries the marks of its cooking into every domain it touches.

Origin

The phrase comes from Geoffrey Bowker, who wrote in Memory Practices in the Sciences (2005) that raw data is both an oxymoron and a bad idea. Gitelman adopted it as the provocative title of her 2013 edited volume, which drew together contributions from Daniel Rosenberg, Bowker, and others who had been working independently on the constructed character of data. The title became one of the most cited phrases in contemporary data studies.

Key Ideas

The etymology is a lie. The word data implies givenness, but data is always taken — extracted through specific instruments for specific purposes.

Instruments carry protocols. The data-collection apparatus embeds institutional assumptions that shape what counts as data and what is excluded.

Cooking is invisible. Presenting data as neutral measurement conceals the interpretive framework that produced it.

AI inherits the cooking. Training data is cooked data; AI outputs are cooked outputs; the fluency of the output does not indicate rawness of the input.

Political stakes. The oxymoron denaturalizes claims that data describes the world as such, opening the question of whose interests the data-collection infrastructure serves.
