Discriminating data names the dual operation by which algorithmic systems both rely on discrimination (pattern-matching, categorization, correlation) and reproduce discrimination (racial, economic, geographic sorting effects). Chun traces the mathematical apparatus of machine learning—correlation coefficients, regression analysis, clustering algorithms—to their historical origins in Francis Galton's eugenic project: the scientific management of human heredity through statistical sorting. Galton developed correlation explicitly as a tool for predicting which populations should reproduce and which should not. The mathematics he invented for eugenic purposes became the foundation of modern statistics, and those methods—carrying their original design assumptions about the sortability of human populations—now power the pattern-matching engines of contemporary AI. The training data reflects existing distributions of power, recognition, and opportunity. The model learns these patterns. The outputs reproduce them. Not through malicious intent but through the statistical mechanics of pattern-matching against a biased corpus.
Chun's Discriminating Data: Correlation, Neighborhoods, and the New Politics of Recognition (2021) documents that big data is, as she writes, "arguably the bastard child of psychoanalysis and eugenics." The methods were not neutral scientific tools later misapplied to social questions; they were designed from inception as instruments of social sorting. Pearson's correlation coefficient. Fisher's regression. The entire mathematical infrastructure of machine learning carries this history not as distant origin but as embedded design logic: populations can be meaningfully sorted into categories, individual behavior can be predicted from group membership, correlation between observable features and outcomes does constitute a basis for action. These are not empirical discoveries. They are methodological assumptions, built into the mathematics, inherited by every system that uses the mathematics.
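For concreteness, the quantity at issue is the sample correlation coefficient, still computed today exactly as Pearson formalized it; the notation below is the standard textbook statement, not Chun's:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

A value of \(r\) near \(\pm 1\) licenses precisely the move Chun flags: treating one observable as a stand-in for, and predictor of, another.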
Applied to Segal's democratization argument—that AI tools lower the floor of who gets to build—Chun's analysis reveals a structural asymmetry. The developer in Lagos has access to the same model as the engineer at Google, true. But the model was trained on a corpus reflecting Google engineers' practices, assumptions, and problem-framings far more than Lagos developers' contexts, constraints, and opportunities. The model does not refuse to serve the Lagos developer; it serves by generating outputs optimized for a different population's patterns. The amplifier amplifies, but it amplifies the signal it was trained on—a particular signal from a particular demographic operating under particular conditions. The developer outside that demographic receives outputs shaped by someone else's world.
Pattern discrimination operates not through explicit exclusion but through statistical mechanics. Machine learning systems can achieve racial discrimination without processing race as a variable—ZIP code, browsing history, social network composition correlate with race closely enough that optimizing on these proxies reproduces racial sorting without ever "seeing" race. Chun's phrase: the model "embeds whiteness as a default." Not through conspiracy but through the homophily principle—like attracts like, patterns learned from the training data favor outputs resembling the training data, and the training data reflects the world as it already is, structured by centuries of accumulated discrimination.
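A minimal synthetic sketch can make the proxy mechanism concrete. The variable names, correlations, and numbers below are illustrative assumptions, not drawn from Chun or from any real system; the point is only that a classifier trained without the protected attribute still reproduces the sorting encoded in its historical labels, because a correlated proxy carries the information.

```python
# Sketch of proxy discrimination: the model never receives the protected
# attribute, but a correlated proxy lets it reproduce historical sorting.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Protected attribute (never given to the model): 0 = group A, 1 = group B.
group = rng.integers(0, 2, size=n)

# Proxy feature: a coarse "ZIP code" flag that matches group 85% of the time.
zip_code = np.where(rng.random(n) < 0.85, group, 1 - group)

# A second covariate that also correlates with group membership.
income = rng.normal(50 + 10 * (group == 0), 15, size=n)

# Historical labels encode past discrimination: group A was approved more often.
p_approve = np.clip(0.7 - 0.3 * group + 0.002 * (income - 50), 0, 1)
y = rng.random(n) < p_approve

# Train on proxies only -- the protected attribute is not a feature.
X = np.column_stack([zip_code, income])
model = LogisticRegression().fit(X, y)
pred = model.predict(X)

for g in (0, 1):
    print(f"group {g}: predicted approval rate = {pred[group == g].mean():.2f}")
# Typical result: the approval gap between groups persists even though the
# model never "saw" the protected attribute directly.
```

ZIP code is only the canonical example; any feature correlated strongly enough with the protected attribute would carry the same signal.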
Chun's genealogical method—tracing contemporary technical practices to their ideological origins—has roots in Foucault's Discipline and Punish and the broader critical-theory tradition. But her specific application to statistics is original and consequential. By demonstrating that correlation itself—the most basic operation of machine learning—was invented by Galton to serve eugenic sorting, she establishes that AI's bias problem is not a bug to be patched but a feature of the mathematical methods themselves. The methods remember what the practitioners have forgotten.
The book synthesizes critical race theory (Patricia Hill Collins, Kimberlé Crenshaw on intersectionality), science and technology studies (Ruha Benjamin on the "New Jim Code," Safiya Noble on algorithmic oppression), and the history of statistics (Theodore Porter, Ian Hacking on quantification's politics). Chun's contribution is to weave these threads into a unified argument: that AI's discriminatory effects are not failures of implementation but structural features of the statistical paradigm, and that addressing them requires not better data collection but recognition that the methods themselves embed assumptions about human sortability that no diversity initiative can eliminate.
Eugenic genealogy of correlation. The mathematical methods underlying machine learning were designed by Francis Galton explicitly for eugenic sorting—predicting and controlling who reproduces with whom—and carry those design assumptions into contemporary AI systems.
The mathematics remembers. Even when practitioners have abandoned eugenic ideology, the statistical methods preserve the original design logic—that populations can be sorted, individuals can be predicted from group membership, correlation justifies action.
Pattern discrimination without explicit categories. Machine learning systems achieve racial, economic, and geographic discrimination without processing race, class, or location directly—proxy variables correlate closely enough to reproduce sorting effects.
Homophily as mechanism. Like attracts like; models trained on patterns generate outputs conforming to those patterns; outputs that resemble the training data are statistically likely, while divergent outputs are systematically suppressed through probability rather than censorship (see the sketch after this list).
Democratization is partial. AI tools expand access (who can use the tool) while reproducing inequality through output bias (whose contexts, assumptions, and needs the tool was optimized to serve)—the amplifier is not neutral; it amplifies the training distribution.
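The suppression-by-probability mechanism named above can be shown in a few lines. The corpus, labels, and counts below are invented for the example, not taken from Chun or any real model; the sketch only demonstrates that nothing needs to be filtered for rare patterns to nearly vanish from generated output.

```python
# Sketch of suppression-by-probability: nothing is censored, yet patterns that
# are rare in the training corpus almost never appear in sampled output.
import random
from collections import Counter

random.seed(0)

# "Training data": a heavily skewed distribution of conventions.
training_corpus = ["convention_A"] * 950 + ["convention_B"] * 45 + ["convention_C"] * 5

# "Learning" here is just estimating the empirical frequencies.
counts = Counter(training_corpus)
patterns = list(counts)
weights = [counts[p] for p in patterns]

# "Generation": sample outputs in proportion to the learned frequencies.
outputs = random.choices(patterns, weights=weights, k=1000)

print(Counter(outputs))
# Typical result: convention_C shows up a handful of times in a thousand
# samples -- suppressed not by any rule, but by its learned probability.
```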