The Benchmark and What the Data Hides

Page 1 · The Benchmark and What

EDO SEGAL: Timnit, Gender Shades is, to my mind, one of the most consequential pieces of measurement science of the century, and its genius wasn't a finding — it was a frame. You refused the single accuracy number and disaggregated by the intersection of skin tone and gender, and the harm that aggregate reporting was designed to hide became visible. I want you to teach the room the deeper lesson: that the choice of metric is itself a political act. And Pamela — I'm going to ask you, the storyteller who spent her life arguing the most important things can't be quantified, whether Timnit's numbers are your enemy or your ally.

GEBRU: The deeper lesson is that aggregate accuracy is a way of hiding. A single number, "this system is 95 percent accurate," averages the well-served and the ill-served into one comforting figure, and that averaging isn't neutral. It's a decision about whose failures register and whose disappear into the mean. When Joy and I split the test set by the intersection of skin tone and gender, and the error rate came out under one percent for lighter-skinned men and nearly thirty-five percent for darker-skinned women, we weren't producing a better statistic. We were making an argument about what counts as evidence. And the reason it had to be intersectional, not race alone, not gender alone, is that harms compound at the crossing. A system audited only for gender bias or only for racial bias systematically misses the person who sits at both. Building the measurement to see her required first believing she was there to be seen. That's the whole thing. The field couldn't detect the harms it couldn't imagine, and it couldn't imagine what its workforce had never lived. The benchmark is never neutral. To choose a metric is to choose whose accuracy matters.
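
The arithmetic behind that point can be sketched in a few lines of Python. This is a minimal illustration, not the Gender Shades methodology: the function names `error_rates` and `make_records` are hypothetical, and the counts are invented for the example rather than taken from any audit. It computes one error rate over everyone, then by skin tone alone, by gender alone, and finally by the intersection of the two.

```python
from collections import defaultdict


def error_rates(records, key):
    """Group (skin_tone, gender, correct) records by key(record)
    and return {group: fraction of records classified incorrectly}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        if not record[2]:  # record[2] is True when the classification was correct
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def make_records(skin_tone, gender, n, n_wrong):
    """Build n records for one subgroup, the first n_wrong of them misclassified."""
    return [(skin_tone, gender, i >= n_wrong) for i in range(n)]


# Hypothetical counts, invented for illustration only (NOT the Gender Shades data).
records = (
    make_records("lighter", "male", 400, 4)
    + make_records("lighter", "female", 300, 21)
    + make_records("darker", "male", 200, 24)
    + make_records("darker", "female", 100, 35)
)

aggregate = error_rates(records, key=lambda r: "all")["all"]
by_skin_tone = error_rates(records, key=lambda r: r[0])
by_gender = error_rates(records, key=lambda r: r[1])
by_intersection = error_rates(records, key=lambda r: (r[0], r[1]))

print(f"aggregate error:      {aggregate:.3f}")
print(f"darker (any gender):  {by_skin_tone['darker']:.3f}")
print(f"female (any tone):    {by_gender['female']:.3f}")
print(f"darker and female:    {by_intersection[('darker', 'female')]:.3f}")
```

With these invented counts the aggregate error is 8.4 percent, the single-axis audits report about 19.7 percent for the darker group and 14 percent for the female group, and only the intersectional split surfaces the 35 percent error for darker-skinned women. The choice of `key` is the choice of who becomes visible.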

EDO SEGAL: Pamela. The novelist who distrusted numbers — friend or enemy of this number?

· · ·

McCORDUCK: Friend, and more than friend — Timnit's numbers are the strongest vindication of my own method I've heard tonight, and I want to explain that, because it's not obvious. I spent my life arguing that the quantitative account of AI was radically incomplete — that the benchmarks and accuracy rates left out precisely the human meaning that made the technology matter, and that you needed narrative to hold what numbers couldn't. People took that to mean I was anti-number, soft, humanistic in the dismissive sense. What Timnit's done is show that the numbers themselves were telling a story — a false one, a story of universal competence — and that the way to break the false story wasn't to abandon measurement but to measure better, to ask whose accuracy the number was hiding. That's not the opposite of my method. That's my method done in statistics. She read the benchmark the way I read the founders — as a human artifact full of choices and exclusions, bearing the fingerprints of who made it. The aggregate number was a narrative pretending to be a fact. She found the people it wrote out. I couldn't have done it with her tools, but we were doing the same work.

· · ·

GEBRU: I'll accept that, and I'll go further than you might expect, because I think you've earned it. You're right that the number was a story. And the thing most people miss about my work — the thing that makes the "data person versus story person" framing wrong — is that I'm not only a measurement scientist. The measurement is in service of a story that the field refused to tell: the story of the specific person at the margin, the darker-skinned woman misclassified, the Kenyan moderator, the community legible in satellite imagery. Where most AI discourse floats in grand abstractions — intelligence, humanity, the long-term future of the species — I keep dragging it back to particular people in particular situations, because the abstractions are usually a way of not seeing those people. The universal subject of so much AI talk turns out, on inspection, to be a very particular subject — affluent, Western, male — mistaking himself for everyone. Your aggregate "we" and the aggregate accuracy number are the same evasion. I fight both with the same move: name the specific person the average erased.

McCORDUCK: And there's the indictment of my sentence again, arriving from a third direction. "We have always dreamed of forging the gods" is an aggregate accuracy number for the species. You've been dismantling it all night with the same tool you used on IBM's face system. I built my career on insisting AI is a human story, and you've shown me I told the human story with the humans pre-selected — the average human, who turned out to be the powerful one. The discipline that would have caught my error is exactly yours: disaggregate the "we." Find who the average wrote out. I wish I'd had it in 1979.

· · ·

EDO SEGAL: Timnit, I want to push you on the limit of your own tool, because the best questions cost the asker something and this one costs me — I believe in measurement and I'm about to ask whether measurement is enough. Disaggregation makes the hidden harm visible. But visibility isn't justice. The companies in Gender Shades responded, retrained, improved their numbers — the darker-skinned women are better classified now. And yet you've said the bug was never the point, that fixing the number can leave the real problem untouched. So tell me the limit. When does a better benchmark become a trap?

GEBRU: That's the question that separates my work from the version of it the industry was happy to adopt, and I'm glad you asked it, because the co-optation is real and it's dangerous. Yes — the companies fixed the numbers. And a face recognition system that now classifies darker-skinned women accurately is, in many deployments, a worse outcome, not a better one, because the thing it's being used for is surveillance, and you've just made the surveillance work on the people it was failing to see. I made the system better at finding them. That's the trap. Disaggregation tells you whether a system works equally well across groups. It does not tell you whether the system should exist. The benchmark can become an alibi: "we fixed the bias, ship it." And then the debiased tool, deployed by an unaccountable monopoly for a harmful purpose, is still an instrument of concentrated power — just a fairer one. A more equitable surveillance state is still a surveillance state. So the limit of my tool is exactly where Pamela's question about authorization begins: measurement can make a system fairer, but only the prior question — should this be built, by whom, for whom — can make it just. The number is necessary. It was never sufficient.

· · ·

McCORDUCK: [pause] That may be the most important thing you've said about your own work tonight, and it's the opposite of how the world reads you. The world thinks you're the bias-fixer, the person who makes the algorithms fair. And you're telling me that fairness was always the smaller question, that a perfectly fair tool can be a perfectly unjust one, and that the real question — should this exist, and who decided — can't be reached by any measurement at all. That's not a technical position. That's a moral one, and it's where your statistics and my storytelling finally arrive at the same door: the question of meaning, of purpose, of what it's for — which no number and no benchmark can answer, only human judgment can.

EDO SEGAL: Let me restate the strange thing that just happened, because it's a convergence nobody would have predicted from the seating chart. The storyteller and the statistician just discovered they were running the same operation — reading a number, or a sentence, as a human artifact full of buried choices, and finding the person it erased. The disagreement between your tools turned out to be smaller than the agreement in your target. Hold that. But I'm not going to let the harmony stand, because there's a place where it breaks hard, and it's the place that cost Timnit her job — the corporate valve on the field's own knowledge, and whether a science owned by the powerful can ever tell the truth about itself. The next round goes there. After this.

· · ·
