
The cycle insists that [YOU] on AI take the machine seriously on its own terms: to understand what it actually does rather than what it appears to do. Intersectional disaggregation is one of the sharpest tools for doing this. The smooth performance of AI systems in aggregate is one of the primary mechanisms by which their costs are concealed—concealed not by anyone’s deliberate deception but by the structural choice of what to measure and how to aggregate it. To disaggregate is to refuse the comfortable number and demand the honest one, and the honest number almost always reveals a distribution of benefit and harm that the aggregate conceals.
The concept connects directly to the cycle’s concern with who holds the pen when the future is being written. A benchmark is not a neutral instrument; it encodes assumptions about which group’s performance matters enough to be separately measured, and a benchmark that does not measure at intersections is a benchmark organized by the priorities of whoever designed it. The design of the evaluation is the first site of politics, upstream of the technology itself, and Gebru’s most lasting contribution may be the demonstration that treating the evaluation as a site of contestation is not methodological obstructionism but basic scientific rigor.
The concept emerges from the Black feminist tradition of intersectionality, first articulated systematically by Kimberlé Crenshaw in her 1989 and 1991 papers on how legal frameworks failed Black women by treating discrimination as either racial or gendered rather than recognizing that the two could compound in ways that produced a distinct and worse harm. Crenshaw’s framework was theoretical and legal; Gebru and Buolamwini operationalized it for machine learning evaluation. The methodological move—disaggregate not just by one axis but by the intersection of multiple axes—translated an insight from social theory into a measurement protocol, showing that intersectionality is not merely a matter of justice but a matter of accuracy. A measurement framework that cannot see the intersection cannot accurately describe the world.
The practical implementation in “Gender Shades” required building a new benchmark. The researchers found that the existing face-analysis benchmarks were overwhelmingly composed of lighter-skinned, predominantly male faces, so they built their own test set, the Pilot Parliaments Benchmark, from official portraits of parliamentarians in three African and three European countries, balanced by both skin type and gender. This move demonstrated that the existing benchmarks were not neutral sampling frames; they were politically loaded instruments that measured the performance of systems on populations the systems were designed for, which guaranteed that the populations the systems failed would be invisible in the evaluation. Building the benchmark that could see the harm was itself the scientific contribution.
The benchmark is a political instrument. The choice of what to measure, at what level of granularity, for which populations, under which conditions, determines what a system appears to do. A system can be genuinely dangerous to a specific population while appearing accurate by every published benchmark, because none of the benchmarks was designed to measure its performance on that population. Intersectional disaggregation refuses the appearance of accuracy as evidence of actual accuracy; it demands that the evaluation match the distribution of harm rather than the priorities of the test set’s designers.
Harms compound at intersections. The distinctive insight from the intersectional tradition is that multiple axes of marginalization do not simply add; they multiply. A system that is ninety-eight percent accurate for women and ninety-eight percent accurate for Black people, each measured separately, may be sixty-five percent accurate for Black women measured at their intersection. This happens when Black women make up a small share of both groups in the test set, so that their errors are averaged away in each single-axis figure, and when the specific combination of features the system fails on is one that neither single-axis analysis would have caught. This compounding is not anomalous; it is the expected structure of a system trained on data that overrepresents some groups and underrepresents others, evaluated against benchmarks designed by those who belong to the overrepresented groups.
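The arithmetic behind that ninety-eight versus sixty-five percent contrast can be made concrete with a small sketch. The counts below are invented purely for illustration and are not drawn from “Gender Shades” or any real benchmark; the function name `accuracy_by` and the group labels are likewise assumptions of this example.

```python
from collections import defaultdict

# Hypothetical test-set counts, invented for illustration only:
# (race, gender) -> (number of samples, number classified correctly)
CELLS = {
    ("Black", "woman"): (20, 13),
    ("Black", "man"): (330, 330),
    ("non-Black", "woman"): (330, 330),
    ("non-Black", "man"): (320, 320),
}


def accuracy_by(cells, axes):
    """Return accuracy per group, grouping on the given axis indices.

    axes=() gives the aggregate, (0,) groups by race, (1,) by gender,
    and (0, 1) gives the full intersectional breakdown.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [samples, correct]
    for key, (n, correct) in cells.items():
        group = tuple(key[i] for i in axes)
        totals[group][0] += n
        totals[group][1] += correct
    return {group: correct / n for group, (n, correct) in totals.items()}


overall = accuracy_by(CELLS, axes=())
by_race = accuracy_by(CELLS, axes=(0,))
by_gender = accuracy_by(CELLS, axes=(1,))
by_both = accuracy_by(CELLS, axes=(0, 1))

for label, table in [("overall", overall), ("race", by_race),
                     ("gender", by_gender), ("race x gender", by_both)]:
    for group, acc in table.items():
        print(f"{label:>14} {' / '.join(group) or 'all':<20} {acc:.1%}")
```

With these made-up counts, the aggregate and both single-axis breakdowns stay at ninety-eight percent or above, while the intersectional cell for Black women sits at sixty-five percent. The gap is only visible in the last table because that cell holds twenty of the one thousand samples.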
Representation is an epistemic condition, not a social nicety. Gebru’s argument that a homogeneous AI workforce produces systems with systematic blind spots is not primarily a fairness argument but an accuracy argument. A team that has not lived the harm that a system can cause will not know to test for it, will not know which evaluation design would reveal it, and will not know which benchmark omissions are load-bearing. The invisible labor of designing evaluations that can see what the system does to people at the margins is labor that a homogeneous team is structurally unlikely to perform, because the members of that team are unlikely to be the people the margins affect.
The strongest objections to intersectional disaggregation as a methodological requirement concern feasibility and delay. As the number of intersectional categories multiplies (skin tone by gender by age by disability status by country of origin), the test set required for statistically meaningful performance measurements at every intersection grows combinatorially, and no practically achievable test set can be balanced across all intersections simultaneously. This is a genuine methodological tension, not a dismissal of the underlying concern. Gebru’s response has been pragmatic: the infeasibility of measuring everything is not a license to measure nothing, and the first obligation is to identify the intersections at which harm is most likely to concentrate and design the evaluation accordingly. A second objection holds that intersectional disaggregation can function as a tool of delay: requiring ever-more-granular evaluation before deployment becomes a strategy for keeping beneficial systems from reaching the people who need them. The counter is empirical: the systems that have caused the most documented harm have almost always been deployed without adequate evaluation, not held back by excessive rigor. The problem the field faces is not too much evaluation but too little, and that shortfall has been systematically concentrated on the populations least able to bear its consequences.
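The combinatorial growth named in the feasibility objection can be put in rough numbers. The sketch below is a back-of-the-envelope estimate, not a protocol from the source: the number of levels per axis is an assumption chosen for illustration, and the per-cell sample size uses the standard normal-approximation formula for a proportion at roughly 95 percent confidence with the worst-case value p = 0.5.

```python
import math


def samples_per_cell(margin, z=1.96, p=0.5):
    """Samples needed so an accuracy estimate's confidence-interval
    half-width is at most `margin` (normal approximation, worst-case p)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Hypothetical number of levels per axis; every value here is an assumption.
AXES = {
    "skin_tone": 6,
    "gender": 2,
    "age_band": 5,
    "disability_status": 2,
    "country_of_origin": 50,
}


def growth(axes, margin=0.05):
    """For each axis added in turn, return (axis, cells so far, images needed)
    for a test set balanced across every intersection."""
    per_cell = samples_per_cell(margin)
    rows, cells = [], 1
    for name, levels in axes.items():
        cells *= levels
        rows.append((name, cells, cells * per_cell))
    return rows


for name, cells, total in growth(AXES):
    print(f"+ {name:<18} cells={cells:>5}  balanced test set={total:>9,}")
```

Under these assumptions each cell needs 385 samples for a five-point margin, so five axes already imply 6,000 cells and more than two million balanced test items. The same function also supports the pragmatic response: restricting `AXES` to the intersections where harm is most likely to concentrate brings the total back into a feasible range.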