CONCEPT

The Calibration Problem (Daston)

The structural difficulty of developing accurate user models of a knowledge technology's reliability — and the specific way AI impairs both conditions (error detection and error identification) on which calibration historically depends.

Calibration is the process by which users of a knowledge-producing technology learn to assess its outputs with accuracy proportionate to the outputs' actual reliability — extending trust where warranted and withholding it where not. It develops through a specific mechanism: the user produces or receives an output, compares it against some independent source of information, discovers a discrepancy, and updates her model of the technology's reliability. Over time, through repeated encounters with discrepancies of different types, the user develops a calibrated intuition — a sense, not always fully articulable but operationally reliable, of when the technology can be trusted and when it cannot. The mechanism depends on two conditions: errors must be detectable (independent information exists against which outputs can be compared) and identifiable (errors can be distinguished from accurate outputs, carrying markers that signal unreliability).
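
The mechanism lends itself to a simple formalization. The sketch below is an illustration rather than anything from Daston's account: it treats the user's evolving trust as a Beta-Bernoulli reliability estimate, where each output that can be independently checked nudges the estimate toward or away from trust, and outputs that cannot be checked leave it untouched. The ReliabilityModel class and its field names are assumptions introduced here for the example.

```python
# Illustrative sketch only: calibration modeled as repeated updating of a
# reliability estimate. Not drawn from Daston; names and structure are assumed.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReliabilityModel:
    """A user's evolving estimate of how often a technology's outputs are accurate."""
    correct: float = 1.0    # Beta prior pseudo-count: outputs found accurate
    incorrect: float = 1.0  # Beta prior pseudo-count: outputs found wrong

    @property
    def trust(self) -> float:
        """Point estimate of reliability (posterior mean of the Beta distribution)."""
        return self.correct / (self.correct + self.incorrect)

    def observe(self, output_was_accurate: Optional[bool]) -> None:
        """Update after one use of the technology.

        None means no independent check was possible (the detection condition
        fails), so no calibration occurs and the estimate stays where it was.
        """
        if output_was_accurate is None:
            return
        if output_was_accurate:
            self.correct += 1.0
        else:
            self.incorrect += 1.0


# A user who can rarely check outputs stays near her prior; a user with frequent
# checks converges toward the technology's actual reliability.
model = ReliabilityModel()
for checkable, accurate in [(False, False), (True, True), (False, False), (True, False)]:
    model.observe(accurate if checkable else None)
print(round(model.trust, 2))  # 0.5: one verified hit, one verified miss, two unchecked uses
```

In these terms, the argument of this entry is that AI use concentrates in the unchecked branch: the outputs most often consulted are the ones least open to comparison, so the estimate never moves.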

In the AI Story

[Hedcut illustration for The Calibration Problem (Daston)]

The calibration mechanism has been studied in detail across the history of scientific instrumentation. The microscopist encountering chromatic aberration had, in principle, access to independent information: she could compare the microscopic image with what she knew about the specimen from other methods. But the whole point of the microscope was to reveal features invisible to other methods. The independent information was, by definition, unavailable for the very features the microscope was most valued for revealing. The technology was most trusted precisely in the domain where calibration was most difficult — where outputs could not be compared against independent evidence because no independent evidence existed.

Daston identified this as a recurrent feature of the epistemological crises that accompany new instruments of observation. The telescope revealed features of celestial objects that could not be confirmed by naked-eye observation. The seismograph recorded patterns in the earth's vibrations that could not be felt by human senses. In each case, the instrument was valued for extending perception beyond its previous limits, and in each case, the extension created a domain of evidence for which no independent check existed. The instrument's most novel outputs were its least calibratable.

AI exhibits this structure with particular intensity, and with a specific twist. The technology is most frequently consulted in domains where the user lacks independent knowledge — where the consultation is motivated precisely by the gap between what the user knows and what the technology can provide. A researcher who consults AI for a summary of a literature she has not read has no independent basis for evaluating the summary's accuracy. A student who asks AI to explain a concept she does not understand has no independent basis for assessing whether the explanation is correct. In each of these cases — and they constitute the majority of AI use cases — the first condition for calibration is structurally unsatisfied.

The second condition fares no better. Previous technologies' errors had characteristic signatures: the microscope's chromatic aberrations had identifiable visual forms; the photograph's chemical artifacts had recognizable grain patterns. These signatures required training to recognize, but they existed, and their existence meant that calibration was achievable through accumulated experience with the technology's characteristic failure modes. AI-generated text has no characteristic error signature. A factually wrong statement is grammatically, syntactically, and rhetorically identical to a factually correct one. The confident assertion of a false claim is indistinguishable, in every surface feature, from the confident assertion of a true one. Errors are not merely difficult to detect — in their surface presentation, they are invisible.

Origin

The calibration concept has a long history in engineering, measurement theory, and cognitive psychology. Daston's specific contribution was to generalize it beyond the context of physical instruments and to analyze its institutional conditions — the community structures, training programs, and evaluative practices through which calibration has historically been achieved. Her research on scientific communities, particularly in Histories of Scientific Observation (2011), documented the social and material infrastructure on which individual calibration depends.

The argument that AI creates a particularly severe calibration problem emerged from the confluence of this earlier work with the empirical observation, accumulating rapidly after 2022, that AI users were systematically mis-calibrated — extending trust beyond what the technology's reliability warranted, in ways that individual awareness alone seemed unable to correct.

Key Ideas

Calibration requires detectable errors. Users must have access to independent information against which outputs can be compared — a condition structurally impaired when users consult the technology precisely because they lack such information.

Calibration requires identifiable errors. Previous technologies produced errors with characteristic signatures; AI errors are indistinguishable in surface features from accurate outputs.

The paradox of novel outputs. Technologies are most valued for extending capability beyond prior methods, but exactly this extension creates domains where no independent check exists.

Prospective calibration failure. AI's most consequential calibration losses may lie not in present errors but in the prospective erosion of evaluative capacities, a category without clean historical precedent.

Institutional, not individual, achievement. Calibration has historically been produced by communities with shared instruments, compared observations, and training programs; the equivalent infrastructure for AI does not yet exist in adequate form.

Debates & Critiques

A recurrent debate concerns whether AI's calibration problem is qualitatively different from earlier calibration problems or merely a severe case of a general pattern. Defenders of the 'qualitatively different' view point to the absence of error signatures and the structural impairment of the detection condition; defenders of the 'severe case' view argue that earlier technologies had their own versions of these difficulties, which were eventually overcome through institutional innovation, and that the same will prove true for AI. The position this volume takes is that the difference is one of degree, but of a degree severe enough to require institutional responses of unprecedented scale: it acknowledges continuity with historical patterns while emphasizing the specific features that make this calibration problem particularly acute.


Further reading

  1. Lorraine Daston and Elizabeth Lunbeck (eds.), Histories of Scientific Observation (University of Chicago Press, 2011)
  2. Gary Klein, Sources of Power (MIT Press, 1998)
  3. Lorraine Daston and Peter Galison, Objectivity (Zone Books, 2007)
  4. Lisanne Bainbridge, 'Ironies of Automation,' Automatica 19 (1983)