In January 2025, Bent Flyvbjerg sat down with ChatGPT and Perplexity and asked each a simple question: what was the cost overrun of Boston's Central Artery/Tunnel Project, known as the Big Dig? Flyvbjerg knew the answer. He had published it in peer-reviewed journals. The figure — 220 percent — had been cited hundreds of times in the academic literature on infrastructure failure. ChatGPT got it wrong. Perplexity got it worse, returning 478 percent, a figure with no grounding in any published source. Neither system hedged. Neither flagged uncertainty. Both delivered their errors with the grammatically impeccable, rhetorically persuasive confidence that large language models are engineered to produce. The experiment became the empirical core of Flyvbjerg's paper 'AI as Artificial Ignorance,' which circulated with a viral intensity unusual for an infrastructure scholar's academic work.
The test's diagnostic power derives from the nature of the question. The Big Dig is not an obscure infrastructure project. It is among the most extensively documented megaprojects in American history. The 220 percent figure is not contested. It appears in peer-reviewed publications, government reports, and standard textbooks on project management. A competent research assistant with access to any academic database could have verified it in minutes. The AI systems' failure to do so — and to do so with confidence — revealed not a local error but a structural condition.
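A note on the arithmetic may help readers unused to the convention: in Flyvbjerg's cost-overrun studies, overrun is expressed as the percentage by which actual cost exceeds the baseline estimate, so a 220 percent overrun means the project cost roughly 3.2 times what was promised, not 2.2 times. A minimal sketch of that calculation, using placeholder numbers rather than the project's actual figures:

```python
# Illustrative sketch of the cost-overrun arithmetic. The figures below are
# placeholders, not the Big Dig's actual baseline and outturn costs.

def cost_overrun_pct(baseline: float, actual: float) -> float:
    """Percentage cost overrun relative to the baseline estimate."""
    return (actual - baseline) / baseline * 100

# A 220 percent overrun means the final cost is 3.2 times the baseline
# estimate, not 2.2 times; the percentage framing can obscure this.
print(cost_overrun_pct(100, 320))  # 220.0
```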
Flyvbjerg deployed the test the way a diagnostician deploys a symptom: not because the symptom is the disease but because the symptom makes the underlying pathology visible. If the systems cannot accurately retrieve a well-documented factual claim in the questioner's own domain of expertise, the confident authority with which they pronounce on complex, ambiguous, high-stakes questions in domains where the user lacks expertise should provoke not admiration but alarm. The paint-color matching experiment Flyvbjerg conducted in parallel reinforced the diagnosis through a different route.
The 478 percent figure acquired a symbolic weight beyond its empirical wrongness. In Segal's foreword, it functions as the founding number of the entire book — the specific, quantified instance of confident wrongness that forced the author to recognize his own analogous experience with the fabricated Deleuze attribution. The number is diagnostic precisely because it cannot be plausibly defended. It is simply wrong, and the system that produced it could not flag the wrongness because the system had no mechanism for doing so.
The test also illuminated the political dimension of AI confidence. An error flagged with appropriate uncertainty — "I am not sure, but my best estimate is..." — would have been acceptable, even useful. An error delivered with the same fluent confidence as every correct answer is something worse than a mistake. It is a structural misrepresentation of the system's epistemic state, because the presentation implies a grounding that the generation process does not provide. The user is left without the cues that human experts provide to calibrate trust.
Flyvbjerg conducted the test in January 2025 and published the results shortly afterward in Project Leadership and Society. The paper accumulated over four thousand downloads within months, an unusual level of engagement for infrastructure scholarship. It was widely shared on LinkedIn and professional networks, and contributed to the broader 2025 reassessment of generative AI's reliability for knowledge-intensive work. Several features of the test account for its diagnostic force.
Objectively verifiable. The test's power derives from the fact that the answer is not contested, ambiguous, or judgment-dependent — it is a matter of published record.
Confident error. Both systems delivered wrong answers without hedging, making the error not merely a mistake but a misrepresentation of each system's epistemic state.
Domain expert as tester. The question was asked by the person whose peer-reviewed publications supply the correct answer, eliminating the possibility that the error reflected an arcane or contested fact.
478 as symbol. The specific wrong number became a symbolic anchor in subsequent discourse — the quantified instance of fluent wrongness that the system itself had no mechanism to catch.
Reproducible across systems. The error was not idiosyncratic to one model. Both systems tested produced wrong answers, indicating architectural rather than local failure.