In January 2025, Bent Flyvbjerg sat down with ChatGPT and Perplexity and asked each a simple question: what was the cost overrun of Boston's Central Artery/Tunnel Project, known as the Big Dig? Flyvbjerg knew the answer. He had published it in peer-reviewed journals. The figure — 220 percent — had been cited hundreds of times in the academic literature on infrastructure failure. ChatGPT got it wrong. Perplexity got it worse, returning 478 percent, a figure with no grounding in any published source. Neither system hedged. Neither flagged uncertainty. Both delivered their errors with the grammatically impeccable, rhetorically persuasive confidence that large language models are engineered to produce. The experiment became the empirical core of Flyvbjerg's paper 'AI as Artificial Ignorance,' which circulated with a viral intensity unusual for an infrastructure scholar's academic work.
The test's diagnostic power derives from the nature of the question. The Big Dig is not an obscure infrastructure project. It is among the most extensively documented megaprojects in American history. The 220 percent figure is not contested. It appears in peer-reviewed publications, government reports, and standard textbooks on project management. A competent research assistant with access to any academic database could have verified it in minutes. The AI systems' failure to do so — and to do so with confidence — revealed not a local error but a structural condition.
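A note on the arithmetic may help readers unused to the convention: in Flyvbjerg's cost-overrun studies, overrun is expressed as the percentage by which actual cost exceeds the baseline estimate, so a 220 percent overrun means the project cost roughly 3.2 times what was promised, not 2.2 times. A minimal sketch of that calculation, using placeholder numbers rather than the project's actual figures:

```python
# Illustrative sketch of the cost-overrun arithmetic. The figures below are
# placeholders, not the Big Dig's actual baseline and outturn costs.

def cost_overrun_pct(baseline: float, actual: float) -> float:
    """Percentage cost overrun relative to the baseline estimate."""
    return (actual - baseline) / baseline * 100

# A 220 percent overrun means the final cost is 3.2 times the baseline
# estimate, not 2.2 times; the percentage framing can obscure this.
print(cost_overrun_pct(100, 320))  # 220.0
```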
Flyvbjerg deployed the test the way a diagnostician deploys a symptom: not because the symptom is the disease but because the symptom makes the underlying pathology visible. If the systems cannot accurately retrieve a well-documented factual claim in the questioner's own domain of expertise, the confident authority with which they pronounce on complex, ambiguous, high-stakes questions in domains where the user lacks expertise should provoke not admiration but alarm. The paint-color matching experiment Flyvbjerg conducted in parallel reinforced the diagnosis through a different route.
The 478 percent figure acquired a symbolic weight beyond its empirical wrongness. In Segal's foreword, it functions as the founding number of the entire book — the specific, quantified instance of confident wrongness that forced the author to recognize his own analogous experience with the fabricated Deleuze attribution. The number is diagnostic precisely because it cannot be plausibly defended. It is simply wrong, and the system that produced it could not flag the wrongness because the system had no mechanism for doing so.
The test also illuminated the political dimension of AI confidence. An error flagged with appropriate uncertainty — "I am not sure, but my best estimate is..." — would have been acceptable, even useful. An error delivered with the same fluent confidence as every correct answer is something worse than a mistake. It is a structural misrepresentation of the system's epistemic state, because the presentation implies a grounding that the generation process does not provide. The user is left without the cues that human experts provide to calibrate trust.
Flyvbjerg conducted the test in January 2025 and published the results shortly afterward in Project Leadership and Society. The paper accumulated over four thousand downloads within months, an unusual level of engagement for infrastructure scholarship. It was widely shared on LinkedIn and professional networks, and contributed to the broader 2025 reassessment of generative AI's reliability for knowledge-intensive work. Several features of the test account for its diagnostic force.
Objectively verifiable. The test's power derives from the fact that the answer is not contested, ambiguous, or judgment-dependent — it is a matter of published record.
Confident error. Both systems delivered wrong answers without hedging, making the error not merely a mistake but a misrepresentation of each system's epistemic state.
Domain expert as tester. The question was asked by the person whose peer-reviewed publications supply the correct answer, eliminating the possibility that the error reflected an arcane or contested fact.
478 as symbol. The specific wrong number became a symbolic anchor in subsequent discourse — the quantified instance of fluent wrongness that the system itself had no mechanism to catch.
Reproducible across systems. The error was not idiosyncratic to one model. Both systems tested produced wrong answers, indicating architectural rather than local failure.