You On AI Field Guide · Testing Is Not Verification
CONCEPT

Testing Is Not Verification

Dijkstra's load-bearing distinction, "testing can show the presence of bugs, but never their absence," applied to a world where "it passed the tests" has become the industry's stand-in for "it is correct."
The difference between testing and verification is the difference between evidence and proof, and the industry runs almost entirely on evidence. A program that has passed a thousand tests may pass the thousand-and-first; it may also fail catastrophically on it. The number is irrelevant. Testing is induction — generalizing from observed cases to unobserved cases — and induction can fail. Verification is deduction — deriving conclusions from premises with logical necessity — and deduction cannot fail, provided the premises are true and the reasoning is valid. Dijkstra stated the distinction in a sentence so quotable that it has become a platitude, which is the worst thing that can happen to a theorem. Platitudes are nodded at. Theorems are acted on. The sentence is a theorem.

The logical structure is fixed. A program that accepts inputs from an effectively infinite space cannot be demonstrated correct by examining any finite subset of that space. The untested inputs remain. Any one of them might trigger a failure. No amount of successful testing changes this, because the relationship between finite sample and universal claim is not affected by sample size. This is not a deficiency of current testing that better methods might overcome. It is a property of the logic of the claim.
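
A minimal sketch of that logic, using a hypothetical `is_leap_buggy` function invented here for illustration. It agrees with the standard library's `calendar.isleap` on every year from 1901 through 2099, so a test suite drawn from that range passes in full, yet it is wrong for 1900 and 2100.

```python
import calendar


def is_leap_buggy(year: int) -> bool:
    """Hypothetical implementation: ignores the century rules."""
    return year % 4 == 0


def failures(years) -> list[int]:
    """Return the years where the buggy function disagrees with the reference."""
    return [y for y in years if is_leap_buggy(y) != calendar.isleap(y)]


# 199 consecutive passing tests: every year a tester might plausibly try.
tested = range(1901, 2100)
print(len(tested), "tests,", len(failures(tested)), "failures")

# Two inputs outside the sample are enough to falsify the universal claim.
print("untested failures:", failures([1900, 2100]))
```

Widening the tested range would catch these two years, but the structure of the problem is unchanged: whatever finite range is chosen, the claim about all years rests on inputs that were never examined.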

The industrial practice that grew up around this impossibility was to treat testing as a sufficient basis for deployment. Programs are tested, not verified. They are shipped when tests pass, not when correctness has been demonstrated. Dijkstra did not deny the practical reasons for this — formal verification of real-world systems is genuinely difficult, often harder than building the systems themselves — but he denied that practical difficulty excused intellectual surrender. The fact that verification is hard does not make testing sufficient. It makes the gap between what the industry does and what reliability requires a permanent structural risk.

AI-generated code widens the gap to the point where it becomes quantifiable. The most revealing test cases (the ones that expose subtle failures and probe boundary conditions) are designed by people who understand the implementation. A tester who knows a sorting algorithm uses a particular partitioning strategy can design tests that exercise the partition's edge cases. A tester who knows only that the function is supposed to sort can test only the obvious cases. When the builder did not write the code, she cannot know its implementation strategy, which means her tests reflect her understanding of the requirements rather than her understanding of the code, and the bugs live in the code.
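
The difference between the two testers can be sketched as follows. The `quicksort` below is a hypothetical piece of generated code with a bug planted for illustration: its partition step discards values equal to the pivot. Tests written from the requirement alone ("it sorts") pass, and the test that fails is the one aimed at how the partition treats equal keys.

```python
def quicksort(items: list[int]) -> list[int]:
    """Hypothetical generated code: the partition silently drops pivot duplicates."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x > pivot]  # bug: values equal to pivot vanish
    return quicksort(smaller) + [pivot] + quicksort(larger)


# Requirement-level tests: "does it sort?" All of these pass.
requirement_cases = [[], [1], [3, 1, 2], [5, 4, 3, 2, 1], [-2, 7, 0]]
requirement_ok = all(quicksort(c) == sorted(c) for c in requirement_cases)

# Implementation-aware test: aims at the partition's handling of equal keys.
partition_case = [2, 1, 2]
partition_ok = quicksort(partition_case) == sorted(partition_case)

print(requirement_ok, partition_ok)  # True False
```

A careful black-box tester might also think to try duplicates. The point of the sketch is only that knowing where the partition sits tells the tester where to look.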

A 2024 Purdue study found that fifty-two percent of ChatGPT's answers to programming questions contained incorrect information, yet participants still preferred the AI's responses about a third of the time, citing their fluency and apparent comprehensiveness. The errors were concealed by the quality of the presentation: precisely the confident wrongness dressed in good prose that Segal identified in his own collaboration with Claude. When the surface is polished, errors beneath it become harder to detect, and harder to detect means less likely to be caught before deployment.

Origin

The sentence "program testing can be used to show the presence of bugs, but never to show their absence" appears in Dijkstra's "Notes on Structured Programming" (EWD249, 1970) and was repeated in various forms throughout his career. The underlying logical observation is older (it is, in effect, David Hume's problem of induction applied to software), but Dijkstra's framing made it operational for programmers who had never heard of Hume.

The 2026 verification trilemma — that soundness, generality, and tractability cannot be simultaneously satisfied — has been read as either a vindication of Dijkstra's concerns (the gap is formally unclosable) or a refutation of his program (if full verification is impossible, testing plus care must suffice). Both readings miss his point, which is not that verification is always achievable but that the distinction between verified and tested must never be elided.

Key Ideas

Induction is not deduction. Successful tests generalize from observed to unobserved; proofs derive universal claims from premises. The first can fail. The second cannot.

Finite sample, infinite space. No finite subset of an infinite input space establishes a universal claim about the space. This is a property of logic, not of testing methodology.

Test quality depends on implementation understanding. The tester who understands the code designs tests that target its actual vulnerabilities. The tester who does not, tests what she can think of — and what she can think of is limited by what she understands.

Fluency conceals error. AI-generated output that is coherent, well-structured, and stylistically correct can be substantively wrong. The surface qualities the builder can evaluate are orthogonal to the substantive quality she cannot.

Managed verification beats none. Given the formal impossibility of complete verification for complex systems, the alternative is not to abandon verification but to practice it explicitly — covering what can be covered, documenting what is not, directing test effort toward the gaps.
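
One way to picture "covering what can be covered, documenting what is not" is a bounded exhaustive check. The sketch below assumes a hypothetical `saturating_add_u8` function and a made-up report format. Because the checked domain is finite, examining every pair settles the claim for that domain by cases rather than by sampling, and the report states plainly which inputs fall outside it.

```python
from itertools import product


def saturating_add_u8(a: int, b: int) -> int:
    """Hypothetical function under check: 8-bit add that clamps at 255."""
    return min(a + b, 255)


def check_exhaustively() -> dict:
    """Check the spec on every pair in the bounded domain and report coverage."""
    domain = range(256)
    checked = 0
    for a, b in product(domain, repeat=2):
        result = saturating_add_u8(a, b)
        # Spec: exact sum when it fits, otherwise the ceiling of 255.
        assert result == (a + b if a + b <= 255 else 255)
        assert 0 <= result <= 255
        checked += 1
    return {
        "covered": f"all {checked} pairs with a, b in 0..255",
        "not_covered": "negative or out-of-range arguments; non-integer inputs",
    }


report = check_exhaustively()
print(report["covered"])
print(report["not_covered"])
```

The `not_covered` entry is the part that matters for managed verification: it turns the unchecked region from an unstated assumption into a documented gap where further test effort can be directed.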

Debates & Critiques

The reformist position within AI safety — the Guaranteed Safe AI framework — is essentially Dijkstrian: a world model, a safety specification, an auditable proof certificate that the system satisfies the specification. The skeptical reply is that the 2026 impossibility result puts a mathematical ceiling on this approach for systems of real complexity. The practical consequence is that verification will always be incomplete where it matters most, and the incompleteness has to be managed rather than denied. What has not been resolved is whether the industry will adopt managed verification at all or continue to ship on tests alone, as Dijkstra predicted it would.

Further Reading

  1. Edsger W. Dijkstra, "Notes on Structured Programming" (EWD249, 1969–1970)
  2. James A. Whittaker, How to Break Software (Addison-Wesley, 2002)
  3. Andreas Zeller, Why Programs Fail: A Guide to Systematic Debugging (Morgan Kaufmann, 2009)
  4. Glenford J. Myers, The Art of Software Testing (Wiley, third edition 2011)
  5. "On the Formal Limits of Alignment Verification" (Alignment Forum, 2026)

Three Positions on Testing Is Not Verification

From Chapter 15 — how the Boulder, the Believer, and the Beaver each read this concept
Boulder · Refusal
Han's diagnosis
The Boulder sees in Testing Is Not Verification evidence of the pathology — that refusal, not adaptation, is the correct posture. The garden, the analog life, the smartphone that is not bought.
Believer · Flow
Riding the current
The Believer sees Testing Is Not Verification as the river's direction — lean in. Trust that the technium, as Kevin Kelly argues, wants what life wants. Resistance is fear, not wisdom.
Beaver · Stewardship
Building dams
The Beaver sees Testing Is Not Verification as an opportunity for construction. Neither refuse nor surrender — build the institutional, attentional, and craft governors that shape the river around the things worth preserving.
