CONCEPT

Testing Is Not Verification

Dijkstra's load-bearing distinction — "testing can show the presence of bugs, but never their absence" — applied to a world where <em>it passed the tests</em> has become the industry's stand-in for <em>it is correct</em>.

The difference between testing and verification is the difference between evidence and proof, and the industry runs almost entirely on evidence. A program that has passed a thousand tests may pass the thousand-and-first; it may also fail catastrophically on it. The number is irrelevant. Testing is induction — generalizing from observed cases to unobserved cases — and induction can fail. Verification is deduction — deriving conclusions from premises with logical necessity — and deduction cannot fail, provided the premises are true and the reasoning is valid. Dijkstra stated the distinction in a sentence so quotable that it has become a platitude, which is the worst thing that can happen to a theorem. Platitudes are nodded at. Theorems are acted on. The sentence is a theorem.

In The You On AI Field Guide

The logical structure is fixed. A program that accepts inputs from an effectively infinite space cannot be demonstrated correct by examining any finite subset of that space. The untested inputs remain. Any one of them might trigger a failure. No amount of successful testing changes this, because the relationship between finite sample and universal claim is not affected by sample size. This is not a deficiency of current testing that better methods might overcome. It is a property of the logic of the claim.
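A minimal sketch of this point, in Python. The function `is_leap_year` and its test suite are hypothetical, invented here for illustration. The implementation deliberately omits the century rule of the Gregorian calendar, yet every case in the finite suite passes.

```python
def is_leap_year(year: int) -> bool:
    """Hypothetical, deliberately incomplete implementation.

    It ignores the Gregorian century rule (years divisible by 100
    are leap years only if they are also divisible by 400).
    """
    return year % 4 == 0


# A finite test suite. Every expected value here is correct,
# and the function agrees with all of them.
passing_cases = {
    1996: True,
    2000: True,
    2020: True,
    2021: False,
    2023: False,
    2024: True,
}


def run_suite() -> bool:
    """Return True when the function matches every case in the suite."""
    return all(is_leap_year(y) == expected for y, expected in passing_cases.items())


# An input the suite never examined. 1900 is divisible by 100 but not
# by 400, so it is not a leap year, yet the function says it is.
untested_year = 1900

if __name__ == "__main__":
    print("suite passed:", run_suite())
    print("is_leap_year(1900):", is_leap_year(untested_year))
```

The suite reports success and the function is still wrong for 1900. Adding 1900 to the suite would catch this particular bug, but the enlarged suite would remain a finite sample of an unbounded input space.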

The industrial practice that grew up around this impossibility was to treat testing as a sufficient basis for deployment. Programs are tested, not verified. They are shipped when tests pass, not when correctness has been demonstrated. Dijkstra did not deny the practical reasons for this — formal verification of real-world systems is genuinely difficult, often harder than building the systems themselves — but he denied that practical difficulty excused intellectual surrender. The fact that verification is hard does not make testing sufficient. It makes the gap between what the industry does and what reliability requires a permanent structural risk.

AI-generated code widens the gap to the point where it becomes quantifiable. The most revealing test cases, the ones that expose subtle failures and probe boundary conditions, are designed by people who understand the implementation. A tester who knows that a sorting algorithm uses a particular partitioning strategy can design tests that exercise the partition's edge cases. A tester who knows only that the function is supposed to sort can test only the obvious cases. When the builder did not write the code and has not studied it, she does not know its implementation strategy, which means her tests reflect her understanding of the requirements rather than her understanding of the code, and the bugs live in the code.
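The sorting case can be sketched in Python. The `quicksort` below is a hypothetical implementation with a planted bug, written for illustration only: its partition step silently drops elements equal to the pivot. Tests written from the requirement alone ("it should sort") happen to use distinct values and pass. A test written by someone who has read the partition logic targets duplicates and fails immediately.

```python
def quicksort(xs):
    """Hypothetical quicksort with a planted partitioning bug."""
    if len(xs) <= 1:
        return list(xs)
    pivot = xs[0]
    rest = xs[1:]
    smaller = [x for x in rest if x < pivot]
    # Bug: the strict comparison discards every element equal to the pivot.
    larger = [x for x in rest if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)


def requirement_based_tests() -> bool:
    """Tests derived from the requirement only; all inputs have distinct values."""
    cases = [[], [10], [3, 1, 2], [5, 4, 3, 2, 1], [1, 2, 3, 4]]
    return all(quicksort(c) == sorted(c) for c in cases)


def implementation_aware_test() -> bool:
    """A test aimed at the partition step: what happens to pivot duplicates?"""
    data = [2, 1, 2]
    return quicksort(data) == sorted(data)


if __name__ == "__main__":
    print("requirement-based tests pass:", requirement_based_tests())
    print("implementation-aware test passes:", implementation_aware_test())
```

Both testers are competent. The difference is that only one of them knew where to look.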

The 2024 Purdue study found that fifty-two percent of ChatGPT's answers to programming questions contained incorrect information, yet participants still preferred the AI's responses about a third of the time, citing their fluency and apparent comprehensiveness. The errors were concealed by the quality of the presentation: precisely the confident wrongness dressed in good prose that Segal identified in his own collaboration with Claude. When the surface is polished, errors beneath it become harder to detect, and harder to detect means less likely to be caught before deployment.

Origin

The sentence "program testing can be used to show the presence of bugs, but never to show their absence" appears in Dijkstra's 1970 EWD249, "Notes on Structured Programming," and was repeated in various forms throughout his career. The underlying logical observation is older (it is, in effect, David Hume's problem of induction applied to software), but Dijkstra's framing made it operational for programmers who had never heard of Hume.

The 2026 verification trilemma — that soundness, generality, and tractability cannot be simultaneously satisfied — has been read as either a vindication of Dijkstra's concerns (the gap is formally unclosable) or a refutation of his program (if full verification is impossible, testing plus care must suffice). Both readings miss his point, which is not that verification is always achievable but that the distinction between verified and tested must never be elided.

Key Ideas

Induction is not deduction. Successful tests generalize from observed to unobserved; proofs derive universal claims from premises. The first can fail. The second cannot.

Finite sample, infinite space. No finite subset of an infinite input space establishes a universal claim about the space. This is a property of logic, not of testing methodology.

Test quality depends on implementation understanding. The tester who understands the code designs tests that target its actual vulnerabilities. The tester who does not, tests what she can think of — and what she can think of is limited by what she understands.

Fluency conceals error. AI-generated output that is coherent, well-structured, and stylistically correct can be substantively wrong. The surface qualities the builder can evaluate are orthogonal to the substantive quality she cannot.

Managed verification beats none. Given the formal impossibility of complete verification for complex systems, the alternative is not to abandon verification but to practice it explicitly — covering what can be covered, documenting what is not, directing test effort toward the gaps.
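One way to picture "covering what can be covered, documenting what is not" is the sketch below. It is exhaustive checking over a bounded domain, not a formal proof, and it establishes the property only for the inputs it enumerates. The function `saturating_add`, the bounds, and the `COVERAGE` record are all assumptions made for illustration.

```python
from itertools import product

# Assumed bounded domain: the range of a signed 8-bit integer.
LO, HI = -128, 127


def saturating_add(a: int, b: int) -> int:
    """Hypothetical function under scrutiny: add, then clamp to [LO, HI]."""
    return max(LO, min(HI, a + b))


def check_bounded_domain() -> int:
    """Check every pair (a, b) in the bounded domain.

    Returns the number of pairs checked. Raises AssertionError on
    the first pair that violates the specification.
    """
    checked = 0
    for a, b in product(range(LO, HI + 1), repeat=2):
        result = saturating_add(a, b)
        exact = a + b
        if exact > HI:
            assert result == HI      # overflow clamps to the upper bound
        elif exact < LO:
            assert result == LO      # underflow clamps to the lower bound
        else:
            assert result == exact   # in-range sums are returned unchanged
        checked += 1
    return checked


# Explicit record of what the check does and does not establish.
COVERAGE = {
    "covered": f"all integer pairs with both operands in [{LO}, {HI}]",
    "not_covered": [
        "operands outside the bounded domain",
        "non-integer operands such as floats",
    ],
}

if __name__ == "__main__":
    print("pairs checked:", check_bounded_domain())
    print("not covered:", COVERAGE["not_covered"])
```

Inside the bounded domain the claim is no longer inductive, because no case is left unexamined. Outside it, the `not_covered` list says plainly where further test effort or proof would have to go.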
