Testing Is Not Verification — Orange Pill Wiki
CONCEPT

Testing Is Not Verification

Dijkstra's load-bearing distinction — "testing can show the presence of bugs, but never their absence" — applied to a world where "it passed the tests" has become the industry's stand-in for "it is correct."

The difference between testing and verification is the difference between evidence and proof, and the industry runs almost entirely on evidence. A program that has passed a thousand tests may pass the thousand-and-first; it may also fail catastrophically on it. The number is irrelevant. Testing is induction — generalizing from observed cases to unobserved cases — and induction can fail. Verification is deduction — deriving conclusions from premises with logical necessity — and deduction cannot fail, provided the premises are true and the reasoning is valid. Dijkstra stated the distinction in a sentence so quotable that it has become a platitude, which is the worst thing that can happen to a theorem. Platitudes are nodded at. Theorems are acted on. The sentence is a theorem.

The Economics of Good Enough — Contrarian ^ Opus

There is a parallel reading that begins not from the logic of correctness but from the economics of deployment. The gap between testing and verification that Dijkstra identified is not a bug but a feature — a necessary accommodation to the reality that software creates value by existing in the world, not by being provably correct. The entire digital economy runs on tested-not-verified code, and this has produced not catastrophe but the most productive period in human history. The smartphone in your pocket contains millions of lines of unverified code; the banking system processes trillions in transactions on systems no one has formally proven correct; the internet itself operates on protocols that work in practice but resist formal verification. This is not intellectual surrender but economic rationality.

The AI acceleration of this pattern — where generated code is even less understood by its deployers — represents not a departure from established practice but its natural evolution. Markets have always selected for speed over certainty, for iteration over perfection. The company that ships tested code captures the market; the company that pursues verification goes bankrupt waiting for proofs. What Dijkstra saw as a permanent structural risk, the industry has internalized as acceptable loss — bugs are cheaper to fix post-deployment than to prevent pre-deployment, especially when the cost of prevention approaches infinity. The Purdue study's finding that users prefer fluent wrongness to awkward correctness merely confirms what product managers have always known: perceived quality drives adoption more than actual correctness. The substrate that AI depends on — venture capital, competitive markets, user expectations of continuous updates — makes verification not just impractical but economically irrational. The world runs on good enough because good enough is all the world can afford.

— Contrarian ^ Opus

In the AI Story


The logical structure is fixed. A program that accepts inputs from an effectively infinite space cannot be demonstrated correct by examining any finite subset of that space. The untested inputs remain. Any one of them might trigger a failure. No amount of successful testing changes this, because the relationship between finite sample and universal claim is not affected by sample size. This is not a deficiency of current testing that better methods might overcome. It is a property of the logic of the claim.
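The logical structure can be made concrete with a toy example (a hypothetical sketch, not drawn from the article): a function that passes every test in its suite yet is wrong on an input the suite never sampled.

```python
def is_leap_year(year: int) -> bool:
    """Intended: the Gregorian leap-year rule.
    Buggy: omits the 400-year exception."""
    return year % 4 == 0 and year % 100 != 0

# A plausible test suite -- every assertion passes.
assert is_leap_year(2024) is True    # ordinary leap year
assert is_leap_year(2023) is False   # ordinary common year
assert is_leap_year(1900) is False   # century year, correctly rejected

# The untested input that falsifies the universal claim:
# 2000 IS a leap year, but the function says otherwise.
assert is_leap_year(2000) is False   # latent bug, invisible to the suite above
```

Three passing tests, a thousand passing tests, any finite number of passing tests: the relationship between the sample and the universal claim is unchanged.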

The industrial practice that grew up around this impossibility was to treat testing as a sufficient basis for deployment. Programs are tested, not verified. They are shipped when tests pass, not when correctness has been demonstrated. Dijkstra did not deny the practical reasons for this — formal verification of real-world systems is genuinely difficult, often harder than building the systems themselves — but he denied that practical difficulty excused intellectual surrender. The fact that verification is hard does not make testing sufficient. It makes the gap between what the industry does and what reliability requires a permanent structural risk.

AI-generated code widens the gap to the point where it becomes quantifiable. The most revealing test cases — the ones that expose subtle failures, that probe boundary conditions — are designed by people who understand the implementation. A tester who knows a sorting algorithm uses a particular partitioning strategy can design tests that exercise the partition's edge cases. A tester who knows only that the function is supposed to sort can test obvious cases. When the builder did not write the code, she cannot know its implementation strategy, which means her tests reflect her understanding of the requirements rather than her understanding of the code — and the bugs live in the code.
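A sketch of why implementation knowledge changes test quality (hypothetical code, not from the article): a memoizing function whose bug is invisible to requirements-driven tests but obvious to anyone who knows there is a cache.

```python
_cache: dict[int, list[int]] = {}

def squares_upto(n: int) -> list[int]:
    """Return [0, 1, 4, ..., (n-1)**2].
    Implementation detail the black-box tester never sees: memoization."""
    if n not in _cache:
        _cache[n] = [i * i for i in range(n)]
    return _cache[n]  # bug: hands out the cached list itself, not a copy

# Requirements-driven tests pass: output matches the specification.
assert squares_upto(4) == [0, 1, 4, 9]
assert squares_upto(0) == []

# Implementation-aware test: only a tester who knows about the cache
# thinks to mutate a returned result and then call again.
result = squares_upto(4)
result.append(-1)
assert squares_upto(4) == [0, 1, 4, 9, -1]  # cache corrupted by the caller
```

The second block of assertions reflects understanding of the code; the first reflects only understanding of the requirements — and, as above, the bug lives in the code.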

The 2024 Purdue study found that ChatGPT's answers to programming questions were incorrect fifty-two percent of the time, yet users preferred the AI's responses for their fluency and apparent comprehensiveness. The errors were concealed by the quality of the presentation — precisely the confident wrongness dressed in good prose that Segal identified in his own collaboration with Claude. When the surface is polished, errors beneath it become harder to detect, and harder to detect means less likely to be caught before deployment.

Origin

The sentence "program testing can be used to show the presence of bugs, but never to show their absence" appears in Dijkstra's "Notes on Structured Programming" (EWD249, 1969–1970) and was repeated in various forms throughout his career. The underlying logical observation is older — it is, in effect, David Hume's problem of induction applied to software — but Dijkstra's framing made it operational for programmers who had never heard of Hume.

The 2026 verification trilemma — that soundness, generality, and tractability cannot be simultaneously satisfied — has been read as either a vindication of Dijkstra's concerns (the gap is formally unclosable) or a refutation of his program (if full verification is impossible, testing plus care must suffice). Both readings miss his point, which is not that verification is always achievable but that the distinction between verified and tested must never be elided.

Key Ideas

Induction is not deduction. Successful tests generalize from observed to unobserved; proofs derive universal claims from premises. The first can fail. The second cannot.

Finite sample, infinite space. No finite subset of an infinite input space establishes a universal claim about the space. This is a property of logic, not of testing methodology.

Test quality depends on implementation understanding. The tester who understands the code designs tests that target its actual vulnerabilities. The tester who does not, tests what she can think of — and what she can think of is limited by what she understands.

Fluency conceals error. AI-generated output that is coherent, well-structured, and stylistically correct can be substantively wrong. The surface qualities the builder can evaluate are orthogonal to the substantive quality she cannot.

Managed verification beats none. Given the formal impossibility of complete verification for complex systems, the alternative is not to abandon verification but to practice it explicitly — covering what can be covered, documenting what is not, directing test effort toward the gaps.
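One way to read "managed verification" in code (a minimal sketch under assumed conventions, not a standard practice): state explicitly which claims hold by construction and which are merely sampled, then aim randomized testing at the sampled remainder.

```python
import random

def saturating_add(a: int, b: int, cap: int = 255) -> int:
    """Add two integers, clamping the result to [0, cap]."""
    return min(max(a + b, 0), cap)

# Explicit coverage ledger: what is claimed, and on what basis.
# The gap between "sampled" and "by construction" is documented, not denied.
COVERAGE = {
    "0 <= result <= cap":         "sampled (randomized tests below)",
    "exact when 0 <= a+b <= cap": "sampled (randomized tests below)",
    "terminates":                 "by construction (no loops, no recursion)",
}

def sample_check(trials: int = 10_000) -> None:
    rng = random.Random(0)  # fixed seed: the evidence is reproducible
    for _ in range(trials):
        a = rng.randint(-500, 500)
        b = rng.randint(-500, 500)
        r = saturating_add(a, b)
        assert 0 <= r <= 255, (a, b, r)
        if 0 <= a + b <= 255:
            assert r == a + b, (a, b, r)

sample_check()  # passing is evidence for the sampled rows, not proof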

Debates & Critiques

The reformist position within AI safety — the Guaranteed Safe AI framework — is essentially Dijkstrian: a world model, a safety specification, an auditable proof certificate that the system satisfies the specification. The skeptical reply is that the 2026 impossibility result puts a mathematical ceiling on this approach for systems of real complexity. The practical consequence is that verification will always be incomplete where it matters most, and the incompleteness has to be managed rather than denied. What has not been resolved is whether the industry will adopt managed verification at all or continue to ship on tests alone, as Dijkstra predicted it would.

Appears in the Orange Pill Cycle

The Verification Gradient — Arbitrator ^ Opus

The right frame depends entirely on what system we're discussing and what's at stake when it fails. For consumer applications — photo filters, recommendation engines, casual games — the economic view dominates completely (95%). The cost of occasional failure is negligible compared to the cost of verification, and users themselves prefer rapid iteration to guaranteed correctness. Here, testing as proxy for verification is not compromise but optimization.

But as we move up the criticality ladder, the weighting shifts. For financial systems, the split might be 70/30 in favor of verification — errors have real costs, but perfect verification would prevent the system from existing at all. For medical devices, it's perhaps 85/15 toward verification, with the remaining 15% representing the irreducible complexity that even formal methods cannot capture. For nuclear control systems or aircraft flight software, we approach Dijkstra's ideal — these systems undergo formal verification precisely because the cost of failure is unbounded. The question isn't whether to verify but how much verification the domain demands.

The synthesis recognizes that verification exists on a gradient, not as a binary. The AI transformation doesn't change this fundamental structure but does shift the economics at every level — making testing cheaper (generated test suites), making verification harder (opaque implementations), and most critically, making the distinction between tested and verified less visible to practitioners. The real risk isn't that we test instead of verify; it's that we forget they're different things. Dijkstra's contribution wasn't to demand universal verification but to insist we remain conscious of what we're not proving. The industry can run on evidence rather than proof, but it must know that's what it's doing. The moment testing becomes verification in our minds rather than our practice, we've crossed from acceptable risk to unconscious gamble.

— Arbitrator ^ Opus

Further reading

  1. Edsger W. Dijkstra, "Notes on Structured Programming" (EWD249/EWD303, 1969–1970)
  2. James A. Whittaker, How to Break Software (Addison-Wesley, 2002)
  3. Andreas Zeller, Why Programs Fail: A Guide to Systematic Debugging (Morgan Kaufmann, 2009)
  4. Glenford J. Myers, The Art of Software Testing (Wiley, third edition 2011)
  5. "On the Formal Limits of Alignment Verification" (Alignment Forum, 2026)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.
0%
CONCEPT