The background claim is logical, not empirical. Programs accept inputs from effectively infinite spaces. Testing examines a finite subset of those spaces. No finite subset of an infinite space can stand proxy for the whole. This is not a limitation of current testing methodology that better tooling might overcome. It is a fixed property of the relationship between finite observation and universal claim. Dijkstra stated it in the form that has been quoted ever since: "Program testing can be used to show the presence of bugs, but never to show their absence."
The industrial objection to formal verification was always that it was impractical at real-world scale. Programs were too large, too complex, too entangled with messy requirements to submit to the clean lines of mathematical proof. Dijkstra's reply was that the programs were too large and too complex precisely because they had been built without the discipline that would have kept them manageable. Complexity was not a given but a consequence of intellectual laziness — the accumulated result of decisions made by people who could not be bothered to think clearly about what they were building.
In Dijkstra's own practice, the discipline took the form of program derivation: the program and its correctness proof were constructed simultaneously, so that the finished program was correct by construction rather than by coincidence. The programmer did not write the code and then verify it. She derived the code from its specification through a sequence of formal steps, each of which preserved correctness. The result was a program whose reliability was a property built in from the first line rather than a property tested after the fact.
The AI era has raised the stakes of the distinction by eliminating the verification layer entirely. AI-generated code exists wholly in the domain of evidence. The builder tests. The output passes. She ships. At no point does anyone demonstrate, through formal reasoning, that the code is correct — not because anyone has decided that proof is unnecessary, but because the builder often cannot read the code well enough to construct such a demonstration, and the generating system cannot provide it. The industry's historical preference for tested code over verified code has been amplified into a structural default.
The intellectual foundations of provable correctness were laid by Hoare's 1969 paper "An Axiomatic Basis for Computer Programming," which introduced the triple notation and the inference rules now known as Hoare logic. Dijkstra's 1975 paper on guarded commands and the predicate transformer semantics extended the framework and tied it more directly to program construction. His 1976 book A Discipline of Programming was the mature statement of the methodology.
Formal verification has continued as a research program and has produced operational tools — the CompCert verified compiler, the seL4 verified microkernel, the Astrée static analyzer that has been used on Airbus flight-control software. But these remain exceptions. The industry's default remains testing, and the gap between default practice and verified practice is where most of the software the world depends on lives.
Testing is induction; verification is deduction. Successful tests generalize from observed cases to unobserved cases; proofs derive conclusions from premises with logical necessity. One can fail; the other cannot.
Correctness should be constructed, not discovered. If the program's correctness is a property to be verified after the fact, the verification is already at a disadvantage; if it is a property built in during construction, every line earns its place by contributing to the proof.
The gap widens with AI generation. When the builder does not understand the implementation, her tests reflect her understanding of the requirements, not her understanding of the code. The tests answer "does it do what I wanted?" but not "does it do what I wanted under conditions I did not think to test?"
Impossibility results are real. The 2026 verification trilemma — that soundness, generality, and tractability cannot be simultaneously achieved for sufficiently complex systems — puts a mathematical ceiling on how much proof is available. But incomplete verification is still more reliable than no verification.
Acceptable track records are bets. A program that has worked for all tested inputs has an acceptable track record. Acceptable track records say nothing about untested inputs. Shipping on that basis is not confidence; it is a wager whose payoff is invisible until it fails.
The enduring argument inside the discipline is whether the insistence on provable correctness has ever been practical for the scale at which software is actually built. Partisans of lightweight methods — extensive testing, fuzzing, property-based checking — argue that Dijkstra's standard was aspirational from the beginning and that modern statistical verification captures most of the benefit at a fraction of the cost. Partisans of the Dijkstrian position reply that every decade's "acceptable" becomes the next decade's scandal, and that the question is not whether the standard is practical but what the consequences are of not meeting it.