Provable Correctness — Orange Pill Wiki
CONCEPT

Provable Correctness

Dijkstra's insistence that a program's correctness should be established by proof — formal reasoning from specification to implementation — not by the accumulated evidence of tests that passed.

Provable correctness is the standard Dijkstra held up against the software industry for fifty years and that the industry spent fifty years finding reasons not to meet. The claim is simple: a program is a formal argument; its correctness is either demonstrable through logical reasoning or it is not; and the accumulation of successful test executions, however voluminous, does not constitute a demonstration. It can at most establish that the program behaves correctly for the tested cases. The untested cases remain, and in principle any one of them might trigger a failure that no tested input revealed. The difference between tested and verified is the difference between evidence and proof, and in domains where failure has consequences — infrastructure, medicine, finance, anything that affects human safety — the difference is not pedantic.

The Material Substrate of Proof — Contrarian ^ Opus

There is a parallel reading that begins not with the logical structure of programs but with the political economy of their production. Provable correctness presumes a builder who has time, training, and institutional support to construct formal arguments about code. In practice, those resources are distributed along stark gradients. The Airbus engineer working on flight-control software operates under regulatory mandates, liability frameworks, and budget allocations that make formal methods feasible. The contractor building municipal water-system interfaces operates under procurement processes that select for lowest bid and fastest delivery. The difference is not intellectual laziness but structural constraint.

Dijkstra's claim that complexity is a consequence of undisciplined thinking treats discipline as a choice available to all programmers equally. But discipline requires slack — the margin between what the schedule demands and what the day contains. Most software is built by people working under conditions that assume testing is verification, because the organizations commissioning the work cannot afford to assume otherwise. The AI era has not eliminated verification; it has made explicit what was always implicit: that most code exists to satisfy immediate functional requirements under resource constraints that make proof unaffordable. The question is not whether proof is superior to testing in principle, but whether the material conditions under which software is actually produced can support the practices proof requires. The gap between Dijkstra's standard and industrial reality is not a failure of nerve; it is a map of where the resources to meet that standard have been allocated and where they have not.

— Contrarian ^ Opus

In the AI Story


The background claim is logical, not empirical. Programs accept inputs from effectively infinite spaces. Testing examines a finite subset of those spaces. No finite subset of an infinite space can stand proxy for the whole. This is not a limitation of current testing methodology that better tooling might overcome. It is a fixed property of the relationship between finite observation and universal claim. Dijkstra stated it in the form that has been quoted ever since: "Program testing can be used to show the presence of bugs, but never to show their absence."
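The asymmetry can be made concrete with a deliberately contrived sketch. The function and test suite below are hypothetical, invented for illustration: every test the author thought to write passes, yet a bug waits in the branch no test exercises.

```python
def clamp(x, lo, hi):
    """Clamp x into the interval [lo, hi] -- contains a deliberate bug."""
    if x < lo:
        return lo
    if x > hi:
        return lo  # bug: should be hi; no test below reaches this branch
    return x

# The test suite: the inputs the author happened to think of.
suite = [((5, 0, 10), 5), ((-3, 0, 10), 0), ((0, 0, 10), 0), ((10, 0, 10), 10)]
assert all(clamp(*args) == want for args, want in suite)  # all tests pass

# The untested region still hides the failure: 15 should clamp to 10.
assert clamp(15, 0, 10) == 0  # wrong answer, never observed by the suite
```

Four passing tests are evidence about four points of an unbounded input space; the fifth input shows what the evidence never covered.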

The industrial objection to formal verification was always that it was impractical at real-world scale. Programs were too large, too complex, too entangled with messy requirements to submit to the clean lines of mathematical proof. Dijkstra's reply was that the programs were too large and too complex precisely because they had been built without the discipline that would have kept them manageable. Complexity was not a given but a consequence of intellectual laziness — the accumulated result of decisions made by people who could not be bothered to think clearly about what they were building.

In Dijkstra's own practice, the discipline took the form of program derivation: the program and its correctness proof were constructed simultaneously, so that the finished program was correct by construction rather than by coincidence. The programmer did not write the code and then verify it. She derived the code from its specification through a sequence of formal steps, each of which preserved correctness. The result was a program whose reliability was a property built in from the first line rather than a property tested after the fact.
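A rough flavor of correctness by construction, in miniature: the loop below was built around an invariant, and the invariant plus the exit condition yield the postcondition. This is an informal illustration with the invariant checked at runtime, not Dijkstra's predicate-transformer calculus itself.

```python
def power(b, n):
    """Compute b**n by squaring. The loop was derived from the invariant
    acc * base**exp == b**n, which each step is written to preserve."""
    acc, base, exp = 1, b, n
    while exp > 0:
        assert acc * base**exp == b**n  # invariant holds at loop entry
        if exp % 2 == 1:
            acc *= base
        base, exp = base * base, exp // 2
    # Invariant plus exit condition (exp == 0) give the postcondition:
    assert acc == b**n
    return acc
```

The point of the exercise is the order of construction: the invariant came first, and each line of the loop body earns its place by preserving it.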

The AI era has raised the stakes of the distinction by eliminating the verification layer entirely. AI-generated code exists wholly in the domain of evidence. The builder tests. The output passes. She ships. At no point does anyone demonstrate, through formal reasoning, that the code is correct — not because anyone has decided that proof is unnecessary, but because the builder often cannot read the code well enough to construct such a demonstration, and the generating system cannot provide it. The industry's historical preference for tested code over verified code has been amplified into a structural default.

Origin

The intellectual foundations of provable correctness were laid by Hoare's 1969 paper "An Axiomatic Basis for Computer Programming," which introduced the triple notation and the inference rules now known as Hoare logic. Dijkstra's 1975 paper on guarded commands and the predicate transformer semantics extended the framework and tied it more directly to program construction. His 1976 book A Discipline of Programming was the mature statement of the methodology.
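The triple notation Hoare introduced can be sketched briefly. A triple {P} C {Q} asserts that if precondition P holds and command C terminates, postcondition Q holds afterward; the axioms and rules let triples for whole programs be derived from triples for their parts.

```latex
% Hoare triple: precondition P, command C, postcondition Q.
\{P\}\; C \;\{Q\}

% Assignment axiom: substitute E for x backwards through Q.
\{Q[E/x]\}\; x := E \;\{Q\}

% Sequential composition: chain two proofs through a midpoint R.
\frac{\{P\}\; C_1 \;\{R\} \qquad \{R\}\; C_2 \;\{Q\}}
     {\{P\}\; C_1;\, C_2 \;\{Q\}}
```

Dijkstra's predicate transformers run the same idea in reverse: wp(C, Q), the weakest precondition, is computed from the command and the desired postcondition, which is what makes derivation rather than after-the-fact checking possible.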

Formal verification has continued as a research program and has produced operational tools — the CompCert verified compiler, the seL4 verified microkernel, the Astrée static analyzer that has been used on Airbus flight-control software. But these remain exceptions. The industry's default remains testing, and the gap between default practice and verified practice is where most of the software the world depends on lives.

Key Ideas

Testing is induction; verification is deduction. Successful tests generalize from observed cases to unobserved cases; proofs derive conclusions from premises with logical necessity. The generalization can fail; the deduction, granted its premises, cannot.

Correctness should be constructed, not discovered. If the program's correctness is a property to be verified after the fact, the verification is already at a disadvantage; if it is a property built in during construction, every line earns its place by contributing to the proof.

The gap widens with AI generation. When the builder does not understand the implementation, her tests reflect her understanding of the requirements, not her understanding of the code. The tests answer "does it do what I wanted?" but not "does it do what I wanted under conditions I did not think to test?"

Impossibility results are real. The 2026 verification trilemma — that soundness, generality, and tractability cannot be simultaneously achieved for sufficiently complex systems — puts a mathematical ceiling on how much proof is available. But incomplete verification is still more reliable than no verification.

Acceptable track records are bets. A program that has worked for all tested inputs has an acceptable track record. Acceptable track records say nothing about untested inputs. Shipping on that basis is not confidence; it is a wager whose payoff is invisible until it fails.

Debates & Critiques

The enduring argument inside the discipline is whether the insistence on provable correctness has ever been practical for the scale at which software is actually built. Partisans of lightweight methods — extensive testing, fuzzing, property-based checking — argue that Dijkstra's standard was aspirational from the beginning and that modern statistical verification captures most of the benefit at a fraction of the cost. Partisans of the Dijkstrian position reply that every decade's "acceptable" becomes the next decade's scandal, and that the question is not whether the standard is practical but what the consequences are of not meeting it.
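The lightweight-methods position can be sketched in a few lines. Below is a minimal, hand-rolled property check using only the standard library; all names are invented for illustration, and real tools (QuickCheck, Hypothesis) add counterexample shrinking and smarter input generation.

```python
import random

def clamp(x, lo, hi):
    """Implementation under test."""
    return lo if x < lo else hi if x > hi else x

def check_property(prop, gen, trials=1000, seed=0):
    """Evaluate `prop` on `trials` randomly generated inputs.
    Returns a counterexample, or None if every sampled input passed --
    which is evidence of correctness, not a proof of it."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = gen(rng)
        if not prop(*args):
            return args
    return None

def gen_case(rng):
    lo = rng.randint(-100, 100)
    return (rng.randint(-500, 500), lo, rng.randint(lo, lo + 200))

def result_in_range(x, lo, hi):
    return lo <= clamp(x, lo, hi) <= hi

counterexample = check_property(result_in_range, gen_case)
assert counterexample is None  # 1000 samples passed; untested inputs remain
```

The dispute in the paragraph above is exactly about that closing comment: whether a thousand sampled points, or a million, is an acceptable substitute for the universal claim a proof would deliver.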

Appears in the Orange Pill Cycle

Conditional Correctness Under Resource Constraint — Arbitrator ^ Opus

The logical claim holds without qualification: proof dominates testing as a demonstration of correctness (100%). No finite test suite establishes the behavior of a program across its full input space, and Dijkstra's 1972 aphorism remains structurally true. Where the dispute begins is at the threshold question: for which systems does the difference between testing and proof matter enough to justify the cost? Flight-control software, cryptographic primitives, medical-device firmware — domains where single-point failures cascade into fatalities — justify formal methods not because the engineering culture is more rigorous but because the liability and regulatory environment enforces rigor (90% material constraint, 10% intellectual commitment). Municipal software, internal tooling, prototype systems built under procurement rules that reward speed — these live in testing regimes not because their builders are lazy but because the institutions commissioning the work cannot or will not pay for proof.

The claim that complexity results from intellectual laziness is half-right. Some complexity is avoidable and stems from expedience (60%). But some complexity is intrinsic to the domain (40%) — legacy integration points, regulatory requirements that contradict one another, stakeholder demands that shift mid-project. Dijkstra's methodology works when the specification is stable and the builder controls the construction process. When neither condition holds, program derivation becomes aspirational.

The synthetic frame the topic needs is this: provable correctness is not a universal standard but a conditional one, triggered by the consequences of failure and enabled by the resources to construct proof. The right question is not "should all programs be verified?" but "which programs, under what conditions, and at whose expense?" AI-generated code has made that question urgent by widening the gap between what testing checks and what the system does. The answer will be determined not by logic but by where liability, regulation, and budget converge.

— Arbitrator ^ Opus

Further reading

  1. Edsger W. Dijkstra, A Discipline of Programming (Prentice-Hall, 1976)
  2. C.A.R. Hoare, "An Axiomatic Basis for Computer Programming" (Communications of the ACM, October 1969)
  3. Edsger W. Dijkstra, "Guarded Commands, Nondeterminacy and Formal Derivation of Programs" (Communications of the ACM, August 1975)
  4. Xavier Leroy, "Formal Verification of a Realistic Compiler" (Communications of the ACM, July 2009)
  5. Gerwin Klein et al., "seL4: Formal Verification of an OS Kernel" (SOSP, 2009)
Part of The Orange Pill Wiki · A reference companion to the Orange Pill Cycle.