Standard testing validates that a system does what the specification says it should do. Unit tests check that functions return expected outputs. Integration tests check that components work together under specified conditions. End-to-end tests check that user flows complete successfully. All of these are valuable, and all of them share a blind spot: they validate behavior the specification addressed. They cannot catch behaviors the specification did not anticipate, because the test cases are derived from the specification. When AI-generated code encodes assumptions the specification did not make explicit — which is precisely when leaks occur — standard testing does not surface them.
Leak detection testing targets that gap, and concurrency testing is its first category. The practice generates load that forces components to operate simultaneously and looks for race conditions, deadlocks, and data corruption. Tools like Jepsen for distributed systems, or more basic approaches using parallel request generation, subject systems to conditions that the AI's training data likely simplified or omitted. Concurrency testing would have caught the fintech case's webhook race condition; the team never ran such tests because the AI-generated code appeared to handle concurrency.
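The "parallel request generation" approach can be sketched in a few lines of Python. Everything here is an assumption for illustration: `NaiveWebhookHandler`, `LockedWebhookHandler`, and `deliver_concurrently` are hypothetical names, and the handler is a generic check-then-act example rather than a reconstruction of the fintech case. The sketch delivers the same event many times at once and counts how often it was applied.

```python
import threading
import time


class NaiveWebhookHandler:
    """Hypothetical handler with a check-then-act race (illustration only)."""

    def __init__(self):
        self.seen = set()
        self.credits_applied = 0

    def handle(self, event_id):
        if event_id not in self.seen:   # check
            time.sleep(0.01)            # widen the window, as real I/O would
            self.seen.add(event_id)     # act
            self.credits_applied += 1


class LockedWebhookHandler(NaiveWebhookHandler):
    """Same logic, with the check and the act made atomic by a lock."""

    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def handle(self, event_id):
        with self._lock:
            super().handle(event_id)


def deliver_concurrently(handler, event_id, copies=8):
    """Deliver one event `copies` times at once; return how often it was applied."""
    barrier = threading.Barrier(copies)

    def deliver():
        barrier.wait()                  # release every thread at the same moment
        handler.handle(event_id)

    threads = [threading.Thread(target=deliver) for _ in range(copies)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handler.credits_applied
```

With the locked handler, `deliver_concurrently(LockedWebhookHandler(), "evt_1")` returns 1. With the naive handler the count is typically greater than 1, although the exact number varies from run to run. A sequential unit test of either handler would pass, which is the point: the defect only appears when deliveries overlap.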
Integration boundary testing is the second, and the most directly targeted at the integration leak. The practice deliberately exercises the interfaces between generated components under varied conditions, hunting for assumption mismatches. It asks: what does component A assume about component B's state? What happens when those assumptions are violated? Can the test suite produce the specific conditions under which the mismatch manifests? The practice is labor-intensive because it requires examining the generated code at the level of the assumptions it silently encodes, which is also the level at which the diagnostic capability the practice preserves is built.
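One way to make those questions executable is to enumerate the states component B could plausibly hand over and classify how component A reacts to each. The sketch below is a minimal illustration under stated assumptions: `Account`, `render_balance`, `naive_render`, and `probe_boundary` are hypothetical names, not part of any system described in this volume.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    """What a hypothetical component B hands across the boundary."""
    balance_cents: Optional[int]
    currency: Optional[str]


class BoundaryViolation(ValueError):
    """Raised when an explicitly stated assumption about the input fails."""


def render_balance(account):
    """Hypothetical component A that checks its assumptions explicitly."""
    if account.balance_cents is None:
        raise BoundaryViolation("balance missing")
    if not account.currency:
        raise BoundaryViolation("currency missing")
    sign = "-" if account.balance_cents < 0 else ""
    units, cents = divmod(abs(account.balance_cents), 100)
    return f"{sign}{units}.{cents:02d} {account.currency}"


def naive_render(account):
    """Hypothetical component A that silently assumes both fields are set."""
    return f"{account.balance_cents / 100:.2f} {account.currency.upper()}"


BOUNDARY_CASES = [
    Account(12345, "USD"),   # the state the happy path assumes
    Account(-50, "USD"),     # negative balance
    Account(None, "USD"),    # balance not yet loaded
    Account(100, None),      # currency never set
]


def probe_boundary(consumer, cases):
    """Classify each case as 'ok', 'rejected' (explicit) or 'leak' (unplanned)."""
    outcomes = []
    for case in cases:
        try:
            consumer(case)
            outcomes.append("ok")
        except BoundaryViolation:
            outcomes.append("rejected")
        except Exception as exc:  # any other failure is an unstated assumption
            outcomes.append(f"leak: {type(exc).__name__}")
    return outcomes
```

Running `probe_boundary(naive_render, BOUNDARY_CASES)` reports a leak for the last two cases, while `render_balance` rejects them explicitly. The useful work is in writing `BOUNDARY_CASES`: each entry records an assumption someone had to dig out of the generated code.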
Failure injection is the third category. Together with chaos engineering (pioneered at Netflix and now practiced at many major infrastructure operators), it deliberately disables dependencies to observe whether the system degrades gracefully or cascades. The database becomes unavailable; the cache fails; the external API returns errors. Does the system handle the failures as designed? Or does it reveal that no one designed the failure handling, because the AI generated the happy-path code and nobody asked about the unhappy paths? Failure injection forces that question into the open.
Security scanning, the fourth category, addresses the specific leak class of vulnerabilities that postdate the AI's training data: patterns that were considered secure when the training data was collected and no longer are.
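To make the failure-injection idea concrete, the sketch below wraps each dependency in a switch that simulates an outage, then checks what the caller does. It is an illustration under stated assumptions: `FlakyDependency`, `cache_lookup`, `db_lookup`, and `get_profile` are hypothetical names, and the degradation policy shown is one possible design rather than a recommendation.

```python
class DependencyDown(Exception):
    """Simulated outage of a dependency."""


class FlakyDependency:
    """Wraps a callable so a test can switch it into a failing state."""

    def __init__(self, fn):
        self._fn = fn
        self.down = False

    def __call__(self, *args):
        if self.down:
            raise DependencyDown(self._fn.__name__)
        return self._fn(*args)


def cache_lookup(user_id):
    """Hypothetical cache that always misses, to force the database path."""
    return None


def db_lookup(user_id):
    """Hypothetical database returning a fixed record."""
    return {"id": user_id, "name": "Ada"}


def get_profile(user_id, cache, db):
    """Hypothetical service call with an explicit plan for each outage."""
    try:
        hit = cache(user_id)
    except DependencyDown:
        hit = None                      # cache outage is treated as a miss
    if hit is not None:
        return hit
    try:
        return db(user_id)
    except DependencyDown:
        # Database outage: return a marked placeholder instead of cascading.
        return {"id": user_id, "name": None, "degraded": True}


def profile_under_outage(cache_down, db_down):
    """Inject the requested outages and return what the caller would see."""
    cache = FlakyDependency(cache_lookup)
    db = FlakyDependency(db_lookup)
    cache.down, db.down = cache_down, db_down
    return get_profile(7, cache, db)
```

Calling `profile_under_outage` with each combination of flags gives a small outage matrix. If `get_profile` had been written for the happy path only, the combinations with a dependency down would raise `DependencyDown` into the caller, which is exactly the cascade the test is meant to expose.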
The individual techniques (concurrency testing, integration boundary testing, failure injection, security scanning) are decades old and well-established in high-reliability contexts. The framing of them as a coherent response to AI-generated code's specific leak profile took shape in 2025–2026 and is formalized in Chapter 10 of this volume. The Spolsky-lens contribution is not the techniques themselves but the argument that they are no longer optional additions to standard testing; in the AI era, they are the difference between a system whose leaks will be caught during testing and a system whose leaks will be caught in production.
Standard testing validates specified behavior. Leak detection testing probes for behaviors the specification did not address.
Four principal categories. Concurrency, integration boundaries, failure injection, current-threat security scanning — each targeting a specific leak class.
The techniques are not exotic. Each has been used in high-reliability contexts for decades; the novelty is the systematic application to AI-generated code.
The practice grows diagnostic capability. Running these tests requires examining generated code at the level where leaks live, which is also where diagnostic intuition is built.
AI does not test itself this way. The generation process produces code optimized for the happy path; the unhappy path is the human's responsibility to probe.