By Edo Segal
The dashboard I trust most is the one I should trust least.
Green lights across the board. Deployments passing. Tests clean. Revenue climbing. Every metric my organization tracks tells the same story: the system is performing. And performance, in the logic of every company I have ever built or advised, is the evidence that things are working.
Diane Vaughan spent a decade inside the wreckage of the Challenger disaster and emerged with a finding that I cannot stop thinking about: the evidence that things are working is often the mechanism by which they stop working. Not despite the green lights. Because of them. Each successful deployment that was not comprehensively reviewed reinforces the confidence that comprehensive review is unnecessary. Each quarter where the leaner team hits its numbers reinforces the confidence that the old team's redundancy was fat, not muscle. The success is real. The erosion it conceals is also real. And no metric I currently track measures the distance between the two.
She called the mechanism normalized deviance. I call it the thing I was doing before I had a name for it.
In the months since I took the orange pill, I have written about the exhilaration of a twenty-fold productivity multiplier. About engineers reaching across disciplines. About the imagination-to-artifact ratio collapsing to the width of a conversation. I meant all of it. I still mean it. But Vaughan forced me to look inside the multiplier and ask what accommodations made the speed possible. The review that became a scan. The comprehension that became an assumption. The second pair of eyes that became a luxury the sprint could no longer afford. Each one reasonable. Each one invisible. Each one a quarter-inch of erosion in a seal that nobody is measuring because the shuttle keeps flying.
This is not a book about pessimism. Vaughan is not telling you to stop building. She is telling you that the institutions most vulnerable to catastrophic failure are the ones that look, by every available metric, like they are performing brilliantly. That the absence of a disaster is not evidence of safety. That reasonable people making reasonable decisions under production pressure can drift, collectively and invisibly, into a posture that no individual among them would have chosen if they could see the full distance they had traveled.
I needed this lens. Not to replace the exhilaration but to protect it. Because the exhilaration is real, and so is the drift, and the builder who cannot hold both at once is the builder most likely to discover the gap between performance and safety on the morning when the gap is the only thing that matters.
— Edo Segal ^ Opus 4.6
1950–present
Diane Vaughan (1950–present) is an American sociologist whose work on organizational behavior, institutional failure, and the sociology of deviance has shaped how scholars and practitioners understand catastrophic breakdowns in complex systems. She earned her Ph.D. from Ohio State University and has held faculty positions at Boston College and, currently, in Columbia University's Department of Sociology. Her landmark work, *The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA* (1996), introduced the concept of "the normalization of deviance" — the process by which organizations incrementally redefine acceptable risk through sequences of individually reasonable decisions until conditions once considered failures become routine. The book, based on nearly a decade of archival research and interviews, challenged the prevailing narrative that the Challenger disaster resulted from managerial wrongdoing, demonstrating instead that the failure was produced by the ordinary operation of institutional culture. Vaughan extended her framework in subsequent work, including *Dead Reckoning: Air Traffic Control, System Effects, and Risk* (2021) and her theoretical contributions on analogical reasoning and the sociology of organizations. Her concept of "structural secrecy" — the way organizational architecture filters and distorts information as it moves between specialized units — has become foundational in fields ranging from aviation safety to healthcare quality. Vaughan's influence extends well beyond academia; her frameworks are now standard references in risk management, systems engineering, and institutional design, and her work is increasingly cited in discussions of artificial intelligence governance and the organizational risks of AI-augmented decision-making.
On the evening of January 27, 1986, engineers at Morton Thiokol in Utah held a teleconference with managers at NASA's Marshall Space Flight Center in Huntsville, Alabama. The forecast for the following morning at Kennedy Space Center called for temperatures in the low twenties — colder than any previous shuttle launch. The engineers had data showing that the rubber O-rings sealing the joints of the solid rocket boosters lost resilience in cold weather. They recommended against launching.
What happened next has been studied, debated, and mythologized for four decades. But the most important thing about that evening is what did not happen. The engineers were not overruled by villains. They were not silenced by corporate greed. They were not ignored by incompetent bureaucrats. They were participants in an organizational process that had been running, with meticulous institutional logic, for years — a process in which the boundaries of acceptable risk had been expanded, flight by flight, until the conditions that would destroy Challenger fell inside limits that the organization had taught itself to consider normal.
Diane Vaughan, the Columbia University sociologist who spent nearly a decade reconstructing the decision-making chain that led to the disaster, gave this process a name: the normalization of deviance. The concept is precise, empirically grounded, and more relevant to the current moment in technology than any framework developed in the decades since. It describes not the failure of an individual but the failure of a system — a system in which reasonable people, acting within institutional constraints, making defensible judgments based on available evidence, produce an outcome that none of them would have chosen and all of them enabled.
The mechanism Vaughan documented operates through four phases. First, an anomaly is observed. On the second shuttle flight, engineers noticed that hot combustion gases had eroded the primary O-ring in one of the booster joints — a condition the design had not anticipated. The O-ring was supposed to maintain a perfect seal. Erosion of any kind was, by the original specification, a failure. Second, the anomaly is assessed. The erosion was minor. The shuttle had flown successfully. The primary ring had not burned through. A secondary ring provided backup. The engineers evaluated the data and concluded that the erosion, while anomalous, fell within limits they judged acceptable given the evidence of successful flight. Third, the anomaly is normalized. The limits of acceptable performance expanded to accommodate the observation. What had been a design violation became an expected condition, documented in reports, discussed in meetings, classified as a known and managed risk. Fourth, the normalized anomaly becomes the new baseline. Future observations of erosion were assessed not against the original specification of zero erosion but against the expanded limits. Each new data point was compared to the previous accepted range, not to the standard the system was designed to meet.
Twenty-four flights of this process produced an organization in which conditions that would have grounded the shuttle on flight two were considered routine by flight twenty-five. No single decision in the chain was indefensible. The engineers were not cutting corners. They were applying judgment to ambiguous data under institutional pressure, and each judgment was consistent with the evidence available at the time. The catastrophe was not the product of one bad decision. It was the product of twenty-four adequate ones, each of which shifted the ground on which the next was made.
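The four phases can be compressed into a deliberately crude numerical sketch. Everything here is invented for illustration (the erosion values, the acceptance rule, the function name); the point is only that a limit assessed against its own previous value, rather than against the original specification, can only ratchet outward.

```python
# Toy model of baseline drift -- an illustrative sketch, not Vaughan's
# methodology. All numbers are hypothetical.

def normalized_limits(observations, spec=0.0):
    """Track the operative acceptance limit across a sequence of flights.

    Each flight succeeds (no catastrophic failure), so each observed
    anomaly is assessed against the *current* limit rather than the
    original spec, and the limit expands to cover it: the phase-four
    baseline shift.
    """
    limit = spec
    history = []
    for erosion in observations:
        # Phases 1-2: anomaly observed and assessed. The flight flew,
        # so the erosion is judged acceptable.
        # Phases 3-4: the limit quietly recalibrates to the worst case
        # seen so far and becomes the baseline for the next assessment.
        limit = max(limit, erosion)
        history.append(limit)
    return history

# Design spec: zero erosion. Observed erosion creeps upward flight by flight.
flights = [0.00, 0.05, 0.04, 0.08, 0.07, 0.12]
print(normalized_limits(flights))  # [0.0, 0.05, 0.05, 0.08, 0.08, 0.12]
```

No single step in the loop compares the new observation to `spec`; that omission is the entire mechanism.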
The application to the current moment in artificial intelligence is not metaphorical. It is structural. The same four-phase mechanism that Vaughan documented at NASA is operating, in real time, across every domain in which AI tools have been adopted — and the people participating in it are no more aware of the drift than the Thiokol engineers were on the evening of January 27.
Consider the software engineer who begins using Claude Code in January 2026. The first time she receives AI-generated code, she reviews it with the care of a person encountering an unfamiliar system. She reads it line by line. She traces the logic. She checks edge cases. She finds two errors — minor, easily corrected — and concludes that the tool is competent but requires oversight. This is a reasonable judgment, well supported by the evidence.
The second time, her review is somewhat less thorough. The first experience established a baseline of reliability. The code passes its tests. She reads the critical sections, scans the rest, corrects one minor issue. This too is a reasonable judgment. The tool performed well last time. The marginal return on exhaustive review has diminished.
By the tenth iteration, the review has compressed further. She reads the function signatures, checks the test results, glances at the structure. The code has been consistently competent. The deadline is pressing. Other tasks demand her attention. She is not being negligent. She is allocating her finite cognitive resources according to a rational assessment of where those resources are most needed, given the tool's demonstrated track record.
By the fiftieth iteration, the review has become a formality — a gesture preserved in the workflow, emptied of the substance it originally contained. She deploys the code. It works. It has always worked. The accumulated history of acceptable performance has redefined what constitutes adequate oversight, just as the accumulated history of acceptable O-ring erosion redefined what constituted an acceptable seal.
No memo was written authorizing the reduction in oversight. No manager instructed her to review less carefully. No policy changed. The standard drifted, decision by decision, each decision reasonable in isolation, until the practice that was originally understood to be essential became something closer to a memory of a practice — a form without content, a ritual without function.
This is how Vaughan's framework operates: not through dramatic negligence but through the incremental, socially negotiated redefinition of what counts as acceptable. The word "socially" is critical. The normalization does not happen inside a single mind. It happens inside an organization, a team, a professional community. The engineer who reduces her review is not operating in isolation. She is operating in an environment where her colleagues are making the same adjustments, for the same reasons, arriving at the same conclusions. The drift is collective. It is reinforced by the observed behavior of peers, by the absence of negative consequences, by the production pressure that rewards speed and penalizes delay, and by the simple human tendency to calibrate expectations to experience rather than to specifications.
Vaughan's insight — the one that separates her analysis from the thousands of post-disaster investigations that came before — is that the mechanism does not require malice, incompetence, or even negligence in any recognizable form. It requires only the ordinary operation of institutional life: people making reasonable decisions under pressure, with incomplete information, in an environment that rewards proceeding and penalizes stopping. The tragedy is not that someone was reckless. The tragedy is that no one was, and seven people died anyway.
The AI transition is, in Vaughan's terms, a normalization engine of unprecedented power. Every feature that makes AI tools productive — the speed of output, the competence of results, the reduction of mechanical friction — simultaneously accelerates the mechanism by which oversight standards erode. The speed means there is less time for review. The competence means each individual review that is skipped produces no visible consequence. The reduced friction means the cognitive cost of proceeding without review has dropped while the cognitive cost of conducting review has remained constant, creating an economic calculus that tilts, with each iteration, further toward acceptance and further from scrutiny.
Johann Rehberger, a cybersecurity researcher who applied Vaughan's framework directly to AI systems in December 2025, identified the pattern with uncomfortable specificity. Companies, he observed, were treating probabilistic, non-deterministic, and sometimes adversarial model outputs as though they were reliable, predictable, and safe. The models would not consistently follow instructions, maintain alignment, or preserve context integrity, yet organizations were permitting untrusted output to take consequential actions. Most of the time, the output performed adequately. And over time, organizations lowered their guard or eliminated human oversight entirely, because the previous outputs had been acceptable. Each acceptable output reinforced the expectation of the next. The absence of a catastrophic failure was confused with the presence of robust safety.
This confusion — between the absence of failure and the presence of safety — is the cognitive signature of normalized deviance. The Challenger engineers did not believe the O-rings were dangerous, because the O-rings had not yet failed catastrophically. The AI-augmented organization does not believe its reduced oversight is dangerous, because the reduced oversight has not yet produced a catastrophic failure. In both cases, the evidence for safety is negative evidence: the thing that has not happened. And negative evidence, as any scientist will confirm, proves nothing about whether the thing will happen next.
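The weakness of negative evidence can be made quantitative. A standard statistical result, sometimes called the "rule of three," bounds how large a failure probability remains consistent with an unbroken run of successes; the sketch below is generic statistics, not drawn from Vaughan or Rehberger.

```python
def failure_rate_upper_bound(n_successes, confidence=0.95):
    """Largest per-output failure probability p consistent (at the given
    confidence level) with observing n consecutive successes and zero
    failures. Solves (1 - p)**n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_successes)

# Fifty flawless outputs in a row still admit a failure rate near 6%:
print(round(failure_rate_upper_bound(50), 3))  # 0.058
```

Fifty successes feel like proof of safety; statistically, they rule out almost nothing about the fifty-first.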
Vaughan's framework predicts that the drift will continue until one of two things occurs: either the accumulated deviance is detected and corrected by institutional structures designed for that purpose, or a trigger event — the cold morning, the edge case, the adversarial input — exposes the gap between the standard the organization believes it is maintaining and the standard it is actually practicing. The gap, invisible under normal operating conditions, becomes the only thing anyone can see the morning after the failure.
The implications extend well beyond software engineering, because the mechanism is not domain-specific. The law firm that uses AI to draft briefs undergoes the same four-phase drift in its review standards. The medical team that uses AI to screen imaging undergoes the same drift in its diagnostic scrutiny. The educational institution that uses AI to assess student work undergoes the same drift in its evaluation of what students actually understand versus what they can produce. In every case, the functional output of the AI system is competent, the review that was originally applied to that output erodes incrementally, and the erosion is invisible because the outputs continue to perform adequately under normal conditions.
The concept matters now — urgently, specifically, practically — because the AI transition has compressed the timeline over which normalization occurs. The Challenger's drift unfolded over twenty-four flights spanning five years. The AI-augmented organization's drift unfolds over weeks. The developer who begins careful review in January may be deploying without review by March, not because she has become less conscientious but because fifty successful iterations in sixty days have taught her nervous system, her team culture, and her organizational workflow that review is a friction the system no longer requires.
The system does still require it. The system always required it. The question is whether anyone will notice the erosion before the temperature drops.
The O-ring that failed on the morning of January 28, 1986, was not a complex component. It was a rubber gasket, roughly a quarter-inch thick in cross-section, designed to seal the joint between two segments of the solid rocket booster. Its function was straightforward: prevent the 5,000-degree combustion gases inside the booster from escaping through the joint and reaching the external fuel tank. The O-ring's simplicity was part of the problem. Because the component was simple, and because its failure mode — erosion under heat — was well understood, the engineering community developed a confidence in its behavior that the accumulating evidence should have undermined.
Vaughan's reconstruction of the decision-making chain reveals that the O-ring erosion was not a secret. It was not buried in classified reports or concealed by a cover-up. It was discussed openly, in teleconferences, in flight readiness reviews, in engineering memoranda that moved through official channels. The erosion was a known phenomenon. What changed, flight by flight, was the organizational meaning assigned to the phenomenon. On early flights, erosion was classified as an anomaly — a deviation from the design specification that required investigation and resolution before the next launch. By the later flights, erosion had been reclassified as an acceptable condition — a feature of the system's actual behavior that, while different from its designed behavior, had not produced a failure and therefore fell within the operational envelope the organization had constructed around its experience.
The reclassification was not arbitrary. It was supported by data, by engineering analysis, by the fact that multiple flights had flown successfully with O-ring erosion. The engineers were not ignoring evidence. They were interpreting evidence within a framework that had expanded to accommodate the anomaly, and the expansion of the framework was itself a product of the institutional pressures — schedule, budget, the burden of proof falling on those who wished to stop rather than those who wished to proceed — that shaped every decision in the shuttle program.
The structural parallel to AI-augmented work is not a loose analogy. It operates at the level of mechanism. In the AI system, the O-rings are the assumptions — typically unstated, rarely examined, progressively relaxed — that connect AI-generated output to the decisions and systems that depend on it.
The first assumption is correctness: the output is accurate. When a developer receives AI-generated code that passes its tests, the assumption of correctness is reinforced. When the code is deployed and functions as expected, the assumption deepens. Each successful deployment is another flight with acceptable erosion. The assumption is tested under normal conditions — normal inputs, normal loads, normal use cases — and found adequate. What the assumption is not tested against are the conditions that exceed the normal range: the edge case, the adversarial input, the interaction between components that was not anticipated by the tests, the load pattern that emerges only at scale, the user behavior that the designer did not imagine.
The second assumption is reliability: the output is consistent. The AI will produce work of similar quality each time. But large language models are probabilistic systems. Their outputs vary with context, with the phrasing of the prompt, with factors that are not transparent to the user. The developer who has received fifty competent code outputs has built a model of the tool's reliability that may not survive the fifty-first — not because the tool has degraded but because the fifty-first prompt happens to land in a region of the model's capability space where its performance is materially different from the region the developer has sampled.
The third assumption is sufficiency: the tests that verify the output are adequate to the risks the output creates. This is perhaps the most consequential assumption, because it conflates one kind of verification with another. Functional testing verifies that the code does what it is supposed to do under the conditions the tests specify. It does not verify that the code handles conditions the tests do not specify. It does not verify that the code's logic is sound in a way that will be resilient under novel circumstances. It does not verify that the code's interaction with other components will be benign across the full range of states the system might enter. These verifications were traditionally performed by the human review process — by the engineer who read the code, understood its logic, and brought to the review a contextual understanding of the system that no test suite could replicate. The assumption that functional testing is sufficient is the assumption that the human review was not performing any function that the tests do not also perform. This assumption is incorrect, and its incorrectness is invisible precisely as long as the conditions remain within the range the tests cover.
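A minimal illustration of the sufficiency assumption (the function and its test suite are hypothetical): the code is fully correct with respect to its tests, and the defect lives entirely in the conditions the tests never specify.

```python
def average_latency(samples):
    """Mean request latency in milliseconds."""
    return sum(samples) / len(samples)

# The functional suite exercises the normal range, and everything passes:
assert average_latency([10, 20, 30]) == 20
assert average_latency([5]) == 5

# But the suite is silent about the condition it never specifies. A quiet
# monitoring window with no samples -- average_latency([]) -- raises
# ZeroDivisionError in production: exactly the unstated edge a human
# reader of the code would have asked about, and a passing test suite
# never will.
```

The green checkmark verifies conformance to the specified cases; it carries no information about the unspecified ones.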
The Challenger's O-ring erosion was an observable phenomenon. Engineers could measure it, quantify it, track it across flights. The assumptions embedded in AI-augmented work are less visible because they are structural rather than physical. Nobody measures the assumption of correctness. Nobody tracks the erosion of review standards on a chart. Nobody documents the moment when "review the output carefully" became "check that the tests pass." The drift is not recorded because nobody recognizes it as drift. It is experienced as adaptation — as the natural, sensible adjustment of practice to the demonstrated capabilities of a new tool.
Vaughan's framework is precise about why this matters. The normalized deviance does not produce failure under normal conditions. It produces failure under extraordinary conditions — conditions that the normalized assumptions did not account for because they were never tested against them. The O-rings functioned adequately at temperatures within the range of previous flights. When the temperature dropped below that range, the erosion that had been classified as acceptable became the erosion that destroyed the vehicle.
The AI-augmented system's "temperature drop" could take many forms. A cybersecurity incident in which an adversary exploits a vulnerability in AI-generated code that no human reviewed with sufficient depth to detect. A medical system in which an AI-assisted diagnosis is wrong in a way that the automated verification pipeline was not designed to catch, and the clinician who accepted the diagnosis did so with the reduced scrutiny that months of accurate outputs had normalized. A financial model whose AI-generated assumptions contained a subtle correlation error that, under normal market conditions, produced negligible distortion but under stress conditions amplified into a cascading failure.
None of these scenarios requires the AI to be broken. None requires the model to be poorly designed or the engineers to be incompetent. They require only that the gap between what the organization believes its standards are and what its standards actually are has widened enough that the extraordinary condition finds the system unprotected. The gap is the normalized deviance. The extraordinary condition is the trigger. The failure is proportional to the gap.
Pravin Kothari, writing about enterprise AI risk in late 2025, made the connection explicit. The lack of publicized catastrophic failures, he observed, may lull organizations into believing they are safer than they are — a textbook symptom of normalization of deviance. The observation captures the paradox at the center of Vaughan's framework: the safer the system appears, the more vulnerable it may be, because the appearance of safety is itself a product of the normalization that has eroded the structures providing actual safety. The O-rings appeared safe because they had not yet failed. The appearance was not evidence of safety. It was evidence that the failure conditions had not yet been encountered.
The Challenger investigation revealed that at no point in the normalization process did anyone say, "Let us accept a higher level of risk." The language of the flight readiness reviews was consistently the language of safety. The engineers believed they were maintaining standards. The managers believed the engineering judgments were sound. The organizational process that was designed to catch anomalies and prevent launches under unsafe conditions functioned exactly as designed — and produced a launch under unsafe conditions, because the definition of "unsafe" had been incrementally revised until the conditions of January 28 fell just inside the boundary.
This is the property of normalized deviance that makes it resistant to conventional safety interventions. If the participants recognized they were relaxing standards, the relaxation could be addressed through training, through enforcement, through the conventional mechanisms of institutional accountability. But the participants do not recognize it. They experience themselves as maintaining standards — standards that have been, through the cumulative weight of experience and institutional negotiation, quietly redefined.
The developer who deploys AI-generated code without deep review experiences herself as maintaining responsible practice. The tests pass. The output is competent. She has exercised judgment about where to allocate her attention. By the standards she has internalized — standards shaped by months of successful deployment, by the observed behavior of her peers, by the production pressure that rewards throughput — she is doing her job well.
She is also, in Vaughan's terms, the engineer on flight twenty-four. The O-ring has eroded before, and the shuttle has not failed, and the evidence base for concern is ambiguous, and the schedule is pressing. The decision to proceed is reasonable. It has been reasonable every time. And it will be reasonable on the morning when it is also catastrophic.
The point is not that every instance of reduced review will produce a catastrophe. Most will not, just as most flights with O-ring erosion did not produce a catastrophe. The point is that the mechanism by which the standards erode is indistinguishable, from inside the process, from the mechanism by which standards appropriately adapt to new information. The engineer cannot tell, from her position within the drift, whether she is sensibly calibrating her oversight to the tool's demonstrated reliability or incrementally dismantling the safety structures that protect against the failure the tool has not yet produced. The two processes look identical until the morning they diverge.
The most precise contribution of Vaughan's Challenger research was not the identification of normalized deviance as a phenomenon but the documentation of its mechanism — the specific, replicable, four-phase process by which standards migrate from their original position to a position that accommodates what would once have been considered a failure. The mechanism operates below the threshold of organizational awareness. It is not a decision anyone makes. It is a decision that makes itself, through the accumulated weight of individually reasonable judgments, until the standard that the organization believes it is upholding bears only a nominal relationship to the standard it is actually practicing.
The four phases are observation, assessment, normalization, and baseline shift. Each phase is rational in isolation. The sequence is catastrophic in aggregate. And the mechanism, because it operates through ordinary institutional processes rather than through identifiable failures of judgment, is invisible to the institutions it inhabits.
Applied to the AI transition, the mechanism can be traced with the specificity that Vaughan's methodology demands — not in the abstract but through reconstructed decision chains in specific organizational contexts.
Consider a development team at a mid-size technology company. In January 2026, following the productivity demonstrations that Segal describes in *The Orange Pill*, the team adopts Claude Code as a standard development tool. The team lead establishes a review protocol: all AI-generated code must be reviewed by a human engineer before deployment. The protocol is documented, discussed in the team meeting, and integrated into the workflow.
Phase one: observation. In the first week, a senior engineer reviews an AI-generated module and finds that it handles a database connection incorrectly — the connection is not closed under certain error conditions, creating a potential resource leak. The error is not catastrophic. Under normal load, it would produce a slow degradation in database performance over time. Under heavy load, it could exhaust the connection pool and bring down the service. The engineer corrects the error, notes it in the code review record, and proceeds. The observation is: AI-generated code contains errors that functional tests do not catch but that human review does.
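The defect described above can be reconstructed in miniature. The sketch is hypothetical (sqlite3 stands in for whatever database client the team used, and the names are invented), but the shape of the bug is the classic one: cleanup runs on the success path and is skipped on every error path.

```python
import sqlite3

def fetch_user_buggy(db_path, user_id):
    """Passes its functional tests: under normal inputs the query
    succeeds and the connection is closed."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    conn.close()  # never reached if execute() raises -- the resource
    return row    # leak exists only on the error path the tests skip

def fetch_user_fixed(db_path, user_id):
    """The shape of the human reviewer's correction: cleanup is
    guaranteed on success and on every error path."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
    finally:
        conn.close()  # released whether the query succeeds or raises
```

Under normal load the two versions behave identically, which is precisely why a shrinking review would never register the difference; the buggy version degrades only as failed queries accumulate unclosed connections.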
Phase two: assessment. The team lead reviews the incident. One error in one module in one week. The error was caught by the existing review process. The process worked. The assessment is: the tool is competent but imperfect, and the review protocol is necessary and effective. This assessment is correct. It is also the assessment that will, through its very correctness, contribute to the erosion of the standard it validates.
Because over the following weeks, the team generates more code with AI assistance. The review protocol catches errors at a declining rate — not because the errors are disappearing, but because the accumulation of successful outputs is recalibrating the reviewers' attention. The first review was conducted with the alertness of unfamiliarity. The tenth review is conducted with the efficiency of established practice. The thirtieth review is conducted with the speed of routine.
Phase three: normalization. By the second month, the team lead notices that reviews are taking less time. She asks a senior engineer about it. The engineer explains that the AI's output has been consistently competent, that the error rate has been low, and that she has learned to focus her review on the sections most likely to contain issues — connection handling, error paths, security-sensitive operations — rather than reading every line. This is, by any conventional standard, a reasonable approach to review. Experienced engineers have always focused their attention on the areas of highest risk. The AI's demonstrated competence has provided a rational basis for triaging the review.
But the triaging is a normalization. The original standard — review all AI-generated code — has been replaced by a practical standard — review the sections most likely to contain errors. The practical standard is informed by the AI's track record, which is itself a product of normal operating conditions. The sections the engineer is not reviewing are the sections where the AI has performed well under normal conditions. They are also the sections where an anomaly that only manifests under abnormal conditions would go undetected.
Phase four: baseline shift. By the third month, a new engineer joins the team. She is onboarded into the existing workflow, which includes AI-assisted development and code review. But the review practice she observes and absorbs is the practical standard — the triaged review, the focused attention on known risk areas, the efficient processing of AI output that her colleagues have developed over two months of experience. The original standard — review everything — exists in the documentation she was given on her first day. It does not exist in the practice she observes. The practical standard becomes her baseline. Future drifts in review practice will be measured against this already-drifted standard, not against the original.
The new engineer is not negligent. She is competent, attentive, and acting in perfect alignment with the practices of her team. She is also, in Vaughan's terms, a carrier of the normalized deviance — a person whose baseline has been set by the accumulated drift of an institution she joined after the drift began. She cannot see the gap between the formal standard and the practical standard, because the practical standard is the only one she has experienced.
This is the mechanism by which organizations lose the capacity to detect their own drift. Each generation of participants inherits the standards of the previous generation, but the inherited standards are the practiced standards, not the specified ones. The documentation says one thing. The culture does another. And the culture, because it is reinforced daily by the observed behavior of competent colleagues, is more powerful than any document.
Vaughan demonstrated this dynamic at NASA with forensic precision. Engineers who joined the shuttle program in its later years inherited an organizational understanding of O-ring erosion that had already been normalized. They did not experience the original specification of zero erosion. They experienced a working standard that accommodated some erosion, and that standard was the water they swam in. They could not see it as drift because they had never experienced the original ground.
The same mechanism operates across the domains that AI is transforming. In a law firm that adopted AI-assisted brief drafting in early 2026, the initial practice might have been to verify every citation, check every case reference, confirm every legal argument against the primary sources. The first attorney to use the tool verified everything and found the output largely accurate, with occasional errors in case characterization — errors that, while not immediately harmful, would have been caught by any experienced attorney reading the brief. The verification was time-consuming. The errors were infrequent. The subsequent attorneys verified less, focusing on the sections of the brief that addressed novel legal questions and trusting the AI's accuracy on established precedent.
By the third quarter, new associates joining the firm learned a brief-writing workflow that included AI drafting and selective verification. The selective verification was the practical standard they inherited. They had no experience of the exhaustive verification that preceded it. The formal standard — check everything — remained in the training materials. The practical standard — check what matters most, trust the tool on the rest — became the culture.
Scott Snook, whose work on practical drift built directly on Vaughan's framework, documented the same mechanism in the military. The gap between doctrine and practice, Snook showed, is not produced by disobedience. It is produced by adaptation — by the countless small adjustments that practitioners make to reconcile formal requirements with operational realities. Each adjustment is locally sensible. The accumulated distance between doctrine and practice becomes the space in which catastrophe lives, because the doctrine was designed to prevent the failures that practice is no longer positioned to catch.
In knowledge work, the gap between formal standards and practical standards has a specific character that distinguishes it from the industrial and military contexts Vaughan and Snook studied. The formal standards of knowledge work are, in many domains, implicit rather than codified. There is no flight readiness review for a software deployment. There is no written specification for how carefully an attorney should verify a citation. The standards are embedded in professional training, in the expectations transmitted by mentors and colleagues, in the unspoken norms of what constitutes "good work" in a given field. When these implicit standards drift, there is no document to compare against, no specification to point to, no formal gap to measure. The drift is invisible not only because it is incremental but because the standard against which it should be measured was never written down.
This makes the AI-driven drift in knowledge work particularly insidious. In the Challenger case, the original specification — zero O-ring erosion — existed as a formal engineering requirement. Vaughan could document the gap between the specification and the normalized practice because the specification was explicit. In AI-augmented knowledge work, the equivalent specification — understand what you build, verify what you deploy, comprehend the systems you depend on — exists as a professional norm rather than a formal requirement. There is no document that says a developer must understand every line of code she deploys. There is an expectation, a cultural standard, a professional identity built around the idea that understanding is integral to competence. When that expectation erodes, the erosion is not measured against a formal specification. It is measured against the shifting baseline of what colleagues around the practitioner are actually doing.
The practical consequence is that the drift in AI-augmented knowledge work may be even harder to detect and correct than the drift Vaughan documented at NASA. At NASA, the normalization of deviance operated against a backdrop of formal safety processes that, however inadequate they proved to be, at least provided a structure for identifying and discussing anomalies. In AI-augmented knowledge work, the normalization operates against a backdrop of implicit professional norms that have no institutional mechanism for enforcement, no formal process for review, and no standard against which the drift can be measured except the standard the drift itself has revised.
The standards are drifting now. They are drifting in software engineering, in legal practice, in medical diagnosis, in financial analysis, in education, and in every domain where AI-generated output has become part of the workflow. The drift is rational at every step, invisible to the participants, and cumulative in its effect. And the gap between the standard the organization believes it is maintaining and the standard it is actually practicing widens by a millimeter with each prompt, each review that is slightly less thorough than the last, each new hire who inherits the practical standard and mistakes it for the only standard that has ever existed.
The gap is where catastrophe lives. It is invisible until the extraordinary condition arrives. And by the time it arrives, the gap may be too wide to bridge.
The most dangerous sentence in organizational life is not "ignore the safety protocol." It is "just this once."
The phrase carries within it a claim about scope — the exception is bounded, temporary, justified by specific circumstances that will not recur — and a claim about risk — the exception is small enough that its consequences, if any, will be manageable. Both claims are usually true in the moment they are made. The exception is bounded. The circumstances are specific. The risk is small. And the very truthfulness of these claims is what makes the exception a mechanism of erosion rather than a discrete event, because the next exception will also be bounded, will also be specific, will also carry a small risk, and will be evaluated not against the original standard but against the standard that the previous exception has already revised.
Vaughan's Challenger investigation traced the genealogy of reasonable exceptions with the patience of a historian and the precision of an epidemiologist. She showed that no single decision in the chain that led to the launch was, considered in isolation, unreasonable. Each was a judgment call, made by competent professionals, supported by available data, consistent with the organizational culture and the institutional pressures of the moment. The engineers who accepted O-ring erosion on flight six had data from flights one through five. The engineers who accepted it on flight twelve had data from flights one through eleven. Each exception rested on the accumulated evidence of previous exceptions that had not produced a failure, and the evidence base grew more convincing with each successful flight.
The reasonable exception is the unit of normalized deviance. It is the atom from which the molecule is built. And it is resistant to prevention precisely because it is reasonable — because the person making the exception can point to evidence, to precedent, to a rational cost-benefit analysis that supports the choice to proceed. Telling a competent professional that her reasonable judgment is wrong is a difficult organizational act under any circumstances. Telling her it is wrong when the accumulated evidence supports her position and the immediate cost of the alternative — delay, lost productivity, competitive disadvantage — is visible and quantifiable while the risk of proceeding is speculative and diffuse, is an act of organizational heroism that institutional structures rarely reward.
The AI transition is generating reasonable exceptions at a pace and scale that no previous technological transition has approached, for a reason embedded in the technology itself: AI-generated output is competent. Not perfect, not infallible, but competent — good enough, consistently enough, that the reasonable exception to full review is supported by a growing evidence base of acceptable performance. The developer who deploys AI-generated code without comprehensive review is not being reckless. She has data. Fifty previous deployments. Ninety-six percent functional correctness. Two errors, both caught downstream, neither catastrophic. Her exception is empirically supported, economically rational, and consistent with the practice of her peers.
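Vaughan's point that the absence of disaster is not evidence of safety can be put in numbers. The sketch below is a toy model with hypothetical figures (a one-in-a-thousand catastrophic failure mode, fifty deployments), not data from any real team: a record of fifty deployments without catastrophe is almost exactly what you would expect to observe even if the rare failure mode were present the whole time.

```python
# Toy model: how much does a clean deployment record tell us about a rare failure mode?
# Assumes each deployment triggers the catastrophic failure independently with
# probability p. The numbers are hypothetical, chosen only to illustrate the point.

def prob_all_clean(p: float, n: int) -> float:
    """Probability of observing zero catastrophic failures in n independent deployments."""
    return (1 - p) ** n

p_failure = 1 / 1000   # hypothetical: a 1-in-1000 latent catastrophic failure mode
n_deploys = 50         # the accumulated "evidence base"

print(f"P(no catastrophe in {n_deploys} deployments | p = {p_failure}) = "
      f"{prob_all_clean(p_failure, n_deploys):.3f}")
# ~0.951: a clean record of fifty is about 95% likely even when the failure mode exists.
```

Under these assumptions the clean record occurs roughly 95 percent of the time whether or not the latent failure mode is real, which is why the accumulated evidence feels compelling while saying almost nothing about the tail.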
It is also, in the aggregate, a millimeter of erosion in the standard that separates a system someone understands from a system that merely functions.
The exceptions in the AI transition can be cataloged with some specificity, and the catalog reveals both their diversity and their structural similarity.
The first category is the review exception. AI-generated output is produced faster than it can be reviewed at the depth the original standard requires. The mismatch between production speed and review speed creates a structural incentive to reduce review depth, because the alternative — reviewing at the original standard while the tool generates at the new speed — makes the review the bottleneck, and the bottleneck attracts institutional pressure to resolve itself. The resolution is almost always a reduction in review depth: matching review speed to production speed would require either hiring more reviewers, which the budget rarely accommodates, or sacrificing comprehension, which is reduced depth by another name. The reasonable exception: review the critical sections, trust the tool on the rest.
The second category is the comprehension exception. The tool produces output that the user can evaluate functionally — does it work? does it pass the tests? does it produce the expected result? — but cannot evaluate structurally, because the user lacks the expertise to understand why the output works or under what conditions it might not. Segal describes this directly in The Orange Pill when he recounts building a face detection component through conversation with Claude: he knew what the component needed to do, and the component did it, but the implementation details were beyond the expertise he could have brought to bear independently. The reasonable exception: evaluate the output by its behavior rather than by its logic, because evaluating the logic requires expertise the user does not possess.
This exception is particularly consequential because it is not a departure from an existing standard. It is the creation of a new category of practitioner — a person who directs the creation of technical artifacts without comprehending their internal logic — and the category carries with it a new standard that has no historical precedent for comparison. In the previous technological paradigm, the person who created the code was, by definition, the person who understood it. The creation and the comprehension were inseparable. AI has separated them, and the separation is the exception: the gap between having a working thing and understanding the working thing. The gap is reasonable — not everyone needs to understand every implementation detail — and it is also the space in which undetected failure modes accumulate.
The third category is the team-size exception. When AI tools multiply individual productivity, the economic logic of team composition shifts. If one engineer with AI assistance can produce the output of five engineers without it, the organization faces a decision about whether to maintain the team at five and capture the productivity as expanded capability or reduce the team to one and capture the productivity as cost savings. The reasonable exception to maintaining the original team size is the exception that has the most visible and immediate economic justification: the same output at one-fifth the cost.
But the original team size was not arbitrary. Five engineers produced not only five units of output but five perspectives on each unit — five opportunities for someone to notice something the others missed, five independent evaluations of each critical decision, five sources of the ambient knowledge that prevents accumulation of unnoticed errors. Reducing the team to one reduces the output to one perspective, one evaluation, one source of knowledge. The redundancy that the five-person team provided was not a line item in the budget. It was not measured, reported, or valued by any metric the organization tracks. Its elimination produces an immediate, measurable gain — cost reduction — and an invisible, unmeasurable loss: the reduction of the cognitive redundancy that catches the failures no individual, however competent, can catch alone.
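The value of that cognitive redundancy can be made concrete with an equally simple toy calculation. The per-reviewer miss rate below is hypothetical, and the independence assumption is generous, since real reviewers share training and blind spots; treat the result as a structural illustration, not a prediction.

```python
# Toy model of cognitive redundancy: the probability that a subtle defect
# evades every reviewer on the team. Assumes each reviewer independently
# misses the defect with probability p_miss (hypothetical value below).

def prob_undetected(p_miss: float, reviewers: int) -> float:
    """Probability that all reviewers miss the same defect."""
    return p_miss ** reviewers

p_miss = 0.30  # hypothetical: each reviewer misses a given subtle defect 30% of the time

for team in (5, 3, 1):
    print(f"{team} reviewer(s): defect escapes with probability "
          f"{prob_undetected(p_miss, team):.2%}")
# Five reviewers: ~0.24%. One reviewer: 30%. Cutting five to one multiplies
# the escape probability by more than a hundredfold under these assumptions.
```

The redundancy that never appeared as a line item was doing this multiplicative work the whole time, which is why its elimination shows up nowhere until the defect that one pair of eyes was always going to miss finally arrives.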
The fourth category is the expertise exception. AI tools enable practitioners to operate in domains adjacent to their training. Segal documents this with precision: the backend engineer who builds frontend features, the designer who writes code, the product leader who creates technical components. Each of these boundary crossings is enabled by the tool and justified by the output — the feature works, the code runs, the component functions. The reasonable exception: if the output is competent, the practitioner's lack of formal training in the domain is not a disqualifying liability.
But formal training in a domain provides not only the skill to produce competent output but the judgment to evaluate it — the capacity to recognize when something that appears to work is fragile, or when a design that satisfies the immediate requirement creates downstream dependencies that a domain expert would have anticipated and avoided. The practitioner operating outside her domain with AI assistance can produce the output. She may not be able to evaluate it with the depth that a domain expert would bring, because the evaluation requires the accumulated, embodied knowledge that domain expertise builds over years of failure and correction. The reasonable exception to requiring domain expertise is an exception to the standard of evaluation, and the erosion of evaluation standards is the central mechanism of normalized deviance.
What makes these exceptions dangerous is not any individual exception's risk. It is the way they compound. The manager who reduces the team size (exception three) has already inherited the reduced review depth (exception one) and the comprehension gap (exception two). The smaller team reviews less carefully the code it already understands less deeply. Each exception rests on the previous ones, and the accumulated weight of reasonable exceptions produces a system that is functioning normally — by the standards the system has taught itself to apply — while the gap between those standards and the standards that would actually protect the system widens with each cycle.
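The compounding can be sketched the same way. The retention factors below are invented purely for illustration; the claim is structural: when each exception multiplies what remains of the original detection capability, three individually tolerable accommodations produce a collectively intolerable result.

```python
# Toy model of compounding exceptions: each reasonable accommodation retains
# only a fraction of the remaining capability to catch a latent defect.
# All values are hypothetical and chosen only to show the multiplicative shape.

def residual_detection(base: float, retention_factors: list[float]) -> float:
    """Detection probability remaining after a chain of accommodations."""
    for factor in retention_factors:
        base *= factor
    return base

base_detection = 0.90          # hypothetical: the original process catches 90% of defects
exceptions = [0.6, 0.6, 0.6]   # review, comprehension, and team-size exceptions

print(f"Detection remaining after the chain: "
      f"{residual_detection(base_detection, exceptions):.1%}")
# 0.9 * 0.6^3 = 19.4%: three exceptions, each retaining a seemingly tolerable
# 60% of capability, leave a process that misses roughly four out of five of
# the defects the original standard would have caught.
```

No single factor in the chain looks alarming, which is exactly Vaughan's point: the danger lives in the product, and no one in the organization is measuring the product.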
Vaughan showed that the Challenger's destruction was not the product of one exception. It was the product of an organizational culture in which reasonable exceptions had accumulated to the point where the conditions of January 28, 1986, fell just inside the boundary of what the culture had learned to accept. No single exception, removed from the chain, would have prevented the disaster. The disaster was the product of the chain itself — of the sequential, compounding, mutually reinforcing drift that no single exception could cause and no single correction could reverse.
The AI transition is building that chain now, exception by exception, in every organization that has adopted these tools. Each exception is defensible. Each is supported by evidence. Each is consistent with the behavior of peers and the expectations of the institution. And the chain is lengthening, link by link, toward a destination that no individual exception was designed to reach but that the accumulated sequence is approaching with the steady inevitability that Vaughan's framework, applied honestly to the evidence, predicts.
In April 1994, two U.S. Army Black Hawk helicopters flying over the no-fly zone in northern Iraq were misidentified and shot down by American F-15 fighters. The incident killed twenty-six people. Scott Snook, the West Point organizational theorist who built directly on Vaughan's framework, spent years analyzing the shootdown, and his study produced a concept that extended her analysis into territory she had mapped but not fully explored: practical drift.
Practical drift, as Snook defined it, is the gradual divergence between how work is supposed to be done — the formal standard, documented in protocols, trained in classrooms, specified in regulations — and how work is actually done under the pressures, constraints, and improvisations of daily practice. The formal standard represents the organization's considered judgment about what safe and effective performance requires. The practical standard represents the organization's actual behavior, shaped by time pressure, resource constraints, accumulated experience, and the thousand small accommodations that practitioners make to reconcile formal requirements with operational reality.
The gap between the two is practical drift. And the gap is not produced by disobedience. It is produced by the ordinary operation of institutional life — by competent people adapting to their environment, solving the problems in front of them, making the work work. The adaptations are sensible. The accommodations are necessary. The improvisations are creative. And the accumulated distance between what the organization says it does and what it actually does becomes the space in which catastrophe finds room to operate.
Snook's contribution was to show that practical drift is not a failure of discipline. It is a structural feature of complex organizations. The formal standard is designed for a world that holds still. The practical standard evolves in a world that does not. The two diverge because the world changes faster than the protocols that govern it, and the practitioners who inhabit the gap between doctrine and reality adjust their behavior to the reality they experience rather than the doctrine they were taught. This is not negligence. It is competence — the competence of people who have learned that the map is not the territory and who navigate the territory with the judgment they have developed through experience.
The problem is that the judgment developed through experience is calibrated to normal conditions. When conditions exceed the normal range, the practical standard — which was adequate for the conditions the practitioners have experienced — may not be adequate for the conditions they now face. The formal standard, designed with a wider margin of safety, might have been. But the formal standard was abandoned, not in a single decision but through a thousand small adjustments, and its protections were lost before anyone noticed they were needed.
Applied to AI-augmented knowledge work, practical drift takes on characteristics that neither Vaughan nor Snook encountered in their original studies because, as noted earlier, the formal standards governing knowledge work are in many domains implicit rather than codified. There is no regulation specifying how carefully a lawyer must verify AI-generated citations, and no protocol dictating the depth at which an architect must understand an AI-produced design before approving it. The formal standards of knowledge work exist as professional norms — expectations transmitted through training, mentorship, and the unspoken culture of what constitutes competent practice in a given field.
When these implicit standards drift, the drift is doubly invisible. It is invisible in the way all practical drift is invisible — through the incremental, socially negotiated adjustments that characterize any adaptation to new tools. And it is invisible because the standard against which it should be measured was never formally specified. No written policy requires a software engineer to understand every line of code she deploys. There is only an expectation, rooted in the history of the profession, that understanding is intrinsic to building — that to write code is to know what the code does. The expectation is powerful. It shapes professional identity, hiring criteria, educational curricula, and the internal metric by which practitioners assess their own competence. But it is not a policy. It is not a regulation. It is not a specification against which drift can be measured.
AI has separated what was previously inseparable. Building and understanding were, for the entire history of software engineering, a single activity. The person who wrote the code understood it, because writing was understanding — the act of constructing the logic was the act of comprehending it. The debugging that followed was a deepening of that comprehension, each error encountered and resolved depositing another layer of understanding into the practitioner's evolving model of the system.
When AI writes the code, the coupling breaks. The code exists. The understanding does not — or exists only at the level of functional specification, the knowledge of what the code is supposed to do without the knowledge of how it does it or under what conditions the how might fail. This is the separation that Segal identifies repeatedly in The Orange Pill: the engineer who builds a component through conversation with Claude, who knows what the component does but not why it works, who can evaluate the output by its behavior but not by its logic.
The formal standard of the profession — unstated but deeply held — was that competence encompassed both what and how. The practical standard, emerging under the pressure of AI-augmented speed, is converging on a simpler criterion: does it work? The convergence is practical drift. The formal standard still exists in the culture of the profession, in the training programs that teach engineers to understand systems from the ground up, in the interview processes that test architectural reasoning rather than just output quality. But the practical standard — the standard that governs what people actually do when the deadline is pressing and the tool is producing competent output and the review of underlying logic would take four hours that the schedule does not contain — has migrated toward functional adequacy.
The migration follows the pattern Snook documented in the military context. The practitioners are not disobedient. They are adaptive. They are solving the problem in front of them — shipping the product, serving the client, meeting the deadline — with the tools available to them. The adaptation is competent, even creative. The gap it produces between formal expectations and practical behavior is a structural artifact of the adaptation, not a failure of the individuals performing it.
In the legal profession, the drift takes a specific form. The formal standard for legal research has traditionally included reading the cases one cites — not merely confirming that the citation exists and the holding is correctly stated, but understanding the reasoning, the context, the procedural posture, the ways in which the case might be distinguished by opposing counsel. This standard is rooted in the nature of legal argument: a case cited without understanding is a weapon that can be turned against the person who cited it, because the opposing attorney who has read the case fully will find the distinctions, the qualifications, the concurrences and dissents that reshape its meaning.
AI-assisted brief drafting produces citations efficiently and, for the most part, accurately. The cases are real. The holdings are correctly stated in most instances. The arguments are structured in a way that any judge would find competent. But the attorney who has relied on AI to assemble the brief may not have read the cases with the depth that the formal standard of legal practice requires — the depth that would reveal the vulnerabilities, the distinctions, the arguments-within-arguments that make legal reasoning more than citation management.
The practical standard — check that the citations are real, that the holdings are correctly stated, that the arguments are coherent — is narrower than the formal standard, which includes comprehension of the full case, anticipation of counter-arguments, and the kind of judgment that comes only from having inhabited the case's reasoning rather than summarized its result. The gap between the two standards is the space in which a malpractice claim, a lost case, or a judicial sanction can form — not because the attorney was negligent in any conventional sense, but because the practical standard she applied was adequate for the normal range of cases and inadequate for the case that required the deeper understanding the formal standard was designed to ensure.
In medicine, the drift is emerging in diagnostic imaging. The formal standard for a radiologist reviewing AI-flagged images includes independent assessment — the radiologist forms her own impression before consulting the AI's analysis, maintaining the cognitive independence that prevents anchoring bias. The practical standard, under the time pressure of a caseload that AI has enabled the department to expand, is increasingly to review the AI's analysis first and form an independent impression only when the AI flags an anomaly. The reversal is subtle but consequential: the radiologist's attention is now directed by the AI rather than independent of it, and the conditions under which she would catch an error the AI missed — the false negative, the subtle finding the algorithm was not trained to detect — have been narrowed to the conditions under which she happens to disagree with a tool she has learned, through months of accumulated experience, to trust.
The convergence of practical standards across domains toward the criterion of functional adequacy — does it work? does it pass? does it satisfy? — represents a specific kind of loss that Vaughan's framework illuminates and Snook's extension specifies. The loss is not of output quality. The outputs are competent. The loss is of the margin between competent performance and the edge of failure. The formal standards maintained that margin by requiring understanding, depth, and independent evaluation that exceeded what functional adequacy alone demanded. The practical standards, by converging on functional adequacy, have consumed the margin — not eliminated it in a single stroke but spent it, increment by increment, in the currency of reasonable accommodation.
The margin still exists in the documentation. It exists in the training materials, in the professional codes of conduct, in the rhetoric of quality that every organization deploys and most organizations sincerely intend. What it does not exist in is the practice — the daily, hourly, minute-by-minute reality of how work is done when the tool is fast and the output is competent and the margin between adequate and thorough is an invisible, unmeasured, institutionally unrewarded space that no one has time to occupy.
Snook's term for the outcome of practical drift was "practical incompetence produced by practical competence" — the paradox in which the adaptations that make daily work function smoothly are the same adaptations that, under stress conditions, leave the organization unable to perform the functions its formal standards were designed to ensure. The paradox applies with unsettling precision to the AI transition. The practitioners are competent. The work is excellent by the standards the practice has established. The standards the practice has established are narrower than the standards the profession was built on. And the narrowing is invisible, because the outputs look the same, because the clients are satisfied, because the tests pass, because the system functions — functions normally, under normal conditions, until the morning when the conditions are not normal and the margin that was spent on reasonable accommodation is the margin that was needed.
---
NASA did not pressure its engineers to accept risk. No memorandum was circulated instructing Thiokol's technical staff to relax their safety standards. No manager told Roger Boisjoly, the engineer who fought hardest against the January 28 launch, that his concerns were unwelcome. The organizational record — which Vaughan reconstructed from thousands of pages of documents, transcripts, and interviews — shows something more unsettling than a directive: it shows a culture in which the directive was unnecessary, because the pressure to proceed was embedded in the structure of the institution itself.
The shuttle program operated under a production schedule that had been established to justify the program's funding. The schedule called for a launch cadence that the program had never achieved and, given the technical demands of shuttle preparation, likely could not sustain. But the schedule existed. It was referenced in budget documents, in congressional testimony, in the institutional metrics by which the program's success was measured. Every day a shuttle sat on the pad, the schedule slipped further, and the slip was visible in the reports that traveled up the chain to the people who controlled the program's budget.
This created what Vaughan called an asymmetry of burden. The engineer who wished to proceed with a launch bore no special burden of proof. The evidence for proceeding was the accumulated record of successful flights, the engineering analyses that classified known anomalies as within acceptable limits, and the production schedule that rewarded forward motion. The engineer who wished to stop bore the full burden: she had to demonstrate, with quantitative evidence compelling enough to override the record of successful flights, that the specific conditions of this specific launch exceeded the limits the organization had established. The evidence against launching was, by its nature, harder to produce than the evidence for launching, because the evidence against was predictive — this launch, under these conditions, will produce a failure — while the evidence for was historical — previous launches, under similar conditions, did not produce a failure.
The asymmetry was not a policy. It was a feature of the institutional environment, as ambient and as invisible as the air in the room. The engineers breathed it. The managers breathed it. Everyone involved in the flight readiness review breathed it, and it shaped their judgments without announcing itself as a force.
The AI transition has reproduced this asymmetry with a fidelity that would be remarkable if it were not, in Vaughan's terms, entirely predictable. The production pressure in AI-augmented work is not imposed by a manager's directive or an organizational schedule. It is structural — embedded in the competitive environment, the tool's availability, and the internalized imperative that Segal, drawing on Byung-Chul Han's analysis, identifies as auto-exploitation.
When a development tool enables a twenty-fold productivity multiplier, the multiplier does not remain a private capability for long. The organization adjusts its expectations. The market adjusts its timelines. The competitor who has also adopted the tool ships at the new speed, and the organization that maintains the old pace falls behind by a margin that compounds with every sprint. The production pressure is not a person standing behind the engineer saying "go faster." It is the observed reality that the engineer who does not go faster is producing at a pace that the environment has already redefined as slow.
The asymmetry of burden follows directly. The engineer who wishes to proceed — to deploy AI-generated code after functional testing, to expand into a new domain using the tool's cross-disciplinary capability, to reduce review depth in light of the tool's demonstrated track record — bears no special burden of proof. The evidence for proceeding is the accumulated record of competent output, the passing test suite, and the competitive pressure that rewards speed. The engineer who wishes to stop — to conduct a comprehensive review, to verify comprehension of the output's logic, to maintain the team size that provides cognitive redundancy — bears the burden of justifying the delay. She must demonstrate that the specific risk of proceeding without the additional review exceeds the specific cost of the delay, and she must do so in an environment where the risk is speculative and the cost is measurable.
The burden is not symmetrical. It has never been symmetrical, in any production-oriented institutional environment Vaughan studied. The party that wishes to stop must produce evidence that the party that wishes to proceed is not required to rebut. The party that wishes to proceed needs only point to the record and the schedule. This asymmetry is not designed. It is not intended. It is a structural consequence of operating in an environment where proceeding produces visible, measurable outputs and stopping produces invisible, unmeasurable protections.
What distinguishes the AI-era production pressure from the production pressure Vaughan documented at NASA is the location of the pressure's source. At NASA, the pressure originated in the institutional environment — the launch schedule, the budget cycle, the political expectations that surrounded the program. The engineers could, at least in principle, identify the source and argue against it. They could point to the schedule and say, "The schedule is creating pressure that compromises safety." Whether this argument would have been effective is a separate question — Vaughan shows convincingly that it would not have been, given the institutional dynamics — but the argument was at least articulable. The source of the pressure was external to the practitioner.
In AI-augmented work, the source of the pressure has migrated inward. The developer who works through lunch, who fills the elevator ride with a prompt, who cannot stop building because the tool makes building possible and the internal imperative converts possibility into obligation — this developer cannot point to an external source of pressure, because the pressure originates in the same place as the motivation. Han's framework describes this as the achievement subject who exploits herself and calls it freedom. Vaughan's framework adds the institutional dimension: the self-imposed pressure is reinforced by the observed behavior of colleagues, by the competitive dynamics of the market, by the organizational metrics that reward output and do not measure the quality of the understanding behind it.
The combination — internal drive reinforced by institutional structure — produces a production pressure that is unusually resistant to the interventions Vaughan's framework prescribes. At NASA, the intervention would have been a restructuring of the burden of proof: a culture in which the party that wishes to proceed bears the same evidentiary burden as the party that wishes to stop. In aviation, this restructuring was achieved, partially, through decades of institutional reform that followed disasters and near-disasters. The crew resource management revolution was, at its core, a redistribution of the burden of proof: any crew member who saw a risk was empowered to stop the operation, and the burden shifted to those who wished to continue.
But when the production pressure is internal — when the person who needs to stop is also the person who wants to proceed, because the work is flowing and the output is good and the state that Csikszentmihalyi called flow is indistinguishable, from inside, from the compulsion that Han called auto-exploitation — the redistribution of burden has no one to redistribute to. The engineer cannot empower herself to stop herself. The institutional structure that would need to impose the pause is the same institutional structure that is being shaped by the practitioners who do not want to pause.
The Ye and Ranganathan study from Berkeley, which Segal examines in The Orange Pill, documents the behavioral signature of this dynamic. Workers did not report being pressured by managers. They reported being unable to stop — pulled forward by the tool's capability, by the satisfaction of producing at a pace they had never achieved, by the awareness that the task they were about to start could be finished in the time it used to take to prepare to start it. The production pressure was not external. It was the experience of capability itself, converted by the institutional environment and the internal imperative into a force that resembled enthusiasm and functioned as compulsion.
Vaughan's framework predicts the consequence: when the production pressure is strong enough and persistent enough, it reshapes the institutional culture until the culture treats the pressure as normal — not as a force to be resisted but as a feature of the work to be embraced. NASA's culture, by the time of Challenger, had absorbed the launch schedule so completely that the schedule was no longer experienced as pressure. It was experienced as reality — the speed at which the program operated, the pace at which decisions were made, the timeline within which evidence had to be evaluated. The culture did not resist the schedule. The culture was the schedule.
The AI transition is producing the same cultural absorption. The speed at which AI tools operate is becoming the speed at which organizations expect work to be done. The pace of production is becoming the pace of evaluation. The timeline within which code must be reviewed, designs must be assessed, analyses must be verified is compressing to fit the timeline within which the code was generated, the design was produced, the analysis was assembled. The compression is not a decision anyone made. It is the predictable consequence of a production environment in which the tool operates at a speed that outpaces the review structures designed for the previous speed, and the institutional pressure to match the tool's pace is structural, continuous, and felt by everyone without being imposed by anyone.
The culture of production pressure does not announce itself. It does not send a memorandum. It simply becomes the water the institution swims in — the ambient expectation that shapes every judgment, every allocation of attention, every assessment of what constitutes an adequate review. The engineer who conducts a thorough, time-consuming review of AI-generated code is not violating any policy. She is, however, operating at a pace that the culture has already moved past, and the friction between her pace and the culture's pace is a force that acts on her judgments in the same way the launch schedule acted on the judgments of the Thiokol engineers: not as a directive but as an environment, not as a command but as gravity.
---
There is a specific kind of knowledge that cannot be transmitted through documentation, training, or instruction. It is the knowledge that accumulates in a practitioner's nervous system through years of encounter with the material of her craft — the feeling for code that a senior engineer develops through thousands of hours of debugging, the instinct for structural weakness that an experienced architect develops through years of watching buildings respond to forces the blueprints did not anticipate, the diagnostic intuition that a seasoned physician develops through decades of listening to symptoms describe themselves through the bodies of patients who cannot fully articulate what is wrong.
This knowledge has no formal name in most organizational vocabularies, though it has been studied under various labels: tacit knowledge, in Michael Polanyi's formulation; embodied cognition, in the language of contemporary cognitive science; dead reckoning, in the term Vaughan adopted from navigation to describe the way air traffic controllers build and maintain a cognitive model of the airspace that goes beyond what the instruments display.
The defining characteristic of this knowledge is that it cannot be separated from the process that produces it. The debugging does not just fix the code. It deposits understanding. The encounter with structural failure does not just reveal the weakness. It calibrates the architect's intuition for the next design. The diagnostic interaction with a patient does not just produce a diagnosis. It enriches the physician's model of how disease presents, a model that is updated with each patient and that, over the course of a career, becomes a perceptual instrument more sensitive than any protocol or checklist.
Vaughan documented this form of knowledge extensively in Dead Reckoning, her study of air traffic control. Controllers, she found, maintained a dynamic cognitive model of the airspace — a three-dimensional, temporally evolving representation of where every aircraft was, where it was going, and where it would be in relation to every other aircraft at every point in the near future. This model was not derived from the instruments alone, though the instruments provided essential data. It was constructed from the instruments, from the controller's experience with similar traffic patterns, from the feel of the flow — the rhythm and pace of the traffic that an experienced controller could sense in the way a musician senses the tempo of a piece.
When automation entered the air traffic control system, something changed that the designers of the automation did not anticipate. The automation provided more data, more precisely, more reliably. It improved the informational basis of the controller's work. But it also reduced the occasions on which the controller needed to construct the cognitive model independently — the occasions on which the controller's own dead reckoning was the primary source of situational awareness. The reduction was gradual. Each increment of automation replaced a task that had previously required the controller's active cognitive engagement with a task that required only the controller's monitoring of the automated system's output.
The controllers noticed. They reported, in Vaughan's interviews, that the automation made their work easier in the ordinary sense — less physically taxing, less cognitively demanding under normal conditions — while simultaneously making it harder in a way they struggled to articulate. The difficulty was not in the tasks the automation performed. It was in the tasks the automation had taken away — tasks that had seemed, before the automation, like burdens, but that were, in retrospect, the occasions on which the controller's situational awareness was built and maintained. The manual calculation, the mental projection of traffic patterns that the automation now handled — these were not just work. They were the work that produced understanding. And without them, the understanding thinned.
The controllers described this thinning in experiential terms: they felt less connected to the traffic, less confident in their own assessment of the situation, more dependent on the automated display and less able to detect when the display was wrong. The automation was more reliable than the controller under normal conditions. But the controller's dead reckoning — the independent cognitive model that served as a check on the instruments — had atrophied through disuse, and the atrophy was invisible until the moment the instruments were wrong and the controller's independent assessment was needed and was no longer reliable.
The parallel to AI-augmented knowledge work is structural. In software engineering, the equivalent of the controller's dead reckoning is the developer's understanding of the codebase — the mental model of how the system works, where its vulnerabilities lie, how its components interact under various conditions, what will break when something changes and what will hold. This understanding was built, before AI, through the same process that built the controller's situational awareness: through direct engagement with the material, through debugging, through the frustrating, time-consuming encounter with error that forced the developer to understand not just what the code did but why it did it and what it would do under conditions the tests had not anticipated.
Claude Code removes the debugging. Or more precisely, it removes the developer's need to debug — the tool generates code that works, and when it does not work, the tool can often diagnose and correct its own errors faster than the developer could. Each removal is an efficiency gain. Each is also the removal of an occasion on which the developer's understanding of the system would have been built or maintained. The efficiency and the erosion are the same event, viewed from different vantage points.
The result is a form of knowledge loss that is structurally identical to what Vaughan documented in air traffic control: the practitioner operates in a system she understands less deeply than her predecessors did, not because she is less capable but because the occasions that would have built the understanding have been automated away. The system functions. The practitioner monitors the output. The monitoring is competent. But the independent cognitive model that would serve as a check on the output — the dead reckoning that would enable the practitioner to detect when the automated system is wrong — has not been built, because the process that builds it has been replaced by the tool that made it unnecessary.
Segal identifies this dynamic in The Orange Pill when he describes the geological metaphor: hours of debugging depositing thin layers of understanding that accumulate into something solid enough to stand on. The metaphor is precise. The layers are deposited by friction — by the resistance of the material to the practitioner's intention. When the friction is removed, the deposition stops. The surface may look the same, but the ground beneath it is hollow.
The hollowing is the normalized deviance. The original standard — unstated but deeply embedded in the professional identity of every knowledge worker — was that competence included understanding. To be a competent engineer was to understand the systems one built. To be a competent attorney was to understand the law one applied. To be a competent physician was to understand the diagnostic reasoning behind one's conclusions. Understanding was not an optional supplement to competence. It was constitutive of it. The two were inseparable because the process of achieving competence was the process of building understanding.
AI has made the separation possible. A practitioner can now produce competent output — code that works, briefs that persuade, diagnoses that are correct — without the understanding that the pre-AI standard considered integral to the competence itself. The output is indistinguishable. The surface holds. The understanding beneath it is either absent or attenuated, and the difference between the two states is invisible to any evaluation that measures the output rather than the process that produced it.
This invisibility is what makes the normalization so difficult to detect and so resistant to correction. No metric currently in widespread use measures the depth of a practitioner's understanding. Organizations measure output: code shipped, briefs filed, patients seen, features deployed. They do not measure comprehension: the practitioner's ability to explain why the code works, to anticipate how the brief might be attacked, to identify the conditions under which the diagnosis might be wrong. The metrics that are tracked show no degradation, because the outputs are competent. The metric that would reveal the degradation — the depth of understanding behind the output — is not tracked, because it was never necessary to track it when the process that produced the output was the same process that produced the understanding.
When Vaughan studied the air traffic controllers who struggled with the effects of automation, she found that the most experienced controllers — the ones who had built their dead reckoning through decades of pre-automation practice — were the most articulate about what was being lost. They could feel the atrophy. They could name it, if imprecisely. The newer controllers, trained in the automated environment, had no experiential basis for comparison. They did not feel a loss because they had never possessed the thing being lost. Their baseline was the automated environment, and within that environment, their performance was competent by the standards the environment established.
This generational gap is reproducing itself in every domain AI is transforming. The senior engineer who spent years building her understanding through manual debugging can feel the thinning. She knows, from embodied experience, what the layers of understanding feel like and what their absence means. The junior engineer, trained with AI assistance from the outset, has no comparable experience. Her baseline is the AI-augmented environment. Her standard of competence is the standard that environment has established. She cannot miss what she never had, and the organization cannot measure what it has never tracked.
The standard has drifted. Understanding has become optional — not by decree, not by policy, but by the accumulated weight of a production environment in which understanding is not required for the output to be competent and is not rewarded by the metrics that determine the practitioner's evaluation. The practitioner who takes the time to understand is not penalized. She is simply operating at a pace the environment has moved past, investing in a capacity the organization does not measure, building a form of knowledge that will be invisible until the morning it is the only thing that matters.
---
In May 2023, a lawyer in New York named Steven Schwartz filed a legal brief in a federal case. The brief cited six judicial decisions. The citations were formatted correctly. The case names were plausible. The holdings were stated with the confident specificity that a judge expects. The brief was, by every surface measure, competent legal work.
None of the cases existed. The lawyer had used ChatGPT to research the brief, and the model had generated fictitious citations — cases that sounded real, with holdings that supported the argument, but that had no existence in any court's records. The opposing counsel checked the citations, found nothing, and brought the fabrication to the court's attention. The judge sanctioned Schwartz, and the incident became a cautionary tale that traveled through the legal profession with the speed of genuine alarm.
The Schwartz incident is instructive not because it is typical but because it is the visible edge of a phenomenon that is usually invisible. The fabricated citations were caught because they did not exist — because the failure mode was detectable through a straightforward verification that anyone could perform. The more consequential gap between competence and comprehension produces failures that are not fabrications but distortions — outputs that are real, that cite real sources, that reach defensible conclusions, but that contain subtle mischaracterizations, overlooked qualifications, or structural weaknesses that only a practitioner with genuine comprehension of the underlying material would detect.
This is the gap that Vaughan's framework identifies as the breeding ground of normalized deviance applied to knowledge itself. The standard, embedded in every profession that depends on intellectual judgment, was that competence and comprehension were inseparable. To be competent was to comprehend — to understand the reasoning behind one's conclusions, the evidence supporting one's claims, the conditions under which one's analysis might fail. AI has made it possible to produce competent output — output that meets every functional criterion of professional adequacy — without the comprehension that competence traditionally required and that the profession historically assumed.
The gap between the two is not always visible. In many cases, it is not visible at all. Competent output and comprehended output look identical in a repository, in a filing, in a medical record, in a design document. The code that the developer understood and the code that the developer received from Claude Code and deployed after functional testing are indistinguishable to any observer examining the output. The brief that the attorney researched through weeks of case reading and the brief that the attorney assembled through AI-assisted drafting with selective verification are indistinguishable to any reader evaluating the argument. The diagnosis that the physician reached through independent clinical reasoning and the diagnosis that the physician confirmed after reviewing the AI's assessment are indistinguishable in the patient's chart.
The indistinguishability is the mechanism by which the gap normalizes. Because the outputs look the same, the processes that produce them are treated as equivalent. Because the processes are treated as equivalent, the standards governing them converge. Because the standards converge, the gap between competence and comprehension widens without producing any observable signal that the widening has occurred.
Vaughan documented a structurally identical phenomenon in the Challenger case. The O-ring erosion on early flights and the O-ring erosion on later flights looked the same in the engineering reports — measured in the same units, documented in the same format, assessed against the same (progressively expanded) criteria. But the relationship between the observation and the organizational response had changed fundamentally. On early flights, erosion prompted investigation, engineering analysis, and active assessment of whether the system's safety margin was adequate. On later flights, erosion prompted documentation — the recording of a known phenomenon within established parameters. The observation looked the same. The organizational processing of the observation had been hollowed out, and the hollowing was invisible because the documentation maintained its original form.
The AI-augmented knowledge worker's competence-comprehension gap follows the same hollowing pattern. The output maintains its original form — the code is structured correctly, the brief is argued coherently, the diagnosis is documented properly. The process behind the output has been hollowed out — the understanding that informed the structure, the case knowledge that grounded the argument, the clinical reasoning that supported the diagnosis has been attenuated or replaced by the AI's processing, which the practitioner has evaluated functionally rather than substantively.
The gap matters because comprehension is what enables diagnosis of failure. When a system breaks, the person who comprehends the system can identify where it broke, why it broke, and how to fix it. The person who merely operates the system — who can describe its intended behavior but cannot explain its mechanism — is dependent on the system's own diagnostic capabilities or on finding someone who does comprehend it. In an environment where the tool has comprehended on the practitioner's behalf, finding that someone becomes progressively harder, because the pool of practitioners with genuine comprehension shrinks as the tool's adoption widens.
This is the scenario that cybersecurity researchers like Rehberger have identified as the structural risk of AI-augmented systems. The system functions. The system has always functioned. The system functions so reliably that the human oversight designed to catch its failures has been normalized away. When the system fails — when the adversarial input arrives, when the edge case manifests, when the conditions exceed the range in which the system's reliability was established — the human who was supposed to catch the failure does not possess the comprehension to detect it, because the comprehension was never built or has atrophied through disuse.
The pattern extends into domains where the consequences of the gap are direct and physical. In healthcare, AI-assisted diagnostic tools achieve accuracy that, for many conditions, meets or exceeds that of individual clinicians. The functional competence of the output is established by validation studies, regulatory review, and the accumulated experience of deployment. But the clinician who relies on the tool's diagnostic assessment without forming an independent clinical impression — without engaging the reasoning process that the tool has replaced — is operating in the competence-comprehension gap. The output is competent. The clinician's understanding of why the diagnosis is correct, and therefore her ability to recognize when it might be wrong, is diminished by the extent to which the tool's competence has replaced her own diagnostic reasoning.
The formal standard in medicine is explicit: the physician is responsible for the diagnosis, regardless of the tools used to inform it. The practical standard, under the production pressure of a caseload that AI-assisted screening has expanded, is drifting toward a model in which the physician's role is to review the AI's assessment rather than to form an independent one — to check the output rather than to reproduce the reasoning. The drift is rational. The tool is accurate. The caseload is pressing. The independent assessment takes time the schedule does not contain. Each accommodation is reasonable. The accumulated effect is a medical system in which the comprehension behind the diagnosis is thinner than the formal standard assumes and thinner than the patient's safety may require under non-routine conditions.
The gap between competence and comprehension is, in Vaughan's terms, a normalized deviance of a specific and particularly dangerous kind: a deviance in the relationship between the practitioner and the knowledge that undergirds her practice. The standard that has drifted is not a procedural standard — not a checklist item or a process step that has been skipped — but an epistemic standard: the standard of knowing, of understanding, of comprehending the systems and reasoning that produce the outputs on which decisions depend.
Epistemic standards are harder to restore than procedural ones, because epistemic standards are maintained by practice rather than by policy. A checklist can be reinstated by a memo. Understanding can be reinstated only by the process that builds understanding, and that process — the slow, friction-rich, iteratively deepening engagement with the material — is precisely the process that AI tools have made unnecessary for the production of competent output.
The Schwartz case was caught because the failure was binary: the citations existed or they did not. The failures that the competence-comprehension gap produces under ordinary conditions are not binary. They are marginal — a brief that is persuasive but vulnerable to a counter-argument the attorney did not anticipate, code that functions but contains a dependency the developer did not understand, a diagnosis that is correct but rests on a reasoning chain the physician could not reproduce if the AI's assessment were challenged. These marginal failures accumulate in the same way O-ring erosion accumulated: each one, individually, is within the bounds of what the system can absorb; the question is what happens when the accumulated marginal failures interact, compound, and encounter the condition that exceeds the system's capacity to absorb them.
The answer, in Vaughan's framework, is that the system fails in proportion to the gap. And the gap, at this moment, is widening — not through negligence, not through incompetence, but through the ordinary operation of institutions adapting to tools that have made understanding optional for the production of output that, by every measurable standard, is competent enough.
---
The information that could have prevented the Challenger disaster existed within the organizations responsible for the launch. It was not hidden. It was not classified. It was not suppressed by a conspiracy of silence. It was available — documented in engineering reports, discussed in technical meetings, recorded in memoranda that traveled through institutional channels designed to surface precisely this kind of data. The information existed, and it did not reach the people who needed it, and the reason it did not reach them was not a failure of communication but a feature of organizational structure.
Vaughan named this phenomenon structural secrecy: the way the architecture of an organization — its divisions, its hierarchies, its reporting channels, its specialized vocabularies, its criteria for what constitutes relevant information at each level of decision-making — shapes what information survives the journey from the point of origin to the point of decision. The shaping is not deliberate. It is structural. It is produced by the ordinary operation of organizational life — by the fact that complex institutions divide labor, specialize knowledge, and filter information as it moves between units, because the alternative, routing all information to all decision-makers without filtration, would overwhelm the system's capacity to process it.
The filtering is necessary. The filtering is also the mechanism by which critical signals are lost. In the shuttle program, the engineers at Morton Thiokol who understood the O-ring data most deeply communicated their concerns through channels that translated technical specificity into managerial summary, quantitative nuance into categorical assessment, conditional probability into binary judgment. The translation was not malicious. It was the ordinary consequence of information moving between communities with different vocabularies, different standards of evidence, and different frameworks for interpreting data. By the time the O-ring data reached the flight readiness review, it had been processed through enough organizational layers that the conditional, uncertain, deeply worried assessment of the engineers had become a categorical classification: the erosion was within acceptable limits.
The engineers' worry — the inarticulate, experience-based sense that something was not right, that the data was trending in a direction the formal analysis could not capture — did not survive the translation. Not because anyone suppressed it, but because the organizational structure had no channel for transmitting it. The structure was built to transmit data, analyses, and classifications. It was not built to transmit the feeling that an experienced engineer develops through years of working with a material — the dead reckoning that says "the numbers are within limits, but the limits are wrong."
AI introduces a new dimension of structural secrecy that extends Vaughan's concept into territory her original analysis did not anticipate. The organizational structural secrecy that Vaughan documented operates between people — between the engineer who knows and the manager who decides, between the specialist who sees the anomaly and the generalist who assesses the risk. AI structural secrecy operates between the tool and the person who depends on it. It is not organizational but technological, and its implications for the normalization of deviance are, in important respects, more severe.
When a developer reviews code written by a human colleague, the review is an act of comprehension. The reviewer reads the logic, traces the reasoning, evaluates the design decisions. If something seems wrong — if the logic contains an assumption the reviewer does not share, or the design makes a trade-off the reviewer would not have made — the reviewer can interrogate the author. Why did you handle the error this way? What happens if the input exceeds this range? Have you considered the interaction with the authentication module? The author can explain. The reasoning is accessible, even if it requires effort to extract.
When a developer reviews code generated by Claude, the reasoning behind the code is not accessible. The model produced the output through a process that is, in a fundamental sense, opaque — not because the process is deliberately hidden but because the architecture of the model does not produce reasoning in a form that can be inspected, interrogated, or verified at the level of individual decisions. The developer can observe the output. She can test the output. She can evaluate whether the output does what it is supposed to do. What she cannot do is inspect why the model made the specific design choices it made, what alternatives it considered and rejected, what assumptions it embedded in the implementation, or what conditions might cause those assumptions to fail.
This opacity is not a flaw in the current generation of models that will be corrected in the next. It is a structural property of the technology. Large language models generate output through statistical processes operating over billions of parameters. The "reasoning" that produces the output is distributed across the model's architecture in a way that does not decompose into the sequential, inspectable chain of decisions that characterizes human reasoning about code. Interpretability research is active and important, but the current state of the art does not provide the kind of granular, decision-level transparency that would allow a reviewer to evaluate the model's reasoning with the same depth she would apply to a human colleague's code.
The consequence is a form of structural secrecy that is embedded in the tool rather than in the organization. The organization can be restructured. Reporting channels can be redesigned. Information flow can be improved. But the opacity of the model's reasoning is not an organizational artifact. It is a technological one, and no organizational reform can fully eliminate it, because the secrecy is a property of the medium through which the output is produced.
This technological structural secrecy interacts with the organizational normalized deviance in a way that compounds both. The developer who has normalized reduced review depth (Chapter 3) is reviewing an output whose reasoning she cannot inspect even when she reviews thoroughly (this chapter). The two limitations are multiplicative. The reduced review means she is less likely to detect surface-level anomalies. The opacity of the model means she cannot detect reasoning-level anomalies even if she looks. The combined effect is a system in which the human oversight layer — the layer that was designed to catch the failures the automated system cannot catch — has been weakened by normalization at the behavioral level and structurally limited at the technological level.
Vaughan's analysis of structural secrecy at NASA emphasized that the secrecy was not absolute. The information existed. It was accessible, in principle, to anyone who knew where to look and had the technical vocabulary to interpret it. The problem was that the organizational structure made it unlikely that the right person would look in the right place at the right time with the right interpretive framework. The secrecy was probabilistic, not deterministic — a function of organizational architecture that reduced the likelihood of critical information reaching critical decision points below the threshold of reliable detection.
AI structural secrecy has a different character. It is not that the information exists somewhere in the system but is unlikely to reach the right person. It is that certain kinds of information — the reasoning behind the model's decisions, the alternatives it considered, the assumptions it embedded — do not exist in an inspectable form at all. The developer cannot find the model's reasoning by looking harder, by restructuring the organization, by building better information channels. The reasoning, in the sense that a human reviewer would need to inspect it, is not there to be found.
This means that the defenses against structural secrecy that Vaughan's framework prescribes — improving information flow, restructuring reporting channels, creating cross-functional review processes — are necessary but not sufficient in the AI context. They address the organizational component of the problem, the component that is amenable to institutional reform. They do not address the technological component — the irreducible opacity of the model's processing — which requires a different kind of defense: not better information flow but better structural redundancy, not transparency of reasoning but independence of evaluation.
The independence of evaluation — the second pair of eyes that forms its own assessment rather than reviewing the first pair's work — becomes the critical defense when the reasoning behind the first assessment is not inspectable. If the model's reasoning cannot be evaluated, the output must be evaluated through independent means: through testing that exceeds functional verification, through domain experts who form their own assessment before seeing the model's output, through adversarial review processes that specifically seek the conditions under which the output might fail. These defenses are expensive. They are slow. They resist the production pressure that the tool itself generates. And they are the only defenses that address the specific character of AI structural secrecy — the opacity that no organizational reform can penetrate.
The structural secrecy of the AI system has a second consequence that extends Vaughan's framework into new territory: it makes the normalization of deviance harder to detect after the fact. In the Challenger investigation, Vaughan was able to reconstruct the decision chain because the decisions were documented — in meeting minutes, in memoranda, in the engineering reports that recorded each assessment of O-ring erosion. The reasoning was inspectable retrospectively, even though it had not been inspected adequately in real time. The investigation could trace the drift because the drift left tracks.
In an AI-augmented system, the drift leaves different tracks — or, in important respects, no tracks at all. The model's reasoning for a specific output is not recorded in a form that a post-incident investigation could reconstruct. The developer's review of that output — whether thorough or cursory, whether independent or anchored — is typically not documented with the granularity that would reveal its depth. The interaction between the two — the opacity of the model's reasoning and the depth of the human's review — exists in the moment of the interaction and then dissipates, leaving behind only the output, which is competent, and the deployment record, which shows that the output was accepted.
When the failure arrives — when the edge case manifests, when the adversarial input finds the vulnerability, when the conditions exceed the range in which the system's reliability was established — the investigation will find competent output, passing tests, and a deployment record that shows the process was followed. What it will not find, because the information does not exist in recoverable form, is the chain of reasoning that produced the output, the depth of the review that accepted it, or the specific point at which the gap between the system's competence and its comprehension became the gap through which the failure passed.
The structural secrecy of the AI system is not just a barrier to prevention. It is a barrier to learning. And an institution that cannot learn from its failures is an institution that is structurally condemned to repeat them.
---
Vaughan's framework does not predict specific catastrophes. It predicts structural conditions — conditions under which catastrophe becomes not certain but probable, not imminent but inevitable over a sufficiently long timeline given the continuation of the mechanism she documented. The distinction matters, because the temptation to predict specific failures is the temptation to make the argument falsifiable on the wrong terms. If the predicted failure does not materialize in the predicted form, the argument is dismissed, and the structural conditions that made the prediction rational continue to operate unaddressed.
What Vaughan demonstrated is that normalized deviance produces a specific relationship between apparent safety and actual vulnerability. The relationship is inverse: the longer the system operates without a failure, the more confident the institution becomes in its safety, and the more the safety mechanisms that protect against failure are eroded by the confidence that the mechanisms are no longer necessary. The apparent safety is itself the condition that produces the vulnerability, because the apparent safety is what justifies each incremental relaxation of the standards that maintain actual safety.
At NASA, twenty-four successful flights with O-ring erosion produced an institutional confidence that was precisely calibrated to the wrong quantity. The confidence was calibrated to the frequency of past failures — zero in twenty-four flights — rather than to the margin between past performance and the boundary of the system's capability. The margin had been narrowing with each flight, as the conditions under which the shuttle launched expanded to include temperatures, pressures, and configurations that moved progressively closer to the edge of the O-ring's performance envelope. But the margin was not measured, because the metrics that the institution tracked — successful flights, anomalies resolved, flight readiness reviews completed — did not capture it. The metrics captured performance. They did not capture the distance between performance and failure.
The AI transition exhibits the same inverse relationship between apparent safety and actual vulnerability. Every successful deployment of AI-generated code that was not comprehensively reviewed reinforces the confidence that comprehensive review is unnecessary. Every successful operation of a system whose components are understood functionally but not mechanistically reinforces the confidence that mechanistic understanding is a luxury rather than a necessity. Every quarter in which the reduced team produces the same output as the original team reinforces the confidence that the original team's cognitive redundancy was excess capacity rather than a safety margin.
The confidence is empirically grounded. The deployments were successful. The system operated correctly. The reduced team met its targets. But the confidence is calibrated to the wrong quantity — to the frequency of observed failures rather than to the margin between the system's current operating conditions and the conditions under which the accumulated normalizations would produce a failure the system could not absorb.
What would such a failure look like? Vaughan's methodology resists speculation, insisting instead on the structural analysis that makes the speculation unnecessary. The specific form of the catastrophe is less important than the structural conditions that enable it. But the conditions themselves can be described with precision, and from the conditions, the contours of potential failures become legible.
The first condition is the comprehension gap: a system in which no individual or team possesses complete understanding of how the system's components work, why they were designed as they were, and how they interact under conditions that exceed the normal operating range. This condition is produced by the normalization of the comprehension exception (Chapter 4) and the progressive separation of competence from understanding (Chapters 7 and 8). The gap exists now, in every organization where AI-generated components have been deployed without the mechanistic understanding that pre-AI development processes would have produced.
The second condition is the review deficit: a system in which the human oversight designed to catch failures the automated processes miss has been eroded to the point where it no longer performs its protective function. This condition is produced by the normalization of the review exception (Chapter 4) and the production pressure that makes thorough review a competitive liability (Chapter 6). The deficit exists now, measurably, in the declining depth and duration of code review, design critique, and quality assessment across AI-augmented organizations.
The third condition is the redundancy gap: a system in which the cognitive redundancy — the multiple independent perspectives that catch the failures any single perspective would miss — has been reduced below the threshold at which it provides reliable protection. This condition is produced by the team-size exception (Chapter 4) and the structural incentives that convert productivity gains into headcount reduction rather than expanded capability.
The fourth condition is the opacity barrier: a system in which the reasoning behind critical components is not inspectable, not because the information has been withheld but because the technology that produced the components does not generate reasoning in an inspectable form (Chapter 9). This condition is inherent in the architecture of the tools and is not addressable through organizational reform alone.
When these four conditions coexist — and they coexist now, in varying degrees, across the industries that have adopted AI tools — the structural prerequisites for a Vaughan-type failure are in place. The specific trigger could be a cybersecurity incident in which an adversary exploits a vulnerability in AI-generated code that was deployed without the depth of review that would have detected it, in a system whose architecture was not fully understood by the team operating it, with no independent verification layer to catch the exploitation before it propagated. It could be a medical event in which an AI-assisted diagnosis was subtly wrong in a way that the clinician who confirmed it could not detect because her diagnostic reasoning had been supplanted by the tool's assessment, in a department where the caseload expansion enabled by AI had reduced the time available for the independent evaluation that would have caught the error. It could be a financial incident in which AI-generated models contained correlated assumptions that, under stress conditions, amplified rather than diversified risk, in an organization where the quantitative analysts who would have detected the correlation had been reassigned or reduced as the AI tools made their traditional role appear redundant.
Each of these scenarios follows the same structural pattern: the system functions normally until the extraordinary condition arrives, and the accumulated normalization of deviance — the eroded review, the comprehension gap, the reduced redundancy, the opacity barrier — has hollowed out the protections that would have detected the condition and prevented the failure. The failure is proportional not to the trigger but to the accumulated deviance — to the distance between the standards the organization believes it is maintaining and the standards it is actually practicing.
Vaughan's framework offers no comfort in the observation that most flights did not end in disaster. Twenty-four successful flights were not evidence of safety. They were evidence that the failure conditions had not yet been encountered. The twenty-fifth flight encountered them. The observation that most AI-augmented operations succeed is, by the same logic, not evidence that the accumulated deviance is benign. It is evidence that the trigger event has not yet arrived.
The question is not whether the trigger will arrive. In a system of sufficient complexity, operating over a sufficient duration, the extraordinary condition will eventually materialize — this is not a prediction but a statistical property of complex systems, documented by Perrow, extended by Vaughan, and confirmed by every major institutional failure of the past half century. The question is whether the structures that would detect the condition and prevent the failure — the review processes, the comprehension requirements, the cognitive redundancy, the independent evaluation — will still be in place when the condition arrives.
The answer, at this moment, depends entirely on whether the normalization of deviance is recognized and corrected before the trigger event makes the recognition retrospectively obvious. Vaughan's entire body of work stands as testimony that retrospective recognition — the clarity that arrives after the catastrophe — is the most expensive form of institutional learning. The alternative is prospective recognition: seeing the drift while it is still operating, measuring the gap while it is still invisible to the metrics the organization tracks, and building the structures that maintain the margin between competent performance and the edge of failure.
The structures exist. They are described, in varying vocabularies and with varying degrees of specificity, in the aviation safety literature, in the medical quality literature, in the organizational resilience literature that Vaughan's work helped establish. They include mandatory reporting of near-misses, independent review processes, redundant evaluation structures, and cultural norms that treat the identification of drift as a contribution rather than an obstruction.
What they require is institutional will — the willingness to invest in protections whose value is invisible until the morning it is the only thing that matters. The willingness to slow down in an environment that rewards speed. The willingness to maintain review standards that production pressure is steadily eroding. The willingness to preserve cognitive redundancy that the budget would prefer to eliminate.
Vaughan's Challenger research ended with a finding that applies to this moment with painful directness: the institution had all the information it needed to prevent the disaster. The information was available. The expertise existed. The engineering judgment was sound. What was absent was the institutional structure that would have allowed the information, the expertise, and the judgment to converge at the point of decision with sufficient force to override the production pressure that was, flight by flight, eroding the margin between performance and catastrophe.
The AI transition is not the Challenger. The stakes are different, the domains are different, the technologies are different. But the mechanism is the same. Reasonable people, making reasonable decisions, under institutional pressures that reward proceeding and penalize stopping, in an environment where the tools produce competent output that conceals the erosion of the understanding and oversight that protect against the failure the tools cannot prevent.
The margin is narrowing. The question is what the institutions that depend on AI-augmented work will build to maintain it — and whether they will build it before or after the morning when the margin runs out.
---
The phrase that rewired the way I read my own company's dashboards was not about artificial intelligence. It was about rubber.
A quarter-inch gasket. An O-ring designed to maintain a seal that, by specification, was supposed to hold perfectly every time. It held imperfectly twenty-four times, and nobody grounded the shuttle, because each imperfection was slightly more familiar than the last. Vaughan's patient reconstruction of that sequence — not dramatic, not villainous, just twenty-four reasonable judgments stacked on top of each other until the stack was tall enough to kill seven people — is the most unsettling thing I have encountered in my months inside the Orange Pill literature.
It is unsettling because I recognized the stack.
Not in a space program. In my own teams. In my own workflows. In the precise, granular, daily way that review becomes scanning becomes glancing becomes trusting, and at no point does anyone experience the transition as a transition. You just wake up one Tuesday morning operating under standards you would not have recognized six months ago, and the dashboard is green, and every metric you track says the system is performing, and you have no instrument that measures the distance between performing and safe.
In The Orange Pill, I wrote about the exhilaration of watching my engineers in Trivandrum achieve a twenty-fold productivity multiplier. I meant every word. But Vaughan's framework forced me to ask a question I had been avoiding: what was inside the multiplier? Not just the speed. Not just the expanded capability. But the twenty small accommodations that made the speed possible — the review that became a scan, the comprehension that became an assumption, the second opinion that became a luxury the schedule could no longer afford. Each accommodation made the multiplier larger. Each accommodation also made the margin thinner.
I do not know how thin the margin is in my own organization. That sentence is the most honest and most uncomfortable thing I have written in this entire cycle of books. Vaughan's deepest finding is not that institutions fail. It is that institutions cannot see the mechanism of their own failure while it is operating, because the mechanism is made of the same material as competent daily practice. The reasonable exception is indistinguishable from good judgment until the morning it becomes the autopsy's central finding.
What Vaughan offers is not pessimism. It is something more useful and more demanding: a set of diagnostic instruments for detecting drift before it becomes disaster. Near-miss reporting. Independent review. Cognitive redundancy that costs money and slows the dashboard and is worth every dollar and every hour, not because it improves the metrics but because it maintains the margin the metrics do not measure.
Building those instruments is the work. Not once, not as a policy memo, but every day — the way she showed the best air traffic controllers maintained their dead reckoning even after the automation arrived: deliberately, against the current, as an act of professional discipline that the production environment would never reward and would always require.
I am a builder. I remain a builder. But I am a builder who now checks the O-rings.
Every organization adopting AI is running the same experiment NASA ran before Challenger — relaxing standards one reasonable decision at a time, under production pressure that rewards speed and penalizes the pause that safety requires. Diane Vaughan spent a decade proving that catastrophe does not require villains, incompetence, or broken technology. It requires only competent people making defensible choices inside institutions that have quietly redefined what "safe" means. This book applies her framework — normalized deviance, structural secrecy, practical drift — to the AI transition with forensic specificity, tracing the precise mechanism by which review becomes ritual, comprehension becomes optional, and the margin between performance and failure narrows until the trigger event no one predicted finds the gap everyone created.

A reading-companion catalog of the 13 Orange Pill Wiki entries linked from this book — the people, ideas, works, and events that Diane Vaughan — On AI uses as stepping stones for thinking through the AI revolution.