
The concept emerges from applying Williamson’s analysis of informational opportunism to the specific dynamics of AI deployment. Williamson showed that governance failures arise when the signal of quality (the surface of output) is systematically decoupled from the fact of quality (the soundness of what underlies it). Traditional monitoring mechanisms—code review, quality assurance, editorial oversight—were calibrated to surfaces that rough craftsmanship made legible: the poorly written code announced itself through its roughness; the weak analysis was visible in its disorganization. AI-generated output is uniformly smooth regardless of its underlying soundness. The monitoring mechanisms are therefore calibrated to the wrong signal.
The same dynamics appear in Onora O’Neill’s analysis of assessability: when the surface characteristics of output (confidence, fluency, professional polish) are systematically present regardless of whether the underlying content warrants them, the audience loses the information needed to make intelligent trust judgments. The AI governance deficit is, in O’Neill’s vocabulary, the institutional failure to replace the assessability that AI’s smooth surface has removed. Both frameworks identify the same structural problem from different analytical angles: evaluation infrastructure has not kept pace with production capability, and the gap between them is growing faster than most organizations can close it.
The deficit is visible in a range of familiar phenomena: the lawyer who submitted fabricated case citations because the brief read convincingly; the developer who deployed AI-generated code that passed code review but failed at scale because no reviewer understood what the code was actually doing; the analyst who published AI-assisted findings that sounded rigorous but rested on a statistical assumption nobody had checked. In each case the failure was not carelessness but a structural mismatch between the speed of AI-assisted production and the capacity of existing evaluation processes to govern it.
The deficit widens as capability increases. The better AI systems become at producing output that looks right, the harder it is to detect when something is wrong beneath the surface. Improving model capability without improving evaluative infrastructure therefore accelerates the production of convincing failures, not just the production of correct outputs. Organizations that celebrate AI productivity gains without building corresponding evaluation capacity are widening the deficit, not managing it.
Depth governance as the institutional response. Williamson’s framework points toward a specific institutional response: depth governance, the practice of evaluating not the surface of output but the quality of the judgment that produced it. Depth governance asks not “Does this code compile?” but “Does the developer understand what the code does and why?” Not “Is this analysis well-organized?” but “Did the analyst verify the AI’s statistical claims against the underlying data?” Depth governance is more expensive than surface governance because it requires evaluators with domain expertise rather than reviewers working through a checklist. But it is the only governance mechanism that addresses the specific hazard of smooth-surfaced AI output.
Accountability chains, not transparency frameworks. O’Neill’s parallel analysis insists that transparency is not a substitute for accountability. Publishing model documentation, releasing technical reports, deploying interpretability tools—these provide information without providing recourse. Closing the governance deficit requires building chains of accountability in which specific persons bear specific consequences for specific failures: the developer for systemic model properties, the deployer for contextual fitness, the user for evaluative judgment exercised at the point of reliance. Where no one bears consequences, the governance deficit persists regardless of how much information is available about the system.