Beginning in 2023 and accelerating through 2025, a growing body of empirical research has documented a specific pattern across multiple professional domains: practitioners who rely heavily on AI tools show elevated performance when the tools are available and degraded capability when they are removed. Endoscopists using AI for polyp detection showed a 6-percentage-point drop in adenoma detection rates when AI was withdrawn. Students with GPT-4 access performed better initially but worse than never-AI peers once access was removed. Carnegie Mellon researchers observed that AI-using knowledge workers ceded problem-solving expertise to systems while focusing on integration tasks. Each finding is exactly what the friction requirement predicts: removal of developmental conditions produces erosion of the capability those conditions build.
There is a parallel reading in which every technology transition looks like deskilling from the vantage point of the old skill envelope. The endoscopists losing polyp-detection acuity are simultaneously gaining AI-integration expertise, differential-diagnosis refinement when AI flags ambiguous cases, and higher-order pattern recognition across cases that low-level detection work previously obscured. The six-point drop measures one capability in isolation; it does not measure the bundle of capabilities the new practice regime builds. The question is not whether practitioners using AI lose capabilities the pre-AI workflow required — of course they do, exactly as accountants lost mental arithmetic when calculators arrived — but whether the capabilities they gain access a higher-value layer of work the old regime could not systematically develop.
The educational studies show something similar. Students using GPT-4 underperform never-AI peers on tests designed for the never-AI skill envelope, but that does not establish that the never-AI envelope is the right target. If GPT-4 access allows students to spend cognitive resources on synthesis, critique, and generalization rather than on symbol manipulation the AI handles fluently, then testing symbol manipulation measures regression from a local maximum while missing movement toward a different capability peak. The performance deficit is real; the claim that it represents net capability loss assumes the tests measure what matters going forward, which is precisely what the technology shift puts in question.
The Hosanagar research at Wharton on endoscopist deskilling has become the most-cited example because the domain is high-stakes, the measurement is precise, and the effect size is clinically significant. Adenoma detection rates fell from 28% to 22% when AI was removed, a six-point gap that, at screening-population scale, translates to thousands of missed cancers. The finding is particularly striking because endoscopy is a domain of continuous practice: the physicians were performing the procedure constantly. What they were not performing, while AI handled polyp detection, was the specific perceptual work of noticing polyps themselves. The perceptual capability atrophied inside a broader procedural capability that appeared intact.
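To make the population-scale claim concrete, here is a back-of-envelope sketch of the arithmetic. The two detection rates come from the finding above; the annual screening volume and the adenoma-to-cancer progression rate are illustrative assumptions, not figures from the study.

```python
# Back-of-envelope arithmetic for the population-scale claim above.
# Detection rates are from the cited finding; the screening volume and
# progression rate are illustrative assumptions, not study figures.

annual_screenings = 2_000_000   # assumed annual screening colonoscopies
adr_with_ai = 0.28              # adenoma detection rate, AI available
adr_without_ai = 0.22           # adenoma detection rate, AI withdrawn
progression_rate = 0.03         # assumed fraction of missed adenomas
                                # that progress to cancer

missed_adenoma_patients = annual_screenings * (adr_with_ai - adr_without_ai)
implied_missed_cancers = missed_adenoma_patients * progression_rate

print(f"Patients with undetected adenomas per year: {missed_adenoma_patients:,.0f}")
print(f"Implied missed cancers per year: {implied_missed_cancers:,.0f}")
# -> 120,000 undetected-adenoma patients and ~3,600 implied cancers:
#    "thousands" even at these modest assumed volumes.
```

The exact totals move with the assumed volume and progression rate, but the shape of the arithmetic does not: a six-point drop in a per-patient detection rate compounds linearly with screening volume.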
The educational evidence is equally clear. Studies of students using GPT-4 for mathematics and other subjects consistently find the same pattern: enhanced performance with AI, degraded performance without it compared to peers who had never used AI. The performance gain was real; the underlying capability deficit was real; the two coexisted and were invisible to both students and instructors until explicit testing under tool-free conditions revealed them.
The knowledge-work evidence from Carnegie Mellon and Microsoft Research in 2025 extended the pattern into white-collar professional work. The study documented that AI-using workers reported their tasks as cognitively easier while researchers observed them ceding problem-solving to the AI and focusing on what the paper called 'functional tasks like gathering and integrating responses.' The workers experienced empowerment; the researchers observed automation dependence. Both observations were accurate descriptions of different dimensions of the same phenomenon.
The pattern across these studies is precisely what Ericsson's framework predicted. When the conditions for deliberate practice are removed at the specific sites where they operated in traditional practice, the capability those conditions build stops being built. The output quality is preserved by the tool. The underlying capability deteriorates. The two can be distinguished only by testing under conditions the tool cannot mediate, and most institutional assessment methods are not currently designed to make this distinction.
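A minimal simulation makes the assessment point concrete. Everything here is invented for illustration: the latent skill levels, the tool's contribution, and the noise are assumptions, not estimates from any study. The point is structural: when the tool mediates the test, the heavy user looks strongest; when it cannot, the ordering reverses.

```python
# Minimal sketch of performance-capability decoupling under two
# assessment conditions. All numbers are invented for illustration.
import random

random.seed(0)

TOOL_BOOST = 0.15  # assumed performance contribution of the tool

def assess(latent_capability: float, tool_available: bool) -> float:
    """Observed score: latent capability, plus the tool's contribution
    when the tool is allowed to mediate the assessment."""
    noise = random.gauss(0, 0.01)
    return latent_capability + (TOOL_BOOST if tool_available else 0.0) + noise

# Hypothetical cohorts: heavy tool use has eroded latent capability.
cohorts = {"never-AI": 0.70, "heavy-AI": 0.62}

for name, capability in cohorts.items():
    uses_tool = name == "heavy-AI"
    print(f"{name:9s} tool-available: {assess(capability, uses_tool):.2f}  "
          f"tool-free: {assess(capability, False):.2f}")
# Tool-available testing ranks heavy-AI first (~0.77 vs ~0.70);
# tool-free testing reverses the ordering (~0.62 vs ~0.70).
```

The design choice that matters is the second assessment condition: without a tool-free arm, the deficit is invisible by construction, which is the gap in most current institutional assessment.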
The first wave of AI-deskilling studies in the ChatGPT era began appearing in 2023, accelerated through 2024, and had become a substantial literature by 2025. Earlier precedents include studies of GPS-dependent drivers losing navigational capability, calculator-dependent students losing arithmetic fluency, and autopilot-dependent pilots losing manual flight skills. These precedents are cited in the contemporary literature as evidence that the pattern is general across tool categories, not specific to generative AI.
Several features of this evidence base stand out:

- Convergent evidence across domains: medicine, education, knowledge work, and creative fields all show the same pattern with varying effect sizes.
- Performance-capability decoupling is measurable: standard assessment under tool-available conditions misses the deficit; tool-free assessment reveals it.
- Clinically significant effect sizes: the endoscopist data and similar findings translate to outcomes with real human consequences at population scale.
- Subjective experience misleads: workers consistently report empowerment while external measurement shows deskilling, a metacognitive failure the performance-learning distinction explains.
- Framework confirmation: the evidence validates predictions the deliberate practice framework generated from first principles, before the tools that would test them existed.
The deskilling evidence is definitive about what it measures: specific capabilities degrade when practice conditions that built them are removed. The endoscopist data is clinically precise, the educational studies are methodologically sound, and the knowledge-work observations document real phenomena. The question is what weight to assign this evidence in three different contexts.
For near-term safety-critical domains, the evidence deserves ~90% weight: we need practitioners who can function when systems fail, and the current training infrastructure is not maintaining that capability. For educational settings, the weighting splits by question: the evidence is ~80% dispositive that students are not building skills the tests measure, but only ~40% informative about whether those skills remain the right target, because curriculum has not yet adapted to clarify what competence means in tool-rich environments. For knowledge work, the evidence is ~70% confirmatory of capability loss in the old envelope and ~30% informative about net capability, because most studies have not yet measured the higher-order capabilities the contrarian view claims are under construction.
The synthesis both views require is a developmental account: some capabilities must stabilize before tool augmentation adds value (you cannot critique AI-generated code without code-reading fluency), while others become accessible only after tools remove lower-level cognitive load (you cannot do certain kinds of architectural thinking while manually managing memory). The deskilling evidence documents the first dynamic with precision. The open empirical question is how much of professional capability falls into each category, and existing studies have not been designed to answer it.