By Edo Segal
The muscle I trusted most was the one that was atrophying.
I discovered this not through philosophy or data or a late-night argument on a Princeton path. I discovered it through my hands. Through the specific sensation of sitting down to solve a problem I had solved a hundred times before — and finding the solution would not come.
It had been there six months ago. I was certain of that. I had debugged this class of error for years. The pattern was as familiar to me as my own handwriting. But somewhere between December and June, between the first time I opened Claude Code and the morning I sat staring at a terminal without it, the pattern had faded. Not completely. Not catastrophically. Just enough that I noticed the hesitation where confidence used to live.
I described this experience in *The Orange Pill* — my engineer in Trivandrum who lost architectural confidence and couldn't explain why, the developer culture that celebrates speed without measuring what speed costs. I described Byung-Chul Han's diagnosis of smoothness and the Berkeley study that documented AI intensifying work without deepening it. I built the best framework I could for understanding what was happening.
Then I found the scientist who had already mapped the territory.
Robert A. Bjork has spent four decades proving something that contradicts every instinct the technology industry runs on: the conditions that make learning feel effective are often the conditions that make it ineffective. Struggle builds durable knowledge. Ease builds the illusion of it. The feeling of mastery and the fact of mastery are not just different — they are frequently opposed.
This is not cultural criticism. It is not philosophy. It is replicated experimental evidence, tested across thousands of participants and dozens of domains, that describes with uncomfortable precision what happens to the human mind when difficulty is removed.
When I wrote that AI is an amplifier, I meant it carries whatever signal you feed it. Bjork's work specifies what happens to the signal itself — to the cognitive architecture that generates it. Feed the amplifier a mind shaped by productive struggle, and the output carries depth. Feed it a mind shaped by frictionless fluency, and the output carries the appearance of depth. The amplifier cannot tell the difference. Only the delayed test can.
This book is the delayed test. It asks what Bjork's science means for every builder, parent, teacher, and leader navigating the most powerful ease-producing technology in human history. The answers are uncomfortable. They are also the most practically useful thing I have encountered since taking the orange pill.
The muscle that matters most is the one that only grows under load.
— Edo Segal ^ Opus 4.6
1939–present
Robert A. Bjork (born 1939) is an American cognitive psychologist and Distinguished Research Professor at the University of California, Los Angeles, where he has conducted research on human learning and memory since 1972. Educated at the University of Minnesota and Stanford University (Ph.D., 1966), Bjork is best known for developing the concept of "desirable difficulties" — the counterintuitive finding, supported by decades of controlled experiments, that conditions introducing difficulty during learning (such as spacing practice, interleaving problem types, and delaying feedback) produce superior long-term retention and transfer, even as they reduce immediate performance. Together with Elizabeth Ligon Bjork, he formulated the New Theory of Disuse, which distinguishes between storage strength and retrieval strength in memory and explains why easily accessible information is often poorly encoded. His work on metacognitive illusions — the systematic tendency of learners to confuse fluent processing with genuine understanding — has influenced educational practice, professional training, and the design of learning technologies worldwide. A former editor of *Psychological Review* and *Psychonomic Bulletin & Review* and a recipient of numerous honors, including the APA Distinguished Scientific Contribution Award, Bjork remains one of the most cited researchers in the science of learning.
For seventy thousand years, Homo sapiens has been building tools to make things easier.
Fire made warmth easier. Agriculture made food easier. Writing made memory easier. Printing made distribution easier. Electricity made labor easier. Computing made calculation easier. Each tool was celebrated. Each tool expanded what a single human being could accomplish in a finite life. And each tool, at the moment of its arrival, quietly removed a difficulty that had been doing something important — something the users of the new tool could not see because the benefit was invisible and the loss was felt only later, if it was felt at all.
Robert A. Bjork has spent four decades studying a finding that cuts against the deepest assumptions of this trajectory. The finding is simple to state, difficult to internalize, and — in the age of artificial intelligence — more consequential than at any previous moment in the history of cognitive science.
The conditions that make learning feel effective are frequently the conditions that make learning ineffective. And the conditions that make learning feel ineffective are frequently the conditions that make it durable.
This is not a philosophical position. It is not a cultural critique. It is an empirical finding, replicated across hundreds of experiments, thousands of participants, and dozens of domains ranging from vocabulary acquisition to surgical training to motor skill development. The finding has survived every challenge the field of cognitive psychology has thrown at it for forty years. It is as close to a law of human cognition as the discipline possesses.
And it has implications for artificial intelligence that almost no one in the technology industry has reckoned with.
Consider a straightforward experimental design, the kind Bjork and his colleagues have run in various forms since the 1970s. Two groups of students study the same material. Group A studies the material in a single concentrated session — the way most people study, the way that feels natural, the way that produces the satisfying sense of mastery that comes from reading something three times in a row and feeling the information settle into place. Group B studies the same material in sessions distributed across several days, with gaps between sessions during which the material begins to fade from accessible memory.
During the study phase, Group A outperforms Group B on every measure. The massed-study students feel more confident. They recognize the material more easily. They report higher satisfaction with their learning. If a teacher walked into the room and administered a quiz at the end of the first session, the massed-study students would score higher. By every metric visible during learning, the easy path wins.
Then comes the delayed test. A week later. A month later. The same material. The same questions. And the results invert. Group B — the students who studied with difficulty, who experienced the discomfort of returning to material that had begun to slip away, who felt the friction of effortful retrieval — remembers significantly more. The learning that felt worse produced outcomes that were better. The learning that felt better produced outcomes that were worse.
The paradox is not subtle. It is a direct inversion of the relationship between feeling and fact. And it operates not at the margins but at the center of how human beings acquire durable knowledge.
Bjork calls the productive difficulties that enhance long-term learning "desirable difficulties." The term is deliberately provocative. Difficulty is not normally desirable. The entire arc of human tool-making has been directed toward its elimination. The word carries a built-in contradiction that forces the listener to stop and reconsider the assumption that easier is always better.
Four desirable difficulties have been identified and extensively documented: spacing practice over time rather than massing it, interleaving different types of problems rather than practicing one type at a time, varying the conditions under which practice occurs rather than keeping them constant, and reducing or delaying feedback rather than providing it immediately. Each of these interventions slows initial performance. Each makes the learner feel less competent during practice. Each produces measurably superior outcomes on tests of long-term retention and transfer to new situations.
The mechanism is consistent across all four: difficulty forces the brain to engage in deeper processing. When retrieval is easy — when the answer is right there, when the material was just reviewed, when the feedback arrives instantly — the brain processes the information shallowly. It recognizes the answer rather than reconstructing it. The encoding is thin. The memory trace is fragile. It will decay quickly and resist transfer to new contexts.
When retrieval is hard — when the material has begun to fade, when the problem type is unexpected, when the feedback is delayed long enough for the learner to generate their own assessment — the brain must work to reconstruct the information. This reconstructive effort is not wasted energy. It is the learning event itself. The difficulty is not an obstacle to understanding. It is the cognitive operation through which understanding is built.
This is the finding that the technology industry has never absorbed. It is the finding that should be on the wall of every AI company, every school district adopting AI tools, every organization deploying machine intelligence into workflows where human beings are expected to develop expertise. It is also the finding that the market has every incentive to ignore, because the market rewards tools that feel good, and desirable difficulties feel bad.
The relevance to artificial intelligence is immediate and specific. AI tools — large language models, code assistants, AI tutoring systems, automated research platforms — are, in Bjork's framework, difficulty eliminators of unprecedented power. They provide instant answers to any question, eliminating the spacing that forces effortful retrieval. They solve complete problems of a single type, eliminating the interleaving that forces discrimination between problem categories. They operate under consistent, optimized conditions, eliminating the variation that forces adaptive flexibility. They deliver immediate, confident, fluent responses, eliminating the delay that forces the learner to generate their own assessment before seeing the correct one.
Every desirable difficulty that four decades of research has identified as essential to durable learning is systematically removed by the default operation of AI tools.
The removal feels good. That is the problem. The developer who receives a working debugging solution from Claude in thirty seconds feels productive. The student who receives an articulate essay outline from ChatGPT in twenty seconds feels confident. The lawyer who receives a comprehensive case analysis in minutes feels competent. The feeling is not imaginary. It is a real metacognitive signal, and the brain interprets it exactly the way it interprets the feeling of mastery that massed practice produces: as evidence that learning has occurred.
The feeling is wrong. The signal is false. Ease of processing is the primary cue that the brain's monitoring system uses to judge learning, and that cue is systematically misleading. This is the metacognitive illusion at the center of Bjork's work, and AI has amplified it to civilizational scale.
Bjork's research draws a distinction that is foundational to everything that follows in this analysis — the distinction between performance and learning. Performance is what the learner can do right now, during training, under current conditions. Learning is the relatively permanent change in knowledge or behavior that supports long-term retention and the ability to transfer knowledge to new situations. The two are not merely different. They are frequently inversely related. The conditions that maximize current performance often minimize durable learning, and the conditions that maximize durable learning often suppress current performance.
This distinction has been demonstrated so consistently that it qualifies as one of the most robust findings in experimental psychology. And it is the distinction that the AI revolution has rendered invisible.
AI tools optimize for performance. They are designed to maximize what the user can produce right now, in this session, on this task. The metrics by which they are evaluated — speed, accuracy, output quality, user satisfaction — are all performance metrics. No AI tool is evaluated on what the user can do next month without the tool. No product review measures whether the user's independent capability has increased or decreased over the period of use. The entire feedback loop — from design to deployment to evaluation — operates on performance, and performance is the wrong measure.
This is not a minor oversight. It is a structural blindness built into the incentive architecture of the technology industry. The market rewards tools that improve performance. Users choose tools that make them feel productive. Organizations adopt tools that increase output. And the learning that would have occurred through difficulty, through the friction of effortful retrieval and the discomfort of delayed feedback — that learning goes unmeasured, unmourned, and eventually unremembered.
The relationship between Bjork's research program and Segal's description of the AI transition in *The Orange Pill* is precise. Segal argues that AI is an amplifier — it amplifies whatever signal the user feeds it, without filtering for quality. Bjork's research specifies what happens to the signal when the amplifier removes difficulty from the transmission. The signal gets louder. The output gets more polished. The performance gets more impressive. And the underlying learning — the storage strength, the retrieval architecture, the flexible expertise that would have been built through struggle — grows thinner with each frictionless cycle.
The paradox, then, is not merely academic. It is the paradox that every parent, every teacher, every leader, every individual navigating the AI transition must confront. The tools that make cognitive work easier may be making the humans who use them less capable. Not because the tools are malicious, and not because ease is always wrong, but because the human brain — the specific biological organ that processes information and builds expertise — requires difficulty the way a muscle requires resistance. Remove the resistance, and the muscle atrophies. The atrophy is invisible in the short term because the tool compensates. It becomes visible only when the tool is removed and the user discovers that the capability they thought they had was the tool's capability, not their own.
Bjork's own lab has studied a precursor to this problem. Research on how internet search affects subsequent retention — titled, with characteristic precision, "Answer First or Google First?" — investigated what happens when people search for information before attempting to retrieve it from memory. The finding was consistent with the desirable difficulties framework: searching first reduced subsequent retention compared to attempting retrieval first. The effortless access to the answer preempted the effortful retrieval that would have strengthened the memory.
Google was the preview. AI is the feature film. If instant search results reduced subsequent retention, instant AI-generated solutions — more complete, more confident, more fluent than any search result — represent a cognitive environment in which the conditions for durable learning have been not merely reduced but, for many tasks, eliminated entirely.
The question this analysis poses is not whether AI tools should exist. That question has been answered by the market, by adoption curves, by the measurable productivity gains that Segal documents in his account of the winter of 2025–2026. The question is what happens to the human beings who use them. Whether the unprecedented expansion of what a person can produce will be accompanied by a corresponding expansion of what a person can understand — or whether production and understanding will diverge, with production soaring and understanding quietly eroding beneath the surface of impressive output.
The paradox at the center of Bjork's work is that the answer to this question depends entirely on how the tools are used — and that the way human beings naturally prefer to use tools is precisely the way that produces the worst long-term outcomes. Left to their own metacognitive devices, people will choose the easy path every time. The easy path feels right. It feels productive. It feels like learning. And the feeling is a lie.
The science is clear. The evidence is replicated. The mechanism is understood. What remains is the harder question: whether a civilization that has built the most powerful ease-producing technology in its history will choose to preserve the difficulties that built its expertise.
That question is not answered in the laboratory. It is answered in the design choices of the people who build AI tools, the pedagogical choices of the people who teach with them, the institutional choices of the organizations that deploy them, and the personal choices of the individuals who use them every day.
The paradox demands a choice. The science says what the right choice is. Whether the choice will be made is another matter entirely.
---
A medical student in her second year of training sits down to study for an anatomy examination. She has two hundred structures to identify, from the brachial plexus to the branches of the external carotid artery. She has twelve days before the test. She has a choice — a choice that will determine not only her exam score but, in a way she cannot perceive from where she sits, the kind of physician she will become.
Option one: she studies all two hundred structures tonight, then reviews them tomorrow night, then again the night before the exam. Each session covers the full set. By the third session, the material feels familiar. She can recite the branches of the brachial plexus without hesitation. She feels ready.
Option two: she divides the structures into groups of twenty, studies one group each day, and returns to previously studied groups only after a gap of several days — a gap during which the material has begun to fade from accessible memory. Each return visit feels harder than it should. She finds herself struggling to recall structures she recognized easily two days ago. The experience is frustrating. She does not feel ready.
The research predicts the outcome with a consistency that approaches certainty. Option one will produce a higher score if the exam is given the next morning. Option two will produce a higher score if the exam is given a week later, a month later, a year later. Option two will also produce better transfer — the ability to apply anatomical knowledge in novel clinical situations that the student has never studied directly.
The medical student who chose difficulty becomes the physician who can think. The medical student who chose ease becomes the physician who must look things up. Both passed the exam. Only one built the architecture.
This is the architecture of struggle that Bjork's research has mapped across four decades. It is not a single finding but a system of interlocking mechanisms, each of which produces a specific cognitive benefit that cannot be obtained through ease. Four primary desirable difficulties have been identified, tested, and replicated extensively. Each one operates through a distinct mechanism. And each one is eliminated by the default operation of artificial intelligence tools.
The spacing effect is the oldest and most robust of the four. Its empirical history stretches back to Hermann Ebbinghaus in 1885, making it one of the longest-standing findings in experimental psychology. The basic phenomenon is straightforward: distributing practice across time produces better long-term retention than concentrating the same amount of practice into a single session.
The mechanism, as Bjork's framework articulates it, involves the relationship between forgetting and retrieval effort. When a learner returns to material after a gap — after some forgetting has occurred — the act of retrieving that partially forgotten information requires cognitive effort. This effort is not incidental to the learning. It is the learning. The brain, forced to reconstruct information that has begun to decay, encodes it more deeply on the second pass. The deeper encoding produces a more durable and more accessible memory trace. The forgetting was not a failure. It was the setup for a more powerful learning event.
AI tools eliminate spacing by making every piece of information instantly available at all times. The developer who encounters a problem does not need to retrieve the solution from memory — Claude provides it in seconds. The student who forgets a concept does not need to struggle to reconstruct it — ChatGPT restates it fluently and completely. The gap that would have produced forgetting, and therefore the effortful retrieval that would have produced deep encoding, is filled before it opens. The information remains accessible. But accessible is not the same as learned, and the difference between the two is the difference between a capability that depends on the tool and a capability that belongs to the person.
Interleaving is the second desirable difficulty, and its mechanism is distinct from spacing. In interleaved practice, the learner encounters different types of problems mixed together rather than grouped by type. A mathematics student practicing interleaved problems might encounter a probability question, then a geometry question, then an algebra question, then another probability question. This mixing feels inefficient. The student must constantly shift mental gears, determine which type of problem she is facing, and select the appropriate strategy. Performance during practice suffers compared to blocked practice, where all the probability problems are done first, then all the geometry problems, then all the algebra.
The benefit appears on the delayed test. The interleaved student has learned something that the blocked student has not: discrimination. The ability to identify which type of problem she is facing. In blocked practice, the problem type is given by context — all the problems in this block are probability problems, so use the probability strategy. In interleaved practice, the problem type must be determined by the student before the strategy can be selected. This additional cognitive step — the step of categorization — forces a deeper engagement with the structural features of each problem type. The student learns not just how to solve each type but when to apply each approach.
AI tools eliminate interleaving by providing complete, type-specific solutions. When a developer asks Claude to debug a specific error, Claude delivers a solution optimized for that specific error type. The developer does not need to determine whether the problem is a type error, a logic error, a concurrency issue, or an architectural flaw. The categorization step — the step that builds the diagnostic expertise through which a senior engineer can look at a codebase and feel that something is wrong before articulating what — is performed by the tool. The developer receives the solution. The discrimination that would have built diagnostic judgment is bypassed.
The third desirable difficulty is variation in practice conditions. When a learner practices under varied conditions — different environments, different equipment, different presentations of the same material — the resulting knowledge is more flexible and more transferable than knowledge acquired under constant conditions. A basketball player who practices free throws from slightly different positions on the court, or with slightly different balls, develops a more robust motor program than one who practices from the identical position with the identical ball every time. The constant-condition player is more accurate during practice. The variable-condition player is more accurate in the game — when conditions are never identical to practice.
The mechanism is encoding variability: practice under varied conditions produces multiple retrieval routes to the same knowledge, making it accessible from a wider range of starting points. This variability is what makes knowledge flexible — available not just in the specific context where it was acquired but in novel contexts that share deep structural features with the original.
AI tools operate under highly consistent conditions. The interface is standardized. The response format is uniform. The interaction pattern — prompt, receive, evaluate — is the same regardless of the content. The developer who learns to use Claude for debugging learns to debug through Claude, in Claude's interface, with Claude's response patterns. The knowledge is encoded in a context that is narrow and specific. Transfer to a situation without Claude — a production environment at two in the morning when the API is down, a whiteboard interview, a conversation with a colleague about system architecture — is impaired precisely because the conditions under which the knowledge was "learned" were too consistent. The variation that would have produced flexible, transferable understanding was optimized away.
The fourth desirable difficulty is the reduction or delay of feedback. This is perhaps the most counterintuitive of the four, because feedback is almost universally regarded as beneficial for learning. And feedback is beneficial — but the timing matters. Immediate feedback, delivered before the learner has had the opportunity to evaluate their own response, short-circuits a critical cognitive process: the self-assessment that produces metacognitive calibration. When feedback is delayed, the learner must sit with uncertainty. They must ask themselves: Was I right? Am I confident? Where might I be wrong? This self-interrogation is itself a learning event. It builds the metacognitive accuracy — the ability to know what you know and know what you don't — that is essential for self-directed learning and expert judgment.
AI tools provide the most immediate, most confident, most fluent feedback in the history of human tool use. Ask a question, receive an answer. In seconds. With no uncertainty. With no gap in which the learner might assess their own understanding before being told the correct response. The feedback loop is so tight that the self-assessment step is not merely shortened but eliminated. The learner never has the experience of generating an answer and sitting with the uncertainty of whether it is right. The uncertainty — the productive discomfort that builds metacognitive calibration — is resolved instantly by a machine that does not hesitate.
What emerges from this analysis is not a list of isolated findings but a system. The four desirable difficulties are not independent interventions that happen to share a name. They are components of a cognitive architecture — a structure of productive struggle through which the human brain builds durable, flexible, transferable expertise. Spacing forces effortful retrieval. Interleaving forces discrimination. Variation forces flexible encoding. Delayed feedback forces self-assessment. Together, they constitute the conditions under which human beings learn most deeply.
AI tools, as currently designed, collapse this architecture. Not partially. Comprehensively. They eliminate spacing by providing instant access. They eliminate interleaving by delivering type-specific solutions. They eliminate variation by operating under standardized conditions. They eliminate delayed feedback by responding before the learner can evaluate their own understanding.
What remains is an architecture of ease. Smooth. Efficient. Productive. And systematically hostile to the cognitive processes through which expertise develops.
Segal describes, in *The Orange Pill*, the experience of watching his engineers in Trivandrum achieve a twenty-fold productivity multiplier with Claude Code. The output was extraordinary. The capability expansion was real. But Segal also describes an engineer who, months later, realized she was making architectural decisions with less confidence than she used to — and could not explain why. Bjork's framework explains why. The four hours of daily "plumbing" work that Claude eliminated had contained, embedded within the tedium, the desirable difficulties through which her architectural intuition was built. The spacing of returning to configuration problems after gaps. The interleaving of different system types in the same workday. The variation of different deployment environments. The delayed feedback of discovering, hours later, that a configuration choice had downstream consequences she had not anticipated. The difficulties were invisible because they were woven into the fabric of work. When the fabric was removed, the difficulties went with it.
This does not mean the productivity gains were illusory. They were real. The question is whether the gains came at a cost that was not measured because the cost was invisible in the short term — visible only on the delayed test that Bjork's research says is the only honest measure of what has actually been learned.
The architecture of struggle is not an argument for suffering. It is not a claim that difficulty is good in itself. Bjork is careful to distinguish desirable difficulties from difficulties that are merely difficult — obstacles that impede learning without producing the deeper processing that makes difficulty productive. A poorly written textbook is difficult. It is not desirably difficult. A confusing interface is difficult. It does not build expertise. The question is always whether the difficulty engages the specific cognitive processes — effortful retrieval, discrimination, flexible encoding, self-assessment — that produce durable learning. When it does, the difficulty is desirable. When it does not, it is merely frustrating.
AI eliminates both kinds of difficulty indiscriminately. It removes the frustrating friction of badly designed tools, and that removal is genuinely beneficial. But it also removes the productive friction of effortful cognition, and that removal carries a cost that the performance metrics by which AI tools are evaluated will never capture.
The architecture must be understood before it can be preserved. And understanding it requires accepting a proposition that the entire trajectory of human tool-making has conditioned us to reject: that some of the difficulty we have been trying to eliminate is difficulty we need.
---
In the early 1990s, Asher Koriat published a series of experiments at the University of Haifa that illuminated something deeply unsettling about the human mind: it does not know what it knows. More precisely, the system that monitors learning — the metacognitive apparatus that generates the feeling of understanding, the subjective sense that information has been successfully encoded — relies on cues that are systematically unreliable.
The primary cue is fluency. When information is processed easily — when it flows, when recognition is immediate, when the answer comes to mind without effort — the metacognitive monitor registers a high confidence signal. The signal says: this has been learned. The signal is often wrong.
Robert A. Bjork, building on Koriat's findings and his own extensive research program, developed a framework for understanding these metacognitive illusions that has become central to the science of learning. The framework is simple in structure and devastating in its implications: people judge their learning primarily on the basis of how the learning experience feels, and the relationship between how learning feels and how much has actually been learned is, across a wide range of conditions, near zero — or negative.
The student who rereads her notes three times feels prepared because the material is familiar. Familiarity is a fluency signal. The metacognitive monitor interprets familiarity as learning: I recognize this, therefore I know this. But recognition and recall are different cognitive operations, and the ease of recognition during study is a poor predictor of the difficulty of recall on the test. The student has experienced what Bjork and colleagues call a "stability bias" — the tendency to overestimate the durability of current knowledge. What feels accessible now feels like it will be accessible later. It will not be. The accessibility was an artifact of recency, not encoding depth.
This metacognitive architecture — the system that generates feelings of knowing and judgments of learning — is not a design flaw that natural selection failed to correct. It is a monitoring system that evolved in an environment where the fluency of processing was, in fact, a reasonable indicator of familiarity and therefore of safety. A face that is processed fluently is a face you have seen before — probably a member of your group rather than a stranger. A path that looks familiar is a path you have traveled — probably safe rather than unknown. The heuristic worked in the ancestral environment because the ancestral environment did not contain tools that could make unfamiliar information feel fluent.
AI tools make unfamiliar information feel fluent.
This is the metacognitive catastrophe at the center of the AI transition, and it operates below the threshold of conscious awareness. When a developer receives a debugging solution from Claude, the solution arrives in clean, well-structured code. The developer reads the code. The code makes sense. The processing is fluent — the solution is well-formatted, logically organized, syntactically correct. The metacognitive monitor registers the fluency and generates a judgment: I understand this.
The judgment may be false. Understanding a presented solution and being able to generate that solution independently are different cognitive states, mediated by different memory processes. Reading well-structured code and being able to write well-structured code are as different as recognizing a face in a crowd and drawing that face from memory. The feeling is the same — comprehension, clarity, the subjective sense of getting it. The underlying cognitive reality may be profoundly different.
Bjork's research demonstrates that this illusion is not merely common. It is the default. In experiment after experiment, participants who study under easy conditions — massed practice, blocked problems, immediate feedback — report higher confidence in their learning than participants who study under difficult conditions. And in experiment after experiment, the high-confidence group performs worse on delayed tests than the low-confidence group. The relationship between confidence and competence is not merely weak. It is inverted. The people who feel most prepared are, on average, least prepared. The people who feel least prepared have engaged in the deeper processing that produces durable understanding.
Segal's account of his own experience writing *The Orange Pill* with Claude provides a striking illustration. He describes a moment when Claude produced a passage connecting Csikszentmihalyi's flow state to a concept attributed to Gilles Deleuze. The passage was elegant. The connection was compelling. The prose was fluent. Segal read it twice, liked it, and moved on. The next morning, something nagged. He checked. The philosophical reference was wrong — wrong in a way that anyone who had actually read Deleuze would have caught immediately, but that the fluency of the passage had concealed.
This is the metacognitive illusion in its most dangerous form: confident wrongness dressed in good prose. The fluency of the output — its grammatical perfection, its logical structure, its rhetorical polish — generated a feeling of rightness that preempted the critical evaluation that would have caught the error. Segal's metacognitive monitor did what metacognitive monitors do: it interpreted ease of processing as evidence of truth. The ease was an artifact of Claude's language model, not of the accuracy of the claim. And the gap between how right it felt and how wrong it was is the gap that Bjork's research has been measuring for four decades.
The danger scales. A single reader catching a single philosophical error is a minor episode. A generation of professionals developing their skills through AI-assisted work, building their confidence on the fluency of AI-generated output, is a systemic metacognitive failure. Each interaction reinforces the illusion. Each fluent answer strengthens the association between ease and understanding. Each time the developer accepts Claude's solution without generating their own attempt first, the metacognitive monitor receives another data point confirming that fluent reception is equivalent to genuine comprehension.
The accumulation of these data points produces what might be called metacognitive drift — a gradual, imperceptible shift in the individual's calibration between confidence and competence. The developer who has used AI assistance for six months is not merely dependent on the tool for solutions. She has become a worse judge of what she knows and does not know. Her metacognitive monitoring system has been trained, through thousands of fluency-rich interactions, to overestimate her independent capability. She feels like an expert. The feeling is an artifact of the tool's fluency, not her own understanding.
Kornell and Bjork, in a 2009 paper, documented what they called a "stability bias" — the tendency for learners to assume that their current level of accessibility to information will persist over time. If I can recall it now, I will be able to recall it later. If the answer feels available now, it will feel available tomorrow. This bias is particularly insidious in the AI context because the tool makes every piece of information perpetually accessible. The developer who can always ask Claude never experiences the forgetting that would reveal the fragility of her independent knowledge. The lawyer who can always query an AI research tool never discovers that the case law she felt confident about was confidence she had borrowed from the tool's retrieval system, not confidence she had earned through her own encoding.
The stability bias means that the loss of independent capability is invisible to the person losing it. The metacognitive monitor reports everything is fine — the information is available, the answers are correct, the work is good — because the monitor cannot distinguish between information that is available through the tool and information that is available through the person. From the inside, borrowed competence feels identical to owned competence.
This is the mechanism through which AI produces what Bjork's framework would describe as the largest metacognitive illusion in human history. Not because the tools are wrong — they are often right. Not because the users are careless — they are often diligent. But because the brain's monitoring system evolved to use fluency as a proxy for understanding, and AI tools produce fluency at a scale and consistency that no previous technology has approached. The proxy has been compromised. The signal has been hacked. And the people whose cognition is being shaped by this compromised signal cannot detect the compromise from the inside.
There is a test. Bjork's framework implies it directly, and a few researchers have begun to formalize it: the dependency audit. Remove the tool. Administer a performance assessment under conditions where AI assistance is unavailable. Measure what the person can do independently. Compare it to what they could do before they began using the tool, or to what a matched peer who did not use the tool can do.
This test is almost never administered. Organizations do not want to know the answer. Individuals do not want to know the answer. The AI industry has no incentive to know the answer. The test would reveal, with uncomfortable specificity, the gap between the performance that the tool enables and the learning that the person has achieved. And that gap, in Bjork's framework, is the measure of the metacognitive illusion's depth.
The dependency audit is uncomfortable because it forces a distinction that the entire AI productivity narrative has worked to dissolve: the distinction between what you can produce and what you understand. In the performance paradigm — the paradigm that governs the technology industry, the modern workplace, and increasingly the educational system — what you can produce is what matters. Output is the measure. The dependency audit inserts a different measure: what you retain, what you can transfer, what you can do when the tool is gone.
The measures are different. The people who score highest on one often score lowest on the other. And the metacognitive illusion ensures that the people who most need the audit are the people who least believe they need it — because they feel competent, and the feeling is the lie.
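Reduced to code, the audit is a short comparison that almost no one performs. A minimal sketch follows; the field names and the 0–100 scoring scale are hypothetical, not an instrument from Bjork's lab:

```python
from dataclasses import dataclass

@dataclass
class Audit:
    with_tool: float     # score on a task battery with AI assistance available
    without_tool: float  # the same battery with the tool removed
    baseline: float      # pre-adoption score, or a matched non-user peer

def audit_report(a: Audit) -> dict:
    return {
        # what the tool contributes right now: the performance dimension
        "tool_contribution": a.with_tool - a.without_tool,
        # what has happened to independent capability: the learning dimension
        "capability_drift": a.without_tool - a.baseline,
    }

# A developer who scores 92 with the tool, 58 without it, and scored 71
# before adopting it: soaring output, eroding independent capability.
print(audit_report(Audit(92, 58, 71)))
# {'tool_contribution': 34, 'capability_drift': -13}
```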
Bjork's work suggests that the illusion can be partially corrected through metacognitive training — the deliberate practice of evaluating one's own learning, of testing oneself before checking the answer, of tracking the gap between predicted and actual performance. This training is effortful, uncomfortable, and unglamorous. It requires the learner to confront the discrepancy between how much they feel they know and how much they actually know. Most learners resist this confrontation. It is, in Bjork's terminology, a desirable difficulty of its own — a difficulty of self-knowledge that produces more accurate self-monitoring and therefore better learning decisions.
AI tools could, in principle, support this metacognitive training. A system designed to ask the user, "What do you think the answer is?" before providing its own answer would preserve the generation effort and the self-assessment that immediate fluent responses eliminate. A system that tracked the user's independent accuracy over time and reported it back — "You predicted you would get 90% correct without assistance; your actual independent accuracy was 62%" — would provide the calibration data that metacognitive accuracy requires.
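A sketch of what those two features could look like, assuming a generic `call_model` function and some grading procedure; neither is an existing Claude or ChatGPT capability:

```python
# (predicted confidence, graded accuracy) pairs, both as percentages
calibration_log: list[tuple[float, float]] = []

def generate_first(question: str, call_model, grade) -> str:
    attempt = input(f"{question}\nWhat do you think the answer is? ")
    predicted = float(input("How confident are you, 0-100? "))
    answer = call_model(question)            # the model answers only now
    actual = 100.0 * grade(attempt, answer)  # correctness graded 0.0-1.0
    calibration_log.append((predicted, actual))
    return answer

def calibration_report() -> str:
    """The report described above, e.g.: 'You predicted 90%;
    your actual independent accuracy was 62%.'"""
    if not calibration_log:
        return "No interactions logged yet."
    predicted = sum(p for p, _ in calibration_log) / len(calibration_log)
    actual = sum(a for _, a in calibration_log) / len(calibration_log)
    return (f"You predicted {predicted:.0f}% on average; "
            f"your actual independent accuracy was {actual:.0f}%.")
```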
These features are technically trivial. They are commercially disastrous. The tool that forces you to struggle before it helps you is the tool that loses to the tool that helps you immediately. The market selects for fluency. And fluency, as four decades of research have demonstrated, is the feeling that learning is happening — not the evidence that it has.
The irony at the center of the metacognitive research is this: the people who most need to hear that their confidence is unfounded are the people whose confidence makes them least likely to listen. The illusion is self-sealing. It generates the very confidence that prevents its detection. And AI, by producing fluency at unprecedented scale and consistency, has made the seal airtight.
Breaking it requires something that no tool can provide and no market will reward: the willingness to discover that you know less than you think you do. In Bjork's laboratory, this willingness is built through experimental evidence — the controlled demonstration that easy conditions produce overconfidence and difficult conditions produce understanding. Outside the laboratory, building this willingness is the work of education, of institutional design, of organizational culture, and of personal courage.
The feeling of knowing is not knowledge. The confidence of fluency is not competence. And the gap between the two is growing wider with every frictionless AI interaction, invisible to the people inside it, measurable only by those willing to administer the test that no one wants to take.
---
In 1978, Norman Slamecka and Peter Graf published an experiment that should have changed how every teacher teaches, every trainer trains, and every student studies. It did not, because its implication was too uncomfortable and too counterintuitive for institutions that had organized themselves around the opposite assumption.
The experiment was simple. Participants were given pairs of words. In one condition, both words were presented: the pair RAPID — FAST appeared on the screen, and the participant read it. In the other condition, the first word was presented with a fragment of the second: RAPID — F_S_. The participant had to generate the missing word.
On a subsequent memory test, the generated words were remembered significantly better than the read words. The difference was large, robust, and replicable. It held across variations in materials, in timing, in participant populations. The finding became known as the generation effect, and it established a principle that subsequent decades of research have only strengthened: the act of producing information from one's own cognitive resources creates a stronger memory trace than the act of receiving the same information from an external source.
The mechanism is related to, but distinct from, the spacing and interleaving effects. Generation forces the learner to search through memory, activate related concepts, and construct a response. This constructive process engages associative networks in ways that passive reception does not. When you read the word FAST, you process it. When you generate the word FAST from the cue F_S_, you activate a search process that touches related words (FIRST, FIST, FEAST), rejects candidates that do not fit, and selects the target — all in the act of producing the word. The search process is itself a learning event. It strengthens not just the target word but the network of associations surrounding it.
The implication is radical: the learner who struggles to produce an answer, even if the answer produced is wrong, learns more than the learner who receives the correct answer without struggling. The quality of the answer matters less than the quality of the cognitive effort. The struggle is not a cost that must be paid before learning occurs. The struggle is where the learning occurs.
This finding intersects with a parallel line of research that Bjork's work has extensively contributed to: retrieval practice, sometimes called the testing effect. The testing effect demonstrates that the act of retrieving information from memory — as opposed to restudying it — produces substantially better long-term retention. A landmark 2006 study by Roediger and Karpicke showed that students who studied a passage once and then took a practice test on it remembered more, one week later, than students who studied the passage four times without testing. Four exposures to the material produced less learning than one exposure plus one retrieval attempt.
The finding was not merely a laboratory curiosity. It held in classroom studies, in medical education, in professional training programs. Retrieval practice — the deliberate act of pulling information out of memory rather than putting it back in — emerged as one of the most powerful learning strategies available. And its power derives from the same mechanism as the generation effect: the effort of production is the learning event. Retrieval is not a test of what has been learned. It is the process through which learning happens.
These two findings — the generation effect and the testing effect — converge on a principle that has profound implications for the AI age: cognitive production is learning. The act of constructing an answer, drafting a solution, formulating an argument, debugging a piece of code through your own reasoning — these acts of production are not merely evidence that learning has occurred. They are the mechanism through which learning occurs. Remove the production, and you remove the learning.
AI tools perform a specific cognitive operation on their users: they convert generation into reception. The developer who would have generated a debugging strategy — searched through her knowledge of the codebase, activated her understanding of how similar bugs behave, constructed a hypothesis, tested it, revised it — instead receives a strategy from Claude. The writer who would have struggled to articulate an argument — searched for the right framing, rejected formulations that did not capture the idea, built the argument sentence by sentence through the friction of thinking on the page — instead receives a polished draft. The student who would have retrieved a concept from memory — felt the tip-of-the-tongue frustration, searched through partially forgotten material, reconstructed the idea from fragments — instead receives a complete, fluent explanation from ChatGPT.
In each case, the cognitive operation has shifted from generation to reception. From production to consumption. From the effortful search that builds memory to the easy recognition that leaves memory unchanged. The output may be identical — the debugging strategy may be the same, the argument may be equally well-formed, the explanation may be equally accurate. But the cognitive process that produced the output is fundamentally different, and the learning consequences of that difference are, in Bjork's framework, enormous.
There is a deeper layer to the generation effect that is particularly relevant to creative and professional work. Generation does not merely strengthen the specific item retrieved. It strengthens the entire network of associations that the search process activated. When the developer generates a debugging strategy, she activates not just the specific knowledge of this bug type but her understanding of related bug types, of the architectural patterns that produce this class of error, of the system behaviors she has observed in similar situations. The generation event is a traversal of the knowledge network — and the traversal strengthens every node it touches.
When the same developer receives a debugging strategy from Claude, she processes the strategy — evaluates it, checks whether it makes sense, perhaps modifies it slightly before implementing it. This is not nothing. Evaluation is a cognitive operation. But it is a different operation from generation, and it engages a narrower set of the knowledge network. The developer evaluates the specific solution presented. She does not traverse the broader network of associations that a generation attempt would have activated. The specific solution is processed. The broader expertise is left untouched.
This matters because expertise is not a collection of discrete facts stored independently in memory. Expertise is a network — a richly interconnected web of knowledge, skills, and experiences in which each node is linked to dozens of others through associative connections built over years of practice. The expert does not look up individual pieces of knowledge. She navigates the network, moving from one node to related nodes with a fluency that reflects the density of connections built through years of generation events. Each time she generated a solution rather than receiving one, the traversal strengthened the connections. Each time she struggled with a problem and activated related knowledge in the search for an answer, the network grew denser.
AI tools do not attack individual nodes. They attack the connections. By converting generation into reception, they reduce the number of network traversals the user performs. Each traversal foregone is a set of connections that does not get strengthened. Over time — over months and years of AI-assisted work — the network grows thinner. Not because knowledge is lost but because the connections between knowledge nodes are not maintained. The developer still knows the individual facts. She can still recognize the correct solution when she sees it. But the ability to traverse the network independently — to generate solutions, to see connections between problems, to feel the architectural intuition that comes from a densely connected knowledge base — atrophies.
Segal describes, in *The Orange Pill*, the creative process behind Bob Dylan's "Like a Rolling Stone": twenty pages of raw, unstructured material — what Dylan called "vomit" — compressed over days into six minutes of precision. The twenty pages were a generation event of extraordinary scale. Dylan traversed his entire knowledge network — every musician he had heard, every poet he had read, every argument he had lost in Greenwich Village coffeehouses — and the traversal produced not just the song but the strengthening of the creative network that would produce every subsequent song. The generation was the creative development. The struggle was the medium.
Had Dylan described the song he wanted and received it from a machine, the product might have been indistinguishable. The process would have been categorically different. And the process is where the learning lives.
This distinction — between identical products and non-identical processes — is the distinction that the AI productivity narrative consistently fails to make. The narrative measures products: lines of code, documents produced, problems solved, revenue generated. These are performance metrics. They measure what was accomplished. They do not measure what was learned in the accomplishing.
Bjork's research insists on the second measurement. Not because learning is more important than production in every context — there are contexts where production is what matters and learning is beside the point — but because the two are different, and confusing them produces systematic errors in judgment about when AI assistance is beneficial and when it is costly.
The error is visible in how organizations evaluate AI adoption. The metric is output per unit time. The developer who produces twice as much code with Claude is evaluated as twice as productive. The evaluation is correct on the performance dimension and silent on the learning dimension. If the developer's independent capability — the capability she could deploy without the tool — has decreased during the same period, the performance gain has been purchased at a cost that no quarterly review captures.
Bjork's generation effect research suggests a specific intervention: the generate-first protocol. Before receiving AI assistance, the user generates their own attempt. The attempt may be incomplete. It may be wrong. It may be rough and unpolished compared to what the AI will produce. The quality of the attempt is not the point. The cognitive traversal that the attempt forces — the activation of the knowledge network, the struggle to produce rather than merely receive — is the point.
This protocol has been tested in educational settings. Students who generate an answer before seeing the correct one remember more than students who see the correct answer immediately, even when the generated answer was wrong. The wrongness does not impair learning. The generation effort compensates for the incorrectness of the response. The student who generated "F-I-S-T" before learning the correct answer was "F-A-S-T" still benefits from the generation attempt — because the search process activated the network, and the correction was encoded more deeply against the backdrop of the failed attempt than it would have been in isolation.
Applied to AI-assisted work, the generate-first protocol would function as follows: the developer confronts a bug and spends fifteen minutes generating her own debugging hypothesis before consulting Claude. The lawyer encounters a legal question and drafts her own preliminary analysis before querying the AI research tool. The writer faces a blank page and produces her own rough outline before asking for AI assistance.
In each case, the generation attempt preserves the cognitive process that AI would otherwise eliminate. The subsequent AI assistance is not diminished — the developer still benefits from Claude's solution, the lawyer still benefits from the AI's research, the writer still benefits from the AI's structural suggestions. But the human has performed the generation event first, and the generation event is where the learning happens.
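The protocol can even be enforced mechanically. A toy sketch, taking the fifteen-minute budget from the examples above; every name here is illustrative, not a real tool:

```python
import time

STRUGGLE_BUDGET = 15 * 60  # seconds of independent effort before consulting AI

class GenerateFirstGate:
    def __init__(self, problem: str):
        self.problem = problem
        self.started = time.monotonic()
        self.attempt = None

    def record_attempt(self, attempt: str) -> None:
        # Your own hypothesis, outline, or diagnosis, however rough or wrong.
        self.attempt = attempt

    def may_consult_ai(self) -> bool:
        elapsed = time.monotonic() - self.started
        return self.attempt is not None and elapsed >= STRUGGLE_BUDGET

gate = GenerateFirstGate("intermittent 502s since yesterday's deploy")
gate.record_attempt("suspect pool exhaustion under the new connection timeout")
if gate.may_consult_ai():
    pass  # only now open the assistant, with the attempt pasted alongside
```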
The cost is time. Fifteen minutes of independent effort before consulting the tool. In a productivity culture measured in output per hour, fifteen minutes is a significant overhead. The quarterly review does not reward fifteen minutes of private struggle that produce an inferior preliminary result. The incentive structure of every organization punishes this protocol, because the protocol reduces performance by exactly the amount of time spent generating independently.
This is the core tension that Bjork's research exposes. The intervention that produces the best long-term outcomes for the human is the intervention that produces the worst short-term metrics for the organization. The difficulty is desirable for learning and undesirable for the quarterly report. The resolution of this tension requires either a redefinition of what organizations measure — expanding evaluation beyond output to include demonstrated independent capability — or a cultural shift in which individuals choose short-term difficulty for long-term development without institutional support.
Neither resolution is likely without deliberate, structural intervention. The market selects for ease. The quarterly review selects for output. The tool selects for fluency. Every selection pressure in the system pushes toward the elimination of generation. And every elimination of generation, as Bjork's research demonstrates with the weight of four decades of evidence, is an elimination of the cognitive process through which durable expertise is built.
The generation effect is not a suggestion. It is a finding. The act of producing is the act of learning. Bypass the production, and you bypass the learning — regardless of how polished the received alternative appears.
The common metaphor for memory is storage. A filing cabinet. A hard drive. A library with shelves. Information goes in, sits there, and comes back out when needed. The metaphor is intuitive, universal, and wrong in ways that matter enormously for understanding what artificial intelligence does to the people who use it.
Robert A. Bjork, working with Elizabeth Ligon Bjork, developed a theory of memory that replaces the storage metaphor with something more complex and more accurate. The theory, known as the New Theory of Disuse, proposes that every item in memory possesses not one strength but two — and that the two strengths respond to different conditions, follow different trajectories, and produce different behavioral outcomes that are routinely confused with each other.
Storage strength is how well an item has been encoded. It reflects the depth and quality of the memory trace — how many associative connections link the item to other knowledge, how thoroughly the item has been integrated into the broader network of understanding. Storage strength increases monotonically: it only goes up. Every encounter with information adds something to storage strength. The additions may be small or large depending on the quality of the encoding event, but they do not reverse. Information that has been deeply encoded through effortful processing possesses high storage strength — it is, in a meaningful sense, permanently part of the person's knowledge base.
Retrieval strength is how easily an item can be accessed right now. It reflects the current activation level of the memory trace — how quickly and fluently the information comes to mind when needed. Unlike storage strength, retrieval strength fluctuates constantly. It rises with recent exposure and falls with the passage of time. The word you studied five minutes ago has high retrieval strength. The word you studied five weeks ago may have low retrieval strength — it is harder to access, slower to come to mind, less fluent when it arrives — even if its storage strength is substantial.
The distinction matters because the two strengths are dissociable and, under many conditions, inversely related in their response to learning interventions. Massed practice — studying the same material repeatedly in a single session — produces high retrieval strength and low storage strength. The material feels accessible. The learner feels confident. But the encoding is shallow, because the ease of retrieval during the session prevented the effortful processing that builds durable traces. Spaced practice produces the opposite pattern: low retrieval strength during the gaps between sessions (the material feels less accessible, the learner feels less confident) and high storage strength (the effortful retrieval during each return visit builds deep, well-connected traces).
The relationship between the two strengths produces a phenomenon that is central to Bjork's account of how AI affects learning. When retrieval strength is high — when the information comes to mind easily — the brain has little reason to engage in the deep processing that builds storage strength. The information is already accessible. Why work harder to encode it? This is the mechanism through which ease undermines durability: high retrieval strength reduces the need for the effortful processing that produces high storage strength. The easier it is to access information, the less deeply it is encoded when accessed.
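The dynamics are easier to see in a toy model than in prose. The sketch below is not Bjork's theory stated quantitatively (the New Theory of Disuse is a qualitative framework, and every constant and update rule here is an illustrative assumption), but it encodes the three qualitative claims: storage strength only accumulates, retrieval strength decays across gaps, and a study event deposits storage strength in proportion to how effortful the retrieval was.

```python
# Toy model of the two-strength account. All update rules and constants are
# illustrative assumptions, not Bjork's theory stated quantitatively.
from dataclasses import dataclass

@dataclass
class MemoryItem:
    storage: float = 0.1    # cumulative encoding depth; never decreases
    retrieval: float = 0.0  # current accessibility; fluctuates constantly

    def rest(self, days: float) -> None:
        # Retrieval strength decays between sessions; deeper storage slows the decay.
        decay = 0.5 * days / (1.0 + self.storage)
        self.retrieval = max(0.0, self.retrieval - decay)

    def study(self) -> None:
        # The harder the retrieval (low retrieval strength), the larger the
        # storage deposit. Fully fluent material deposits almost nothing.
        self.storage += 1.0 - self.retrieval
        self.retrieval = 1.0  # the item feels fully accessible right after study

def practice(gap_days: float, sessions: int = 6) -> MemoryItem:
    item = MemoryItem()
    for _ in range(sessions):
        item.study()
        item.rest(gap_days)
    return item

massed = practice(gap_days=0.0)  # six repetitions in a single sitting
spaced = practice(gap_days=3.0)  # six sessions, three days apart

print(f"massed storage strength: {massed.storage:.2f}")  # stalls after session one
print(f"spaced storage strength: {spaced.storage:.2f}")  # grows with every return
```

The massed condition stalls because retrieval strength never falls, so later study events deposit nothing; the spaced condition keeps depositing because every return visit is effortful. The numbers are arbitrary. The shape of the result is the finding.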
This mechanism has been operating since long before artificial intelligence. Massed practice, blocked study, immediate feedback — all of these traditional learning conditions produce high retrieval strength at the expense of storage strength. But AI has amplified the mechanism to an extent that no previous technology has approached. AI tools produce, for all practical purposes, infinite retrieval strength. Any piece of information, any solution to any problem, any analysis of any question is accessible in seconds, from any device, at any time. The retrieval strength available to an AI-augmented human is, functionally, unlimited.
The cost is borne entirely by storage strength. And storage strength is the dimension of memory that matters for independent capability.
Consider the developer who has used Claude Code for six months. She has solved hundreds of problems. She has produced thousands of lines of working code. Her performance metrics are excellent — she ships features faster, handles a wider range of tasks, receives positive reviews from colleagues who see only her output. Her retrieval strength, augmented by the tool, is extraordinarily high. She can access any solution within seconds.
Now remove the tool. Not punitively — perhaps the API is down for a day, or she is in a meeting where she needs to reason about system architecture without her laptop, or she is interviewing at a company that administers whiteboard coding challenges. In these moments, the only memory that matters is the memory she owns — the storage strength that was built through her own encoding events, her own effortful retrieval, her own generation of solutions from her knowledge base rather than from the tool's.
If her storage strength has been growing alongside her performance, the tool removal is a minor inconvenience. She can access less information less quickly, but the deep encoding of six months of practice supports capable independent performance. The tool augmented a strong foundation.
If her storage strength has been stagnating or declining — because the tool's infinite retrieval strength eliminated the need for the effortful processing that builds storage — the tool removal reveals a gap. The performance that looked like expertise was, in part, borrowed from the tool. The confidence that felt like competence was, in part, an artifact of retrieval strength that belonged to Claude rather than to the developer.
Bjork's framework does not predict which scenario occurs in every case. It predicts which conditions produce which outcome. When the tool is used in ways that preserve generation and effortful retrieval — when the developer struggles with the problem before consulting AI, when the gaps between AI-assisted sessions allow partial forgetting that forces deeper encoding on subsequent attempts — storage strength grows alongside retrieval strength. The tool augments without replacing. When the tool is used as the first resort, the instant answer, the path of least cognitive resistance, retrieval strength is maintained entirely by the tool and storage strength receives little or no investment. The tool substitutes rather than augments.
The distinction between augmentation and substitution is, in Bjork's terms, the distinction between conditions that build both storage and retrieval strength and conditions that build only retrieval strength. The first produces a human who is more capable with the tool and retains capability without it. The second produces a human whose capability exists only in the presence of the tool — a capability that is real in the moment and illusory in the aggregate.
This maps onto Segal's account in *The Orange Pill* of the natural-language interface revolution. When the machine learned to speak human language, the translation barrier between intention and execution collapsed. Segal describes this as a liberation — the developer no longer needed to think in code-shaped thoughts but could think in human-shaped thoughts and have the machine handle the translation. Bjork's framework accepts the liberation as real while asking what happens to the translator.
The developer who spent years translating her intentions into code was performing thousands of generation events per day. Each translation was a retrieval: pulling syntactic knowledge, architectural patterns, debugging heuristics from memory and constructing a solution. Each translation was an encoding event: the struggle of implementation deposited layers of understanding that accumulated into architectural intuition. The translation was tedious. It was also the cognitive operation through which storage strength was built.
When the translation is eliminated — when natural language replaces programming language as the interface — the generation events per day drop precipitously. The developer still makes decisions. She still exercises judgment. But the number of moments per day in which she must retrieve, construct, and produce from her own cognitive resources has been reduced by an order of magnitude. The retrieval strength is maintained by the tool. The storage strength receives fewer deposits. The account balance looks the same from the outside — the developer is producing at a high level — but the composition has changed. More of the balance is held in the tool's account and less in the developer's.
There is an analogy from a different domain that clarifies the mechanism. Consider the difference between a person who navigates a city using a paper map and a person who navigates using GPS turn-by-turn directions. The GPS user arrives at the destination more efficiently. The performance metric — time to destination, accuracy of route — favors the GPS user. But the map user has built a cognitive map of the city. She can navigate without the tool. She can take detours, recognize shortcuts, understand spatial relationships between neighborhoods. Her storage strength for the city's geography is high because the effortful process of reading, interpreting, and making decisions from the map built deep spatial encoding.
The GPS user has high retrieval strength — the tool retrieves the route instantly — and low storage strength. She cannot describe the route she just took. She cannot navigate to a nearby destination without the tool. She arrived efficiently and learned nothing about the terrain. This finding has been empirically documented: GPS users develop weaker spatial memory than map users, even after navigating the same routes.
AI tools are GPS for cognition. They get you to the destination faster. They get you there with less effort. They get you there with higher performance on every metric that measures the journey. And they leave the traveler with less knowledge of the terrain than she would have acquired through the slower, harder, more effortful process of navigating without assistance.
The New Theory of Disuse adds one more element that is particularly relevant to the AI transition: the relationship between retrieval strength and the benefit of a retrieval event. When retrieval strength is already high — when the information is easily accessible — a retrieval event produces little additional storage strength. The information is already activated; retrieving it again does not require the effort that drives deeper encoding. When retrieval strength is low — when the information has begun to fade, when retrieval requires genuine cognitive work — a successful retrieval event produces substantial storage strength gains. The harder the retrieval, the more the encoding benefits.
This is the formal mechanism behind the spacing effect. The gap between practice sessions allows retrieval strength to decay. The decay makes the subsequent retrieval harder. The harder retrieval produces greater storage strength gains. The difficulty of the retrieval is not an obstacle to learning. It is the engine of learning. And the engine runs on the fuel of partial forgetting.
AI tools never allow partial forgetting to occur. The information is always there, always accessible, always at high retrieval strength. The engine of deep encoding — the engine that runs on the difficulty of retrieval across a gap of partial forgetting — is never engaged. Storage strength receives no investment from the hardest-working engine in the learning system.
Bjork has observed that the human memory system possesses, for practical purposes, unlimited storage capacity. This is a remarkable feature of biological memory, and it distinguishes human memory from every computational memory system ever built. Unlimited storage capacity means that the problem of human memory is never "running out of space." The problem is always retrieval — finding what has been stored among the vast archive of everything that has ever been encoded.
Forgetting, in this framework, is not the loss of stored information. It is the loss of access to stored information. The memory trace remains, encoded at whatever storage strength the original processing produced, but the retrieval routes that lead to it have weakened through disuse. The information is there. The path to it has grown over.
AI tools have no forgetting. Every piece of information remains at full retrieval strength indefinitely. This is computationally efficient and cognitively unnatural. The human memory system was designed by evolution to forget — not as a failure mode but as an adaptive feature. Forgetting clears the retrieval landscape of outdated information, reduces interference between competing memories, and — most relevantly for Bjork's framework — creates the conditions under which subsequent retrieval events produce maximal learning.
A memory system that never forgets is a system that never creates the conditions for its own deepening. It is perpetually accessible and perpetually shallow. It has infinite retrieval strength and whatever storage strength it happened to build during the initial encoding — which, in the absence of forgetting-and-retrieval cycles, may be very little.
This is the system that AI creates for its users. Not a system without knowledge, but a system with a specific and pathological distribution of knowledge: vast retrieval, thin storage. The user can access anything. The user has encoded little. The access feels like understanding, and the feeling is wrong in the precise way that Bjork's metacognitive research predicts.
What would it mean to design AI tools that respect the two-strength architecture of human memory? It would mean building tools that deliberately allow retrieval strength to decay before offering assistance — tools that wait, that do not answer immediately, that create the gap in which partial forgetting produces the conditions for deep encoding. It would mean tools that track the user's retrieval history and increase the delay before assistance for topics the user has been assisted with recently, forcing the effortful retrieval that builds storage strength. It would mean tools that administer periodic dependency audits — moments where the tool withdraws and the user's independent retrieval capability is measured, not as a punishment but as a diagnostic, the way a physician measures a patient's unassisted blood pressure before adjusting the medication.
These designs are feasible. They are not being built, because they would make the tools feel worse to use. A tool that waits before answering feels slower than a tool that answers instantly. A tool that increases delay based on prior usage feels unresponsive. A tool that periodically withdraws feels unreliable. Every design choice that respects the two-strength architecture of human memory produces a user experience that the market interprets as inferior.
The market is measuring retrieval strength. It should be measuring storage strength. The distance between what the market measures and what the science says matters is the distance between a civilization that uses AI to become more capable and a civilization that uses AI to feel more capable while becoming less so — one fluent, frictionless, confidently wrong interaction at a time.
---
In 1993, K. Anders Ericsson, Ralf Krampe, and Clemens Tesch-Römer published a study of violinists at the Berlin Academy of Music that became one of the most cited papers in the psychology of expertise. They divided students into three groups based on the assessment of their professors: the best violinists, the good violinists, and the future music teachers. Then they measured how the groups had spent their time over the preceding decade.
The finding was striking in its clarity: the primary difference between the groups was not talent, not opportunity, not the quality of instruction. It was the accumulated amount of what Ericsson called deliberate practice — practice that was effortful, focused on the areas of greatest weakness, conducted at the edge of current capability, and guided by informed feedback. By age twenty, the best violinists had accumulated an average of ten thousand hours of deliberate practice. The good violinists had accumulated around seven thousand five hundred. The future teachers around five thousand.
The number itself became famous — popularized as the "ten-thousand-hour rule" in ways that simplified and occasionally distorted Ericsson's finding. But the underlying mechanism, viewed through Bjork's framework, reveals something more specific and more consequential than a quantity threshold. Deliberate practice is not merely a large amount of time spent doing something. It is time spent engaging with desirable difficulties at the edge of capability.
The violinist practicing a passage she can already play perfectly is not engaging in deliberate practice. She is performing. The practice feels good. The sound is pleasant. The metacognitive signal says: this is going well. But the learning is minimal, because the difficulty is absent. The violinist practicing a passage she cannot yet play — struggling with a fingering that does not yet feel natural, working through a bow transition that keeps coming out ragged, failing repeatedly at a tempo she cannot yet sustain — is engaging in deliberate practice. The practice feels bad. The sound is unpleasant. The metacognitive signal says: this is going poorly. But the learning is substantial, because the difficulty forces the deeper processing through which skill develops.
Deliberate practice, examined through Bjork's lens, is substantially composed of desirable difficulties. The spacing of practice sessions across days and weeks, with gaps that force retrieval of technique that has begun to fade. The interleaving of different pieces and different technical challenges within a single session, preventing the false fluency that comes from repeating the same passage in a block. The variation of practice conditions — different tempos, different dynamics, different rooms with different acoustics — that produces the flexible motor programs needed for performance under the variable conditions of a concert hall. The delayed feedback of hearing a recording of one's own playing hours or days later, when the metacognitive distortions of real-time self-monitoring have had time to settle.
This is the architecture through which expertise is built — not just in music but in every domain that has been studied. Medical diagnosis. Chess. Software engineering. Legal reasoning. Scientific research. Athletic performance. In every case, the expert is the person who has engaged with desirable difficulties systematically, at the edge of capability, over an extended period. The expert has not avoided struggle. The expert has specialized in it.
The question that the AI transition forces is not what happens to existing experts. Existing experts — the senior engineer, the experienced physician, the master musician — possess storage strength built through decades of desirable difficulty. That storage strength is real, deep, and not easily eroded by the availability of new tools. The existing expert who uses AI is augmented by it. Her judgment, built through years of generation events, directs the tool. The tool amplifies capability that was already substantial.
The question is what happens to the novice. The person who has not yet built the storage strength. The person who is at the beginning of the ten-thousand-hour trajectory. The person for whom the desirable difficulties of early practice are not merely helpful but constitutive — they are the process through which expertise forms in the first place.
AI tools, as currently designed, offer the novice a shortcut past the difficulties that the expert had to traverse. The junior developer does not need to struggle with debugging — Claude provides solutions. The medical student does not need to struggle with differential diagnosis — an AI diagnostic tool narrows the possibilities. The law student does not need to wrestle with case analysis — an AI research assistant provides the relevant precedents, summarized and organized.
The shortcuts are real. The novice who takes them produces better output in the short term than the novice who does not. The junior developer who uses Claude ships features faster than the junior developer who debugs independently. The performance metrics favor the shortcut — the same performance metrics that four decades of Bjork's research have shown to be unreliable indicators of learning.
But performance is being measured in the short term, and the learning that is being foregone is the learning that would have produced expertise in the long term. Each desirable difficulty bypassed is a deposit not made into the storage-strength account. The junior developer who uses Claude for every debugging task has not built the diagnostic network — the densely interconnected web of error patterns, system behaviors, architectural intuitions — that a decade of independent debugging would have produced. The medical student who relies on AI diagnosis has not built the associative pathways between symptoms, conditions, and clinical presentations that decades of deliberate diagnostic practice would have wired.
The expertise has not been replaced by the tool. It has never formed. There is nothing to replace. The storage-strength account is low not because withdrawals were made but because deposits were never accumulated. The novice has skipped the developmental phase that produces the expert, and the skip is invisible because the tool compensates for the missing capability in real time.
The compensation is seamless. That is precisely the problem. The junior developer with Claude looks, from the outside, like a capable developer. The output is correct. The features ship. The code reviews pass. The performance metrics confirm: this person is productive. But the productivity is a joint product of the person and the tool, and the share of the joint product attributable to the person may be declining with each AI-assisted task — because each task that the tool handles is a desirable difficulty that the person's cognitive architecture did not engage with.
This is the generation gap that Bjork's framework predicts. Not a gap between old experts and young novices, but a gap between the process that built the current generation of experts and the process that is building the next. The current generation was forged through friction. The next generation is being formed through fluency. And fluency, as the entire body of desirable-difficulty research demonstrates, does not produce the same cognitive architecture as friction.
There is a version of this concern that is merely nostalgic — the old guard insisting that the young must suffer as they suffered. That is not what the research supports. The research does not say that suffering is good. It says that specific cognitive operations — retrieval, generation, discrimination, self-assessment — are necessary for the development of expertise, and that these operations are experienced as difficult. The suffering is incidental. The cognitive operations are essential. The question is whether AI-assisted development preserves those operations or eliminates them.
Segal describes, in *The Orange Pill*, a senior engineer in Trivandrum whose decades of implementation experience compressed into a judgment layer that directed AI tools with extraordinary effectiveness. The judgment was real, irreplaceable, and the product of years of desirable difficulty. Segal recognizes the value of that judgment while celebrating the productivity gains that AI enables. Bjork's framework adds a temporal dimension to this recognition: the senior engineer's judgment exists because she was formed in an era that required her to struggle. The junior engineer being formed in the current era, using the same tools under the same productivity pressures, is not engaged in the same cognitive operations. The judgment that the senior engineer possesses may not form in her junior colleague — not because the junior colleague is less talented, but because the environment has been stripped of the difficulties through which judgment develops.
This is not a theoretical prediction. It is an empirical concern grounded in decades of research on how expertise forms. The research is clear: expertise is built through accumulated encounters with desirable difficulty at the edge of current capability. Reduce the difficulty, and you reduce the rate of expertise formation. Eliminate the difficulty, and you eliminate the conditions under which expertise can form at all.
The solution is not to withhold AI from novices. Novices need AI, perhaps more than experts do, because AI expands what a novice can attempt and therefore what a novice can learn from. The solution is to structure AI use in ways that preserve the desirable difficulties that expertise requires. The generate-first protocol — struggle before assistance. Progressive challenge frameworks — increasing the complexity of unassisted tasks as competence grows. Mentoring structures that pair AI-assisted production with human-guided reflection on the process, not just the product, of the work.
These structures require institutional commitment. They require organizations to invest in the long-term development of their people rather than the short-term maximization of their output. They require educational institutions to redesign curricula around the preservation of productive struggle rather than the celebration of frictionless production. They require a cultural shift in which difficulty is understood not as an obstacle to be removed but as a resource to be managed — directed toward the cognitive operations that build expertise and away from the tedium that merely wastes time.
The distinction is critical, and Bjork has been explicit about it throughout his career: not all difficulty is desirable. Poorly designed interfaces, confusing instructions, irrelevant obstacles — these difficulties do not engage the cognitive processes that build expertise, and removing them is unambiguously beneficial. The question is always whether the difficulty engages effortful retrieval, generation, discrimination, or self-assessment. When it does, the difficulty is doing cognitive work. When it does not, it is merely irritating.
AI tools remove both kinds of difficulty indiscriminately. They remove the tedious and the formative with equal efficiency. The task for individuals, organizations, and institutions is to distinguish between the two — to welcome AI's removal of undesirable difficulty while preserving, deliberately and structurally, the desirable difficulties through which the next generation of experts must be formed.
The expertise that never forms is a loss that produces no signal. There is no alarm that sounds when a junior developer fails to build diagnostic intuition. There is no metric that captures the absence of expertise that would have existed had the developmental path included more struggle. The loss is visible only on the delayed test — the moment when the tool is unavailable and the person's independent capability is revealed.
By then, the investment period has passed. The deposits were never made. And no amount of retroactive struggle can build what daily practice over years would have deposited, one desirable difficulty at a time.
---
Everything in the previous six chapters has been diagnosis. The finding is clear. The mechanism is understood. The evidence is replicated across domains, populations, and decades of research. AI tools, as currently designed, systematically eliminate the desirable difficulties through which human beings develop durable expertise. The metacognitive illusion of fluency conceals the loss. The performance metrics that govern the technology industry, the education system, and the modern workplace fail to capture it. The result is a civilization that is becoming measurably more productive and possibly less capable — more output, less understanding; more performance, less learning.
The diagnosis is uncomfortable. The prescription is harder still, because it requires building tools that feel worse to use in a market that rewards tools that feel better.
But the prescription is specific. Bjork's four decades of research do not merely identify the problem. They specify, with experimental precision, the conditions that produce durable learning. And those conditions can be translated into design principles for AI tools that preserve desirable difficulty rather than eliminating it.
The first principle is the generate-first protocol, and it is the simplest to implement and the hardest to adopt. The principle is straightforward: before the AI provides its answer, the user produces their own attempt. The user generates a debugging hypothesis before Claude offers one. The user drafts a preliminary argument before the AI assistant structures one. The student formulates an answer before ChatGPT delivers one.
The cognitive basis for this protocol is the generation effect: information that is produced by the learner is encoded more deeply than information that is received. The generation attempt activates the knowledge network, forces retrieval of related concepts, and creates the cognitive conditions under which the subsequent AI-provided answer is processed more deeply. Even a wrong generation attempt — a debugging hypothesis that turns out to be incorrect, a preliminary argument that misses the key point — produces learning benefits that immediate AI assistance does not.
Implementation could take several forms. A tool that requires a minimum-length user response before generating its own. A setting that introduces a mandatory delay — thirty seconds, two minutes, five minutes — between the user's query and the AI's response, during which the user is prompted to write their own attempt. A mode that asks, "What is your initial approach to this problem?" and refuses to proceed until the user has articulated one.
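A minimal sketch of the first variant, assuming a hypothetical `ask_model` function standing in for whatever completion API the tool actually calls; the attempt threshold and the delay are illustrative, not recommendations:

```python
# Generate-first gate: no AI answer until the user has produced an attempt.
# `ask_model` is a hypothetical stand-in for any real completion API.
import time

MIN_ATTEMPT_CHARS = 200  # illustrative threshold for a substantive attempt
THINKING_DELAY_S = 120   # illustrative mandatory delay before assistance

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in a real completion call here")

def assisted_answer(question: str) -> str:
    attempt = input("What is your initial approach to this problem?\n> ")
    while len(attempt.strip()) < MIN_ATTEMPT_CHARS:
        attempt += "\n" + input("Say more: what would you try first, and why?\n> ")
    time.sleep(THINKING_DELAY_S)  # the gap in which generation does its work
    # The attempt travels with the question, so the answer can be framed as
    # feedback on the user's reasoning rather than a replacement for it.
    return ask_model(
        f"Question: {question}\n\nMy attempt:\n{attempt}\n\n"
        "Critique my attempt before giving your own answer."
    )
```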
Each of these implementations would reduce the perceived responsiveness of the tool. Each would introduce friction into an interaction designed to be frictionless. Each would produce user complaints, lower satisfaction scores, and competitive disadvantage against tools that respond instantly. Each would also, according to the evidence, produce users who learn more and develop greater independent capability.
The second principle is partial scaffolding. Instead of providing complete solutions, the AI provides frameworks, hints, and partial answers that require the user to complete the cognitive work. A debugging assistant that identifies the category of error but not the specific fix — forcing the developer to apply the categorization to her specific codebase. A research tool that identifies relevant cases but does not summarize them — requiring the lawyer to read and synthesize the material. A writing assistant that suggests structural options but does not fill them in — requiring the writer to generate the content that gives the structure meaning.
Partial scaffolding preserves what educational psychologists call the "zone of proximal development" — the space between what the learner can do independently and what the learner can do with assistance. The zone is where learning happens. Full solutions collapse the zone by doing everything. No assistance leaves the learner outside the zone, struggling with tasks beyond current capability. Partial scaffolding keeps the learner inside the zone — supported enough to make progress, challenged enough to grow.
The implementation requires sophistication. The AI must assess the user's current capability level — either explicitly, through periodic assessments, or implicitly, through the pattern of the user's prior interactions — and calibrate its assistance to land inside the zone. Too much scaffolding, and the user is receiving rather than generating. Too little, and the user is frustrated without learning benefit. The calibration is dynamic: as the user develops, the scaffolding should thin, exposing more of the cognitive work to the user and less to the tool.
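One way the calibration might be sketched, with scaffolding levels and thresholds that are illustrative assumptions rather than validated cut-points:

```python
# Dynamic scaffolding: assistance thins as the user's recent unassisted
# success rate rises. Levels and thresholds are illustrative assumptions.
from collections import deque

class ScaffoldCalibrator:
    def __init__(self, window: int = 20):
        # 1 = solved without help, 0 = needed help, over the last N tasks
        self.recent = deque(maxlen=window)

    def record(self, solved_unassisted: bool) -> None:
        self.recent.append(1 if solved_unassisted else 0)

    def level(self) -> str:
        if not self.recent:
            return "partial_solution"  # start inside the zone, not outside it
        rate = sum(self.recent) / len(self.recent)
        # Keep the user where tasks are hard but reachable.
        if rate < 0.25:
            return "full_solution"     # worked example to study
        if rate < 0.50:
            return "partial_solution"  # key steps given, user completes the rest
        if rate < 0.80:
            return "category_hint"     # name the class of problem, nothing more
        return "no_assistance"         # ready to generate independently
```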
This is technically feasible. Adaptive tutoring systems have been calibrating difficulty to learner level for decades. The AI platforms that cite Bjork's research in their design documents — the intelligent tutoring systems that implement spaced retrieval and adaptive difficulty — demonstrate that the engineering is within reach. The barrier is not technical. It is commercial: the adaptive system that adjusts difficulty upward as the user improves feels harder to use over time, which is exactly the wrong trajectory for user satisfaction metrics and exactly the right trajectory for learning.
The third principle is spaced assistance. Rather than providing help on demand — any question, any time, instant response — the AI distributes its assistance across time, introducing gaps between interactions during which the user must rely on independent cognition. The implementation could be as simple as a daily assistance budget: the tool answers a set number of queries per day, forcing the user to prioritize which questions merit AI assistance and which can be resolved through independent effort. Or it could be more sophisticated: the tool tracks which topics the user has received assistance on and increases the delay before providing further assistance on the same topic, forcing the spaced retrieval that builds storage strength.
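The topic-tracking variant might look something like the sketch below; the one-hour base gap and the doubling rule are illustrative assumptions that a real system would tune empirically:

```python
# Per-topic spacing scheduler: each assist on a topic lengthens the required
# gap before the next one, forcing independent retrieval in between.
import time
from collections import defaultdict

class SpacedAssistance:
    BASE_GAP_S = 3600  # illustrative: the first repeat must wait an hour

    def __init__(self):
        self.last_assist: dict[str, float] = {}  # topic -> timestamp of last assist
        self.assist_count = defaultdict(int)     # topic -> assists so far

    def may_assist(self, topic: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        if topic not in self.last_assist:
            return True  # first question on a topic is always answered
        # The gap doubles with every assist: 1h, 2h, 4h, an expanding schedule.
        required = self.BASE_GAP_S * (2 ** (self.assist_count[topic] - 1))
        return now - self.last_assist[topic] >= required

    def record_assist(self, topic: str, now: float | None = None) -> None:
        now = time.time() if now is None else now
        self.last_assist[topic] = now
        self.assist_count[topic] += 1
```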
The spacing principle is the most directly grounded in Bjork's research. The spacing effect is the oldest, most replicated, and most robust finding in the learning sciences. The mechanism is understood: the gap between assistance events allows retrieval strength to decay, and the effortful retrieval forced by the decay builds storage strength that immediate assistance never does. Every gap in which the user must function without the tool is a potential learning event — a moment when the brain must retrieve, reconstruct, and generate from its own resources.
AI tools currently allow no gaps. The tool is always available, always responsive, always ready. The user never experiences the productive discomfort of a moment when the answer is not instantly accessible. The spacing principle would reintroduce those moments — deliberately, structurally, in service of the storage strength that continuous access prevents.
The fourth principle is interleaved challenge modes. The AI alternates between assisted and unassisted tasks within a single work session, mixing problems for which AI assistance is available with problems for which the user must work independently. The mixing forces the discrimination that interleaving produces: the user must determine, for each problem, whether assistance is available and adjust their cognitive strategy accordingly. The unassisted problems force generation and retrieval. The assisted problems provide models and feedback. The alternation produces the kind of flexible, transferable knowledge that blocked assistance — all-assisted or all-unassisted — does not.
Implementation could take the form of periodic "solo rounds" within an otherwise AI-assisted workflow: every fourth task, the AI withdraws and the user works independently. Or it could be problem-type-dependent: the AI assists with novel problem types but withholds assistance on problem types the user has encountered before, forcing independent retrieval of previously learned solutions.
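The solo-round variant is almost trivial to express. In the sketch below, the every-fourth cadence comes from the prose above; everything else is an assumption:

```python
# Interleaved challenge mode: the assistant withdraws on every Nth task.
def interleaved_tasks(tasks, solo_every: int = 4):
    """Yield (task, assistance_available) pairs, withdrawing on every Nth task."""
    for i, task in enumerate(tasks, start=1):
        yield task, (i % solo_every != 0)

for task, assisted in interleaved_tasks(["fix-login", "add-cache", "write-tests", "refactor-auth"]):
    print(f"{task}: {'AI-assisted' if assisted else 'solo round'}")
```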
The fifth principle addresses the metacognitive illusion directly: confidence calibration. Before the AI reveals its answer, the user is asked to rate their confidence in their own response. How certain are you that your approach is correct? After the AI provides its answer, the user sees both their confidence rating and their actual accuracy. Over time, the system builds a calibration profile — a record of the gap between the user's confidence and their actual performance.
This is metacognitive training. It does not change what the user knows. It changes how accurately the user monitors what they know. Over weeks and months of calibration feedback, the user develops a more accurate sense of the boundary between what they understand and what they merely feel they understand. The fluency illusion — the tendency to mistake the ease of AI-provided answers for the depth of personal understanding — is weakened by repeated exposure to the gap between predicted and actual independent performance.
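The bookkeeping the calibration profile requires is minimal. A sketch, with illustrative field names:

```python
# Calibration profile: confidence is recorded before the answer is revealed,
# and the running gap between confidence and accuracy is surfaced to the user.
from dataclasses import dataclass, field

@dataclass
class CalibrationProfile:
    confidences: list[float] = field(default_factory=list)  # 0.0-1.0, pre-answer
    outcomes: list[bool] = field(default_factory=list)      # was the attempt correct?

    def record(self, confidence: float, correct: bool) -> None:
        self.confidences.append(confidence)
        self.outcomes.append(correct)

    def overconfidence(self) -> float:
        # Positive values mean confidence is running ahead of capability:
        # the fluency illusion, made visible as a number.
        if not self.outcomes:
            return 0.0
        mean_confidence = sum(self.confidences) / len(self.confidences)
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return mean_confidence - accuracy
```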
The sixth principle brings these together into a structural intervention that Bjork's framework logically implies: the dependency audit. Periodically — weekly, monthly, quarterly — the tool withdraws entirely and the user's independent capability is assessed. Not as a punishment. Not as a threat. As a diagnostic. The way a physician periodically measures unassisted blood pressure to determine whether the medication is treating the condition or merely masking it.
The dependency audit answers the question that no performance metric captures: what can this person do without the tool? Is the augmentation building independent capability, or is it substituting for capability that is not developing? Is the user's storage strength growing alongside the retrieval strength that the tool provides, or has the tool's retrieval strength become a replacement for, rather than a complement to, the user's own?
The audit results are not a judgment on the user. They are a judgment on the system — the configuration of tool use, difficulty preservation, and learning support that the user has been operating within. A user whose independent capability is declining is not failing. The system is failing the user. The tool is configured in a way that substitutes for rather than builds human capability. The audit detects this failure before its consequences become irreversible.
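The audit's bookkeeping is similarly small. A sketch, with the monthly cadence and the crude trend test as illustrative assumptions:

```python
# Dependency audit: at a fixed cadence the tool withdraws and the user's
# unassisted performance is logged as a diagnostic time series.
from datetime import date, timedelta

class DependencyAudit:
    def __init__(self, cadence_days: int = 30):
        self.cadence = timedelta(days=cadence_days)
        self.history: list[tuple[date, float]] = []  # (audit date, unassisted score)

    def due(self, today: date) -> bool:
        return not self.history or today - self.history[-1][0] >= self.cadence

    def record(self, today: date, unassisted_score: float) -> None:
        self.history.append((today, unassisted_score))

    def trajectory(self) -> str:
        # A judgment on the system, not the user: a falling series means the
        # current configuration is substituting for capability, not building it.
        if len(self.history) < 2:
            return "insufficient data"
        return "declining" if self.history[-1][1] < self.history[0][1] else "stable or growing"
```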
The commercial objection to these principles is obvious and powerful. The market rewards ease. Users choose the tool that responds instantly, provides complete solutions, is always available, and never makes them feel uncertain about their own capability. Every principle outlined in this chapter moves in the opposite direction: toward delay, toward partial solutions, toward restricted availability, toward the deliberate cultivation of productive uncertainty.
The tool that implements these principles will feel worse to use than the tool that does not. It will score lower on user satisfaction surveys. It will lose market share to competitors that provide instant, complete, always-available, confidence-affirming responses. The market will select against the very features that the evidence says are necessary for durable human development.
This is the structural version of the paradox that Bjork has been documenting for forty years: the conditions that produce the best long-term outcomes feel worst in the short term, and the conditions that feel best in the short term produce the worst long-term outcomes. Translated from the laboratory to the marketplace, the paradox becomes an economic prediction: without intervention, the market will optimize AI tools toward maximum ease and minimum learning.
The intervention could come from regulation — requirements that AI tools designed for educational or professional development contexts incorporate desirable-difficulty features. It could come from institutional policy — organizations requiring that AI tools deployed to their employees include generate-first protocols, spacing constraints, and periodic dependency audits. It could come from the tools themselves — AI companies choosing, against short-term commercial interest, to build products that prioritize long-term human development over immediate user satisfaction.
Or it could come from individuals. From the developer who chooses to struggle with the bug for fifteen minutes before asking Claude. From the student who writes her own draft before consulting ChatGPT. From the professional who tracks her independent capability over time and adjusts her AI usage when the trajectory points downward. From anyone who understands that the easy path produces the feeling of progress and the hard path produces the reality of it, and who chooses, deliberately and repeatedly, the path that feels worse and works better.
The designs exist. The evidence supports them. The mechanisms are understood. The engineering is feasible. What remains is the harder problem: whether the market, the institutions, and the individuals will choose to build tools that teach rather than merely perform — knowing that the teaching feels like friction, and the performance feels like flight.
---
A teacher stands in front of a classroom and faces a problem that no previous generation of educators has confronted.
Her students have access to a tool that can answer any question she asks. Not approximately. Not eventually. Right now, with remarkable accuracy, in polished prose, with citations, with structural clarity that most of her students could not produce independently. The tool is in their pockets. It is on their laptops. It is woven into the operating systems of the devices they carry as naturally as car keys.
She can ban the tool. Many of her colleagues have tried. The bans are unenforceable except under examination conditions, and even there, the enforcement requires a degree of surveillance that transforms the educational relationship into something adversarial. The ban also carries a pedagogical cost: students who are prevented from using AI in school will use it the moment they leave, and the school will have taught them nothing about how to use it wisely. Prohibition teaches avoidance, not judgment.
She can embrace the tool. Many of her colleagues have done this, too, redesigning assignments around AI-assisted production, teaching students to prompt effectively, celebrating the output that AI enables. The embrace produces impressive artifacts — essays that are more polished, projects that are more ambitious, presentations that are more visually sophisticated than anything her students could have produced alone. The performance metrics improve. The portfolios gleam.
But Bjork's research suggests that neither the ban nor the uncritical embrace addresses the actual problem. The problem is not the tool. The problem is what happens to the cognitive processes that produce learning when the tool eliminates the conditions under which those processes occur.
The educator's dilemma, stated precisely, is this: the tools that make students produce the best output are the tools that may cause students to learn the least. And the conditions that produce the deepest learning are the conditions that make students feel least productive, least confident, and most frustrated.
This is not a technology problem. It is a learning-science problem that technology has made urgent.
Bjork's research provides a specific lens for examining what happens in a classroom where AI assistance is freely available. The student who uses ChatGPT to write an essay skips the generation event — the struggle of finding the right word, building the argument from scattered thoughts, discovering through the friction of writing on a blank page what one actually thinks about the topic. The essay exists. The understanding does not. The student who uses an AI research tool to summarize source material skips the retrieval practice — the effortful process of reading, synthesizing, connecting, and reconstructing the material in one's own words. The summary exists. The encoding does not. The student who uses AI to solve a problem set skips the interleaving and the error-driven learning — the productive failures, the wrong turns, the discrimination between problem types that builds the conceptual framework for understanding when to apply which approach. The solutions exist. The mathematical thinking does not.
In each case, the performance is present and the learning is absent. And the metacognitive illusion ensures that the student cannot tell the difference. The essay feels like understanding because it reads like understanding. The summary feels like knowledge because the material is now familiar. The solutions feel like competence because they are correct.
The teacher who evaluates the output sees performance. The delayed test — administered weeks later, without AI assistance, under conditions that require independent retrieval and application — would reveal the learning. But the delayed test is rarely administered, because the curriculum moves forward, the grades are recorded, and the institutional incentive structure rewards coverage over depth.
Bjork's framework suggests a pedagogical reorientation that is radical in its simplicity: stop evaluating what students produce and start evaluating how students think.
The teacher Segal describes in *The Orange Pill* — the one who stopped grading essays and started grading questions — was implementing this reorientation intuitively. The assignment was not to produce a polished artifact but to produce the five questions that would need to be answered before a meaningful artifact could be created. The questions require the student to engage in a cognitive operation that AI cannot perform on her behalf: identifying what she does not know.
This is a metacognitive operation of the highest order. Knowing what you know is easy — you retrieve it, you recognize it, you feel the fluency. Knowing what you do not know requires confronting the boundary between knowledge and ignorance, sitting with the discomfort of incompleteness, and mapping the territory of one's own uncertainty. It is precisely the operation that Bjork's research on metacognitive illusions identifies as the most resistant to development and the most important for self-directed learning.
AI tools are extremely good at answering questions. They are unable to originate the questions that matter — the questions that arise from the specific intersection of a person's knowledge, ignorance, curiosity, and circumstances. "What am I for?" is not a prompt. It is the product of a particular twelve-year-old lying in bed after watching a machine do her homework. No AI generated that question, because no AI has her specific combination of capability, limitation, and existential concern.
Teaching students to generate such questions — to identify the boundaries of their understanding, to formulate inquiries that open productive lines of investigation, to develop the metacognitive accuracy that prevents the fluency illusion from substituting for genuine comprehension — is the educator's core task in the AI age. Everything else that the teacher has traditionally done — delivering content, grading output, assessing mastery through performance metrics — is either automatable or misaligned with the learning science.
The pedagogy of desirable difficulty in the AI age would look different from the pedagogy that preceded it. Several specific approaches emerge from Bjork's framework.
Generation before delegation. Before any AI-assisted assignment, the student produces their own attempt. A first draft before consulting a writing assistant. A preliminary analysis before querying a research tool. A set of candidate solutions before asking for algorithmic help. The attempt is graded not on its quality — which may be rough, incomplete, and inferior to what the AI will produce — but on its evidence of cognitive engagement: Does the attempt demonstrate that the student has activated their knowledge network? Has the student struggled productively with the material? Is there evidence of retrieval, of construction, of the cognitive work that produces encoding?
Spaced AI access. Rather than continuous availability, AI tools are available during specific phases of the learning process and unavailable during others. The initial engagement with new material occurs without AI assistance, forcing the effortful encoding that produces storage strength. AI is introduced after a delay — after the student has struggled, has encoded initial impressions, has identified the boundaries of their understanding. The AI assistance then functions as elaborative feedback on an already-encoded foundation, deepening understanding rather than substituting for it.
Interleaved assessment. Evaluation alternates between AI-assisted and AI-unassisted tasks. The AI-assisted tasks demonstrate what the student can accomplish with the tool — a legitimate competency in a world where the tool exists. The AI-unassisted tasks demonstrate what the student has actually learned — the independent capability that persists when the tool is removed. Both assessments are necessary. Neither alone captures the full picture. But the unassisted assessment is the one that reveals learning, and it is the one that institutional incentive structures currently underweight.
Calibration exercises. Regular, low-stakes exercises in which students predict their performance before demonstrating it. How well do you think you understand this concept? Write your answer, then check it against the source material. Rate your confidence in your solution, then compare it to the AI's solution. Track your calibration over time: is your confidence converging with your actual accuracy, or is it drifting upward as the fluency illusion compounds?
These approaches share a common structure: they use AI tools while preserving the cognitive operations that AI tools tend to eliminate. They do not ban the technology. They do not celebrate it uncritically. They treat it as a powerful intervention that must be designed into the learning environment with the same care that a pharmacologist brings to dosing a potent medication — effective at the right dose, toxic at the wrong one, and dangerous when self-administered without monitoring.
The institutional barriers to this approach are substantial. Universities are evaluated on graduation rates, job placement, and student satisfaction — performance metrics, every one. A university that implements desirable-difficulty pedagogy may see lower student satisfaction (the learning feels harder), lower short-term performance metrics (students produce less polished output during the learning phase), and higher rates of productive struggle that students may interpret as pedagogical failure. The institutional incentive structure rewards the easy path — the same easy path that the learning science identifies as the path to shallow, fragile, non-transferable understanding.
The misalignment between institutional metrics and learning science is not new. It predates AI by decades. Bjork has been documenting it for his entire career — the gap between what institutions measure and what the evidence says matters. But AI has widened the gap to a chasm. Before AI, the easy path produced somewhat worse learning than the difficult path. With AI, the easy path may produce no learning at all, because the tool can substitute entirely for the cognitive operations that produce encoding. The student who uses AI for every assignment may graduate with a portfolio of impressive artifacts and no independent capability whatsoever.
The educator standing in front of her classroom knows this. She knows it not from reading Bjork's papers — though the papers would confirm her intuition — but from the specific, granular, embodied knowledge that comes from watching students learn. She can see the difference between a student who has struggled with an idea and a student who has received one. The struggled-with idea lives differently in the student's mind — it connects to other ideas, it generates questions, it produces the specific quality of engagement that experienced teachers recognize and that no performance metric captures.
She is watching that quality become rarer.
The students are producing more. They are producing faster. The output is better by every measurable standard. And the quality of engagement — the cognitive depth that she knows, from years of teaching, is the precondition for everything that matters about education — is thinning.
She is standing in the crucible: the space where the market's demand for performance collides with the science of how human beings actually learn. The market will not resolve this collision. The institutions, governed by the wrong metrics, will not resolve it. The students, captured by the metacognitive illusion that fluent production equals genuine understanding, will not resolve it.
The teacher must resolve it. In her classroom. With her students. One assignment at a time. One desirable difficulty at a time. One generation event at a time. Armed with forty years of evidence that says the hard path is the right path, and surrounded by an institutional and technological environment that makes the easy path irresistibly accessible.
That is the crucible. And the question it poses — whether a single educator, armed with evidence, can build structures that preserve learning in an environment optimized for performance — is a version of the question this entire analysis has been approaching. Whether the evidence will be enough. Whether the dams can be built. Whether the difficulties that produce understanding will survive in a world that has learned to produce without understanding.
The answer, as Bjork's four decades of research have demonstrated with painful clarity, depends entirely on whether the people who understand the science are willing to act on it — even when acting on it means choosing the path that feels worse, produces lower satisfaction scores, generates fewer polished artifacts, and requires explaining, over and over, to students and administrators and parents, that the struggle is not a failure of the educational system.
The struggle is the educational system. It always was.
The Berkeley researchers who embedded themselves in a two-hundred-person technology company for eight months discovered something that the productivity dashboards had missed entirely. AI-augmented workers were not simply doing more of the same work faster. They were doing different work — broader work, work that crossed role boundaries, work that filled every gap the efficiency had created. The researchers called the phenomenon "task seepage": AI-accelerated work colonizing previously protected cognitive spaces. Lunch breaks. Waiting rooms. The two-minute interval between meetings that had once served, invisibly, as a moment of cognitive rest.
Bjork's framework provides the mechanism that the Berkeley study measured but could not explain. The workers were performing more. Were they learning more? The question was outside the study's scope, but Bjork's four decades of evidence suggest a specific and uncomfortable answer.
Performance and learning respond to different conditions. Performance improves with practice that is massed, blocked, and supported by immediate feedback — precisely the conditions that AI-assisted work provides. Learning improves with practice that is spaced, interleaved, and punctuated by delays during which the learner must generate independent assessments. The Berkeley workers were operating under conditions optimized for performance. The learning conditions — the spacing, the interleaving, the generation events that build durable expertise — were being systematically displaced by the very productivity gains that made the workers feel more capable.
The feeling was real. The capability growth was uncertain.
The professional development implications extend across every knowledge-work domain. The junior developer who uses AI for every debugging task is developing rapidly as an AI operator — learning to prompt effectively, to evaluate AI-generated solutions, to iterate on outputs. These are genuine skills in a world where AI tools exist. But they are different skills from the ones that debugging practice would have built: the diagnostic reasoning, the pattern recognition across error types, the architectural intuition that comes from traversing one's own knowledge network thousands of times in the search for solutions.
The distinction matters because the two skill sets have different durability profiles, different transfer properties, and different value in the moments that define careers. The AI-operation skills are valuable when the tool is available. The debugging skills are valuable always — in interviews, in production emergencies, in architectural discussions, in mentoring conversations, in every context where the professional must reason independently about systems.
Bjork's storage-retrieval framework clarifies what is happening to the professional's cognitive architecture. The AI tool provides unlimited retrieval strength: any piece of information, any solution pattern, any analytical framework is accessible in seconds. This external retrieval strength is real and valuable. But it builds no storage strength in the user. The knowledge accessed through the tool is not encoded through the user's own cognitive processes. It is borrowed. And borrowed knowledge has a specific property that owned knowledge does not: it disappears when the lending institution closes.
The dependency is invisible during normal operations because the tool is always available. The developer who has used Claude for six months performs at a high level every day. The performance metrics confirm it. The code reviews pass. The features ship. The quarterly review is positive. Nothing in the feedback loop signals that the developer's independent capability — the capability she could deploy without the tool — may be stagnating or declining.
The signal appears only in the moments the tool is unavailable. The whiteboard interview. The production incident at two in the morning when the API is down. The architectural discussion with a colleague who needs her to reason about system design without consulting a machine. In these moments, the professional discovers what Bjork's research predicts: the retrieval strength belonged to the tool, and the storage strength was never built.
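The asymmetry can be made concrete with a toy calculation. What follows is a deliberately crude sketch, not a model Bjork proposes (the numbers are invented), but it encodes the qualitative claims above: retrieval strength decays between sessions, effortful retrieval while retrieval strength is low deposits storage strength, and a tool-supplied answer restores retrieval strength without depositing anything.

```python
def session(storage: float, retrieval: float, assisted: bool) -> tuple[float, float]:
    """One practice session; returns updated (storage, retrieval) strengths."""
    if assisted:
        # The tool answers instantly: retrieval strength is restored from
        # outside, and no storage strength is deposited in the user.
        return storage, 1.0
    # Unassisted generation: the storage gain scales with how far retrieval
    # strength had decayed. Harder retrieval, deeper encoding.
    effort = 1.0 - retrieval
    return storage + 0.2 * effort, min(1.0, retrieval + 0.5 * effort)

def simulate(assisted: bool, sessions: int = 20, decay: float = 0.3) -> float:
    storage, retrieval = 0.0, 1.0
    for _ in range(sessions):
        retrieval *= 1.0 - decay     # forgetting between sessions
        storage, retrieval = session(storage, retrieval, assisted)
    return storage

print(f"always assisted: storage strength = {simulate(True):.2f}")
print(f"struggle first:  storage strength = {simulate(False):.2f}")
```

Run it and the always-assisted track finishes every session at full retrieval strength with a storage strength of zero: performance without learning, in twenty lines.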
Organizations face a version of this problem at a structural level. The team that adopts AI tools sees an immediate productivity increase. Features ship faster. Coverage expands. The output metrics improve across every measurable dimension. The organization, evaluating by output, concludes that AI adoption is an unqualified success and expands its deployment.
What the organization does not measure is the learning trajectory of its workforce. Are the junior members developing toward senior capability? Are the mid-career professionals deepening their domain expertise? Is the organization accumulating the human capital — the judgment, the architectural intuition, the diagnostic reasoning — that cannot be stored in a tool and that differentiates a capable organization from a productive one?
These questions require measurement instruments that most organizations do not possess. Performance reviews evaluate output. Quarterly metrics evaluate throughput. Promotion decisions evaluate visible contribution. None of these instruments capture the distinction between a professional whose capability is growing and a professional whose performance is growing while capability stagnates — because performance, augmented by the tool, looks identical in both cases.
Bjork's framework suggests specific organizational interventions that would preserve the desirable difficulties through which professional expertise develops. Structured AI-free practice periods would function as cognitive training sessions — not punishment, not Luddism, but the professional equivalent of an athlete's unassisted workout. The sprinter who trains with resistance bands is not rejecting the sport. She is building the strength that makes her faster when the bands are removed. The developer who debugs without AI for a designated period each week is not rejecting the tool. She is building the storage strength that makes her more capable when the tool is available.
Progressive challenge frameworks would increase the complexity of unassisted tasks as competence grows. The junior developer begins with AI assistance on most tasks, with periodic unassisted challenges calibrated to her current level. As her independent capability develops — measured through dependency audits, not performance reviews — the ratio of unassisted to assisted tasks increases. The trajectory is not toward eliminating AI assistance but toward building a foundation of independent capability that makes AI assistance more productive rather than merely more efficient.
Evaluation systems redesigned around Bjork's performance-learning distinction would assess not only what the professional produces but what the professional understands. This requires evaluation instruments that go beyond output review: technical interviews conducted periodically without AI access, architectural discussions in which the professional must reason independently, diagnostic exercises in which the professional must identify and resolve problems using only their own knowledge base. These assessments measure storage strength — the dimension of capability that AI augments but does not build.
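A sketch can suggest how the progressive challenge framework and the dependency audit might mesh. The names and thresholds below are hypothetical, not an established protocol; the structure is the point: the audit measures independent capability, and the share of unassisted work ratchets against that measurement rather than against output.

```python
from dataclasses import dataclass

@dataclass
class ChallengePlan:
    unassisted_share: float = 0.10   # fraction of weekly tasks done without AI

def update_plan(plan: ChallengePlan, audit_score: float) -> ChallengePlan:
    """Adjust the unassisted share from the latest dependency audit.

    audit_score: 0..1 result of an unassisted diagnostic exercise
    (a hypothetical measure; the thresholds below are invented).
    """
    if audit_score >= 0.8:
        # Independent capability is solid at this level: raise the load.
        share = min(0.50, plan.unassisted_share + 0.05)
    elif audit_score < 0.5:
        # Capability lags the tool-augmented output: ease off so the
        # unassisted work stays difficult but achievable.
        share = max(0.05, plan.unassisted_share - 0.05)
    else:
        share = plan.unassisted_share
    return ChallengePlan(unassisted_share=share)
```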
Mentoring structures constitute perhaps the most important organizational intervention. The transfer of tacit knowledge — the judgment, the intuition, the pattern recognition that experienced professionals possess and struggle to articulate — occurs through a specific kind of interaction: slow, friction-rich, iterative conversation in which the mentor and the mentee wrestle together with problems that do not have clean solutions. AI can facilitate parts of this process. It cannot replace the interpersonal dynamic through which tacit knowledge is transmitted.
The senior engineer who sits with a junior colleague for an hour, working through a system design problem, asking questions rather than providing answers, letting the junior colleague struggle with options and trade-offs rather than resolving them — that senior engineer is creating desirable difficulty. The interaction feels less efficient than pointing the junior colleague toward an AI tool that will produce a working design in minutes. The interaction produces learning that the AI tool will not.
The economic tension is real and should not be minimized. Organizations that implement desirable-difficulty protocols will see short-term productivity costs. The developer who spends an hour debugging independently before consulting Claude produces less output that day than the developer who consults Claude immediately. The team that allocates time for unassisted practice sessions produces less total output that week than the team that operates at full AI-augmented capacity. The quarterly numbers will reflect the cost.
The return on the investment is measured in years, not quarters. The organization that invests in the storage strength of its workforce is investing in a capability that no tool can provide and no competitor can replicate: the judgment, the depth, the independent reasoning capacity that distinguishes a team of professionals from a team of tool operators. Segal, in *The Orange Pill*, describes this choice in economic terms — the decision to keep and grow the team rather than convert the twenty-fold productivity gain into headcount reduction. Bjork's framework provides the cognitive rationale: the team whose storage strength is growing is a team becoming more capable. The team whose retrieval strength is maintained entirely by the tool is a team that has outsourced its capability to a vendor.
The outsourcing is invisible in the performance metrics. It becomes visible only in the dependency audit — the moment when the tool is removed and the organization discovers what its people can actually do.
The professional's treadmill is this: the tool makes work feel better and more productive while potentially making the worker less independently capable. The treadmill runs faster. The runner looks stronger. The muscles may be atrophying underneath the performance — visible only when the machine stops and the runner must stand on the ground that was always there, waiting.
---
This analysis began with a paradox and arrives at a choice.
The paradox is clear: the conditions that produce the best immediate performance are often the conditions that produce the worst long-term learning, and the most powerful learning technology in human history is, by default, optimized for performance at the expense of learning. The evidence for this paradox is not speculative. It is the product of four decades of controlled experiments, replicated findings, and a theoretical framework that has survived every challenge the field of cognitive psychology has offered.
The choice is equally clear, though far harder to execute: whether a civilization that has built the most powerful ease-producing technology in its history will deliberately preserve the difficulties that built its expertise.
Segal, in *The Orange Pill*, proposes the metaphor of the beaver — the small, persistent creature that builds structures to redirect a river too powerful to stop. The dam does not fight the current. It shapes it, creates pools where life can take root, transforms the raw force of the water into an ecosystem that supports hundreds of species that could not survive in the unimpeded flow. The beaver builds and maintains, day after day, stick by stick, because the river never stops pushing.
Bjork's research specifies what the dam must be made of. Not ideology. Not nostalgia for a pre-digital world. Not the philosopher's garden, however admirable its contemplative quiet. The dam must be built from evidence — from the specific, replicated, empirically grounded findings about how human beings actually learn, actually retain, actually develop the flexible and transferable expertise that no tool can substitute for and no society can afford to lose.
The materials are known. Spacing — the distribution of practice across time, with gaps that allow retrieval strength to decay and force the effortful retrieval that builds storage strength. Interleaving — the mixing of problem types that forces discrimination and produces the flexible categorization that blocked practice never develops. Generation — the requirement that the learner produce before receiving, ensuring that the cognitive traversal which builds associative networks is not bypassed by the fluency of AI-provided solutions. Delayed feedback — the gap between action and evaluation that forces self-assessment and builds the metacognitive accuracy without which self-directed learning is impossible.
These are not principles. They are engineering specifications. They tell us what must be preserved, where it must be preserved, and what happens when it is not. The bridge between laboratory findings and civilizational practice is long, and the translation is not straightforward. But the direction is unambiguous: AI tools must be designed, deployed, and used in ways that preserve the cognitive operations through which human expertise develops.
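A sketch can suggest what the specification looks like as code. The intervals below are illustrative assumptions, not values drawn from the literature; what the evidence supports is the shape: gaps that expand when retrieval succeeds, and due items from different topics mixed together rather than practiced in blocks.

```python
import random
from dataclasses import dataclass

@dataclass
class Item:
    topic: str
    interval_days: int = 1      # current spacing gap
    due_in: int = 0             # days until the next retrieval attempt

def tick(items: list[Item]) -> list[Item]:
    """Advance one day and return today's interleaved practice set."""
    due = []
    for item in items:
        item.due_in -= 1
        if item.due_in <= 0:
            due.append(item)
    random.shuffle(due)         # interleave: mix topics, never block them
    return due

def record(item: Item, retrieved: bool) -> None:
    """Expand the gap when retrieval succeeds; shrink it when it fails."""
    item.interval_days = item.interval_days * 2 if retrieved else 1
    item.due_in = item.interval_days

deck = [Item("recursion"), Item("sql joins"), Item("regex")]
for item in tick(deck):         # day one: everything is due, in mixed order
    record(item, retrieved=True)
```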
The specification has three layers: tool design, institutional policy, and individual practice.
At the tool-design layer, the specification calls for AI systems that incorporate desirable difficulty by default rather than eliminating it. Generate-first protocols that require user attempts before providing solutions. Partial scaffolding that calibrates assistance to the user's developmental level. Spaced-assistance features that introduce productive gaps. Confidence-calibration interfaces that train metacognitive accuracy. Dependency audits that periodically measure independent capability. These features are technically feasible. They have been demonstrated in research prototypes. They await only the commercial will to prioritize long-term human development alongside short-term user satisfaction.
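A minimal sketch of the first of those features, the generate-first protocol, might look like this. Everything named here is hypothetical (`ask_model` stands in for whatever assistant API a real tool would wrap; no product exposes this interface), but the structural idea is simple: the user's own attempt is a required argument, so the generation event cannot be skipped.

```python
from datetime import datetime, timezone

ATTEMPT_LOG: list[dict] = []

def ask_model(prompt: str) -> str:
    """Placeholder for a real assistant call (an assumption, not a real API)."""
    return f"<model answer to: {prompt!r}>"

def generate_first(problem: str, user_attempt: str) -> str:
    """Refuse to answer until the user has produced an attempt of her own."""
    if len(user_attempt.split()) < 30:
        raise ValueError(
            "Write a rough approach first (a few sentences is enough); "
            "the attempt is where the encoding happens."
        )
    ATTEMPT_LOG.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "problem": problem,
        "attempt": user_attempt,    # raw material for a later dependency audit
    })
    return ask_model(
        f"Problem: {problem}\nMy attempt: {user_attempt}\nCritique and improve."
    )
```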
The commercial tension is genuine, and dismissing it would be dishonest. The market selects for ease. A tool that delays its response, provides partial answers, periodically withdraws, and asks users to rate their own confidence will lose market share to a tool that responds instantly, provides complete solutions, is always available, and affirms the user's sense of mastery. Every design choice that respects the learning science is a design choice that the market punishes. This is the structural version of the paradox — the conditions that produce the best long-term outcomes for the user are the conditions that produce the worst short-term metrics for the product.
The resolution at this layer requires either regulation that mandates learning-preserving features in educational and professional AI tools, or a market shift in which users, informed by the evidence, choose tools that invest in their development rather than tools that merely enhance their performance. The first is feasible but slow. The second is desirable but depends on a level of metacognitive sophistication that the tools themselves tend to erode.
At the institutional layer, the specification calls for organizations — companies, schools, universities, professional training programs — to build desirable-difficulty structures into their AI deployment policies. Structured AI-free practice periods. Progressive challenge frameworks. Evaluation systems that measure independent capability alongside AI-augmented output. Mentoring structures that preserve the friction-rich transfer of tacit knowledge. Dependency audits administered not punitively but diagnostically.
These structures require institutional leaders who understand the performance-learning distinction and who are willing to accept short-term productivity costs in exchange for long-term capability development. The leader who insists on dependency audits will face resistance from employees who find them threatening, from managers who see them as overhead, and from boards that evaluate the organization on quarterly output. The resistance is predictable and must be addressed not with authority but with evidence — the same evidence that Bjork has been presenting to educators for forty years, now reframed for organizational leaders who face the same paradox in different terms.
At the individual layer, the specification calls for a form of cognitive self-governance that is harder than any institutional policy because it requires choosing difficulty when ease is always available. The developer who pauses before prompting Claude, who spends fifteen minutes generating her own approach before consulting the tool, who tracks her independent capability over time and adjusts her AI usage when the trajectory points downward — this developer is building her own dam, stick by stick, against a current that never stops.
The individual choice is the hardest because it receives no external reinforcement. No organization rewards the developer for her private fifteen minutes of struggle. No performance review captures the storage strength she built by generating before delegating. No quarterly metric reflects the metacognitive accuracy she developed by calibrating her confidence against her actual independent performance. The choice to struggle is a choice made in solitude, for benefits that are invisible to everyone except the person who makes the choice — and often invisible even to her, because the benefits compound slowly, over months and years, while the cost is immediate and felt with every effortful retrieval attempt.
And yet the evidence says this is the choice that matters. Not the institutional policy, though policies help. Not the tool design, though design matters. The individual's willingness to choose the harder path — to override the metacognitive signal that says ease is evidence of learning, to resist the fluency that feels like understanding but is not, to trust the science when the science contradicts every intuition the brain produces — is the unit of the dam. Each individual choice to generate before receiving, to struggle before delegating, to space practice rather than mass it, is one stick placed against the current. The accumulation of those choices, across millions of individuals, is what determines whether the river produces an ecosystem or a flood.
Bjork has been asking students to make this choice for four decades. In classrooms at UCLA, in lectures around the world, in papers and books and interviews, the message has been consistent: the easy path feels right and works poorly; the hard path feels wrong and works well. Most students have not listened. The easy path is too seductive, the metacognitive illusion too convincing, the short-term reward too immediate.
But the stakes have never been this high. The difference between the easy path and the hard path, in the age of AI, is not the difference between an A and a B on next week's exam. It is the difference between a generation that develops expertise and a generation that develops dependency. Between professionals who understand their domains and professionals who operate tools that understand for them. Between a civilization that uses the most powerful cognitive technology in its history to become genuinely more capable and a civilization that uses it to feel more capable while the capability quietly migrates from the person to the machine.
The evidence is clear. The paradox is documented. The mechanism is understood. The engineering specifications are written. What remains is the choice — made not once but daily, not by a committee but by each person who sits down at a screen and decides whether to struggle first or delegate immediately.
The evidence says struggle first. The market says delegate. The metacognitive monitor says delegation feels like mastery.
The monitor is wrong. The market is measuring the wrong thing. The evidence has been right for forty years.
The dam is built one stick at a time, by anyone willing to trust the science over the feeling. The river does not wait. The river does not care whether the dam holds or breaks. The river carries whatever the current brings — shallow fluency or deep understanding, performance or learning, the feeling of knowing or the reality of it.
The sticks are there. The evidence says where to place them. The hands that build the dam are yours.
---
The experiment I kept running on myself produced results I did not want.
In the middle of writing *The Orange Pill*, I described a night somewhere over the Atlantic — writing for hours, unable to stop, the exhilaration having drained away and the grinding compulsion taking its place. I knew, in that moment, that I was confusing productivity with aliveness. I described it honestly because the book demanded honesty.
What I did not understand then, and what Bjork's research made painfully clear, is the mechanism by which that compulsion operates on cognition itself. I was not simply working too long. I was working in a mode that felt like deep understanding because the prose was flowing, because Claude was producing connections I had not seen, because the output was accumulating at a rate that my metacognitive monitor interpreted as evidence of genuine intellectual work.
The fluency was real. The feeling of mastery was real. And Bjork's forty years of experiments say, with the weight of replicated evidence, that real fluency producing real feelings of mastery can coexist with shallow encoding — with understanding that lives on the surface and has not been deposited, layer by layer, through the struggle that builds storage strength.
I think about my engineer in Trivandrum — the one who, months after we removed the "plumbing" from her daily workflow, realized she was making architectural decisions with less confidence. She could not explain why. Bjork explains why. The plumbing contained desirable difficulties. Not the tedium — that was genuinely unproductive and good riddance. But woven into the tedium were the spacing effects of returning to configuration problems across days, the interleaving of different system types, the generation events of constructing solutions from her own knowledge rather than receiving them from a tool. Those ten minutes of formative struggle hidden inside four hours of drudgery were where the storage strength was being built. The tool removed both. She noticed only the relief.
This is the finding I cannot stop turning over. Not all friction is productive. But the productive friction is often invisible, embedded in the drudgery, indistinguishable from waste until it has been removed and its absence manifests as a mysterious loss of confidence in decisions that used to feel sure.
I do not intend to stop using Claude. That would be precisely the wrong lesson to draw from this work. The productivity gains are real. The capability expansion is genuine. The things my team builds now could not have existed twelve months ago. But I have started doing something that Bjork's research says matters: I struggle first. Before I open the conversation with Claude, I sit with the problem. I write my own rough version. I generate my hypothesis, my outline, my approach — incomplete, messy, often wrong — and I let the wrongness do its cognitive work before I hand the problem to a system that will solve it fluently.
Fifteen minutes. That is the cost. Fifteen minutes of private struggle that produces an inferior preliminary result. The quarterly review will never see those fifteen minutes. No productivity metric captures them. They are invisible to every measurement system that governs the modern economy.
But those fifteen minutes are where the storage strength is built. They are the generation event. The retrieval attempt. The traversal of my own knowledge network that deposits another thin layer of understanding. They are the stick placed against the current.
I think about my son's question at dinner — whether AI was going to take everyone's jobs. I told him the truth: the jobs will evolve. They will ascend. But I did not tell him the harder truth that Bjork's work demands: the ascent does not happen automatically. It happens only through the engagement with difficulty that produces genuine capability. Skip the difficulty, and you skip the ascent. The job title changes. The expertise does not form.
When I tell my children to care about their questions more than their answers, I am saying something that sounds philosophical but is, I now understand, grounded in the hardest cognitive science available. The question is a generation event. It forces the mind to map the boundary between what it knows and what it does not. That mapping is itself a learning event of the highest order. The answer, received from a machine, is a reception event. It fills the space that the question opened — but the filling does not build what the opening built.
Bjork has been right for forty years. The hard path works better. The easy path feels better. And the distance between those two sentences is the distance that will determine whether the most powerful cognitive technology in human history makes us genuinely more capable or merely more productive.
I know which I am choosing. Fifteen minutes at a time.
— Edo Segal
The AI revolution promises frictionless productivity — instant answers, effortless code, polished output at the speed of thought. But four decades of cognitive science reveal a devastating paradox at the heart of that promise: the ease that makes us feel most capable may be making us least so.
Robert A. Bjork's research on desirable difficulties — the finding that struggle, spacing, and effortful retrieval are not obstacles to learning but the mechanisms through which durable expertise is built — is the most consequential science the technology industry has never absorbed. In this installment of *The Orange Pill* series, Edo Segal examines what Bjork's evidence means for a civilization building its future on tools designed to eliminate every form of cognitive friction.
When the amplifier removes the resistance that shaped the signal, what happens to the signal? The answer is not theoretical. It is the most urgent practical question facing every builder, educator, parent, and professional navigating the age of AI.

A reading-companion catalog of the 20 Orange Pill Wiki entries linked from this book — the people, ideas, works, and events that *Robert A. Bjork — On AI* uses as stepping stones for thinking through the AI revolution.
Open the Wiki Companion →