The performance-learning distinction is the most important conceptual separation in Bjork's framework and the one most systematically violated by contemporary evaluation systems. Performance is observable behavior during training—accuracy, speed, output quality, visible competence under the conditions in which practice occurs. Learning is the change in long-term capability that supports retention after delays and transfer to novel contexts. The two can align, but under many conditions they diverge or invert: practices that maximize performance (massed study, blocked problems, immediate feedback, easy retrieval) often minimize learning, while practices that maximize learning (spaced study, interleaved problems, delayed feedback, effortful retrieval) often suppress performance during training. This inversion produces a systematic evaluation error: if you measure only performance, you are measuring the wrong thing when your goal is learning. AI tools are performance-maximizing engines that operate without regard for learning consequences, because learning cannot be measured in the moment and performance can.
The distinction is not merely academic; it is the explanation for why educational and organizational systems consistently adopt practices that the evidence shows to be ineffective. Teachers evaluate students on tests administered immediately after study—performance measures that favor massed practice and penalize the spacing that produces better long-term learning. Organizations evaluate workers on quarterly output—performance metrics that favor AI-augmented throughput and ignore the storage-strength development that determines independent capability years later. The evaluation systems are not wrong about what they measure; they are measuring the wrong thing if the goal is durable expertise rather than current productivity. The short-term metric is easier to capture, more directly observable, and more satisfying to stakeholders. The long-term metric requires delayed testing, comparison to independent baselines, and the patience to wait for benefits that compound slowly across months and years.
AI's impact on this distinction is to widen the gap between the two measures to an unprecedented degree. Before AI, the correlation between performance and learning was imperfect but positive: the student who performed well on the immediate test generally learned more than the student who performed poorly. AI breaks this correlation. A student using ChatGPT can produce an essay that performs excellently—well-structured, fluent, accurate—while learning nothing, because the tool performed all the cognitive operations (generation, organization, articulation) through which learning occurs. The performance is real; the learning is absent. And because institutions measure performance almost exclusively, the absence of learning produces no institutional signal. The student passes. The grade is recorded. The curriculum moves forward. The learning never happened.
The developer using Claude Code is subject to the same dissociation at industrial scale. Her performance—measured in features shipped, bugs resolved, code reviews passed—is excellent and accelerating. Her learning—the accumulation of diagnostic patterns, architectural intuitions, and debugging heuristics that would support independent reasoning when the tool is unavailable—may be stagnating or declining. The performance metrics that govern her career capture the first trajectory and miss the second entirely. The gap is invisible until the dependency audit: the moment when the tool is removed and independent capability is assessed. Only then does the distinction between what she can do with the tool and what she has learned become measurable.
The prescription emerging from this distinction is a measurement reform more fundamental than any pedagogical intervention: evaluate learning, not merely performance. This requires delayed testing without AI assistance, comparison of independent capability before and after AI adoption, and longitudinal tracking of storage strength separate from retrieval strength. The measurement is harder, slower, and less satisfying to administer than a performance metric. It also captures the dimension that determines whether AI users are becoming genuinely more capable or merely more productive—a distinction with different implications for individuals, organizations, and civilizations. The first produces augmented humans. The second produces human-tool hybrids whose capability exists only in the coupling and dissolves when the coupling breaks.
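One way to make the reform concrete is a small audit script. The sketch below is an illustration, not a validated instrument: the record fields, the function name, and the equal weighting of skills are all assumptions. It assumes three hypothetical scores per learner (an AI-assisted score observed during training, a tool-free score on a delayed test, and a tool-free baseline from before AI adoption) and reports the gap between visible performance and durable learning, plus the change in independent capability since adoption.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class AssessmentRecord:
    """One learner's scores on a single skill (field names are hypothetical)."""
    assisted_training_score: float    # performance: observed during AI-assisted work
    delayed_independent_score: float  # learning: tool-free test after a retention interval
    baseline_independent_score: float # tool-free score before AI adoption

def performance_learning_gap(records: list[AssessmentRecord]) -> dict[str, float]:
    """Summarize how far visible performance diverges from durable learning.

    A large positive gap means output looks strong while delayed, independent
    capability lags; a negative delta vs. baseline means independent capability
    has declined since AI adoption.
    """
    performance = mean(r.assisted_training_score for r in records)
    learning = mean(r.delayed_independent_score for r in records)
    baseline = mean(r.baseline_independent_score for r in records)
    return {
        "mean_assisted_performance": performance,
        "mean_delayed_independent": learning,
        "performance_learning_gap": performance - learning,
        "learning_delta_vs_baseline": learning - baseline,
    }

if __name__ == "__main__":
    cohort = [
        AssessmentRecord(0.92, 0.55, 0.60),
        AssessmentRecord(0.88, 0.62, 0.58),
        AssessmentRecord(0.95, 0.48, 0.63),
    ]
    print(performance_learning_gap(cohort))
```

The numbers are invented; the only substantive point the sketch encodes is that performance and learning are tracked as separate quantities, so the gap between them becomes an observable rather than an invisible failure.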
The distinction has been implicit in learning research since the behaviorist era—the difference between acquisition and retention was always recognized. But the explicit theoretical elaboration and empirical demonstration of their frequent opposition emerged through the desirable-difficulties research program of the 1980s and 1990s. Bjork formalized the distinction in numerous papers, showing that experimental conditions could be systematically manipulated to maximize one measure while minimizing the other, and that educators and learners consistently optimized for performance at the expense of learning because performance was what they could observe and learning was not.
Two dissociable measures. Performance (current capability under training conditions) and learning (durable capability supporting retention and transfer) are distinct dimensions that can move independently and often move in opposite directions.
Conditions often invert the relationship. Practices maximizing performance (massed, blocked, immediate feedback) frequently minimize learning; practices maximizing learning (spaced, interleaved, delayed feedback) frequently suppress training performance.
Institutions measure performance almost exclusively. Schools, workplaces, and markets evaluate observable current output, systematically underweighting the learning dimension that determines capability months and years later.
AI widens the gap catastrophically. Tools that maximize user performance while potentially eliminating the cognitive operations through which learning occurs can produce unprecedented divergence between the two measures—high output, low understanding.
Delayed testing without tools is the honest measure. The only reliable assessment of learning is independent performance after a retention interval, revealing whether capability persists when the tool is removed; a minimal scheduling sketch follows below.
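To illustrate what a retention interval might look like in practice, here is a minimal scheduling sketch. The delay values, function names, and the crude retention ratio are assumptions chosen for illustration; the substantive claim is only that learning is assessed tool-free at fixed delays after the original work, not at the moment of production.

```python
from datetime import date, timedelta

# Hypothetical retention-check schedule: re-test each skill, tool-free,
# at increasing delays after the original work was completed.
RETENTION_DELAYS = [timedelta(days=7), timedelta(days=30), timedelta(days=90)]

def retention_check_dates(completed_on: date) -> list[date]:
    """Dates on which to re-test a skill without AI assistance."""
    return [completed_on + delay for delay in RETENTION_DELAYS]

def retention_ratio(scores_by_delay: dict[int, float]) -> float:
    """Crude retention summary: fraction of the earliest score still present
    at the longest delay (scores keyed by days elapsed since the work)."""
    earliest = scores_by_delay[min(scores_by_delay)]
    latest = scores_by_delay[max(scores_by_delay)]
    return latest / earliest if earliest else 0.0

if __name__ == "__main__":
    print(retention_check_dates(date(2024, 3, 1)))
    print(retention_ratio({0: 0.90, 7: 0.70, 90: 0.55}))  # roughly 0.61 retained
```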