CONCEPT

The Evaluation Gap

Boden’s implied hierarchy of creative judgment—functional, aesthetic, and directional—and the growing distance between the machine’s generative range and the human evaluator’s capacity to assess what the machine finds.

When a machine can generate creative candidates faster and across wider conceptual terrain than any human, the critical question shifts from production to judgment: who evaluates what the machine finds, and with what depth of understanding? Boden’s taxonomy of exploratory, combinational, and transformational creativity implies a matching three-level hierarchy of evaluation: the functional level (does the output work against specified criteria?), the aesthetic level (is it excellent, not merely adequate?), and the directional level (should this exist at all, and is this the right problem?). Each level requires a qualitatively different kind of understanding, and each is built through forms of personal engagement that devices are structurally designed to eliminate. The evaluation gap names the growing asymmetry: the machine’s combinational range now exceeds any individual human’s evaluative range, meaning the machine can connect domains the human cannot properly evaluate. This gap does not make the machine-human collaboration less productive; it makes the human’s evaluative depth—built through the friction of focal practice—the most critical and most threatened resource in the AI transition.

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI surfaces the evaluation gap most vividly in two diagnostic episodes from Segal’s own practice. In the first, Claude produced a passage connecting Csikszentmihalyi’s flow state to Gilles Deleuze’s concept of “smooth space” as the terrain of creative freedom. The passage was rhetorically elegant. Segal read it twice, liked it, and moved on. The next morning, something nagged. He checked: Deleuze’s concept of smooth space had almost nothing to do with how Claude had used it. The passage passed functional evaluation (it was well-formed and internally consistent) and appeared to pass aesthetic evaluation (it sounded like insight). But it failed the evaluative test that required actual philosophical knowledge of Deleuze—knowledge Segal possessed. The gap was visible. In domains where his knowledge was thinner, it would not have been.

In the second episode, Claude produced a passage about democratization that Segal almost accepted. He stopped short when he realized he could not tell whether he actually believed the argument or merely liked how it sounded. “The prose had outrun the thinking.” He deleted it and spent two hours at a coffee shop with a notebook, writing by hand until he found the version that was his: rougher, more qualified, more honest about what he did not know. This is directional evaluation operating in real time: not the assessment of a candidate against specified criteria, but the assessment of the criteria themselves.

The cycle’s treatment of the ten minutes of formative surprise embedded in four hours of plumbing illuminates why the evaluation gap compounds over time. Those moments were not valuable for what they produced at the time. They were valuable because they were the deposition events that built the practitioner’s evaluative capacity at the aesthetic level—the intuitive sense of what is right that can only be built through accumulated personal exploration of a space. When Claude eliminated the four hours, it eliminated the ten minutes as well. The practitioner’s generative and functional performance improved immediately. Her evaluative depth eroded gradually, invisibly, until months later she realized she was making architectural decisions with less confidence and could not explain why.

Origin

The evaluation gap is not a concept Boden named explicitly but one that her taxonomy implies with logical force. Her 1990 analysis of Harold Cohen’s AARON program identified the fundamental asymmetry: Cohen, who had spent decades developing his own evaluative sense as a painter, could judge whether AARON’s outputs were excellent or merely adequate. AARON could not. It generated candidates; he evaluated them. The quality of the collaborative output depended entirely on the depth of Cohen’s evaluative capacity—built through years of focal engagement with painting that the program had not undergone and could not have undergone.

The gap became more consequential as the machine’s combinational range expanded beyond any individual human’s evaluative reach. A program trained on one aesthetic domain requires its human collaborator to be expert in that domain. A program trained on the entire documented range of human knowledge can connect domains that no individual expert commands simultaneously. The human can evaluate the connections in domains where she is expert. In other domains, she may lack the knowledge to detect the seam where a plausible-sounding combination breaks. The machine’s range has grown faster than any individual’s evaluative reach can plausibly expand.

The three levels of evaluation emerge naturally from Boden’s taxonomy. Functional evaluation corresponds to the evaluative component of exploratory creativity: checking whether an output occupies a position the rules of the conceptual space permit. Aesthetic evaluation corresponds to the evaluative component of combinational creativity: judging whether a connection is genuinely illuminating rather than merely plausible-sounding—which requires the taste that can only be built through extensive personal engagement with the relevant domains. Directional evaluation corresponds to the evaluative component of transformational creativity: questioning whether the framework within which one is working is adequate at all.

Key Ideas

Three levels of evaluation. Functional evaluation checks whether an output meets specified criteria: does the code compile, does the text communicate its intended meaning, does the design satisfy stated requirements? AI systems are increasingly capable at this level and can serve as their own first-order evaluators. Aesthetic evaluation assesses quality beyond specification: is the output excellent, not merely adequate, genuinely illuminating rather than merely plausible? This requires taste, the embodied sense of excellence built through extensive personal exploration of a domain. Directional evaluation asks whether the framework itself is adequate: is this the right problem, is this the right space, should this exist at all? This is the evaluative component of transformational creativity and is the level furthest from what any current machine can do.

Evaluative capacity requires personal exploration. The aesthetic evaluation of creative output in a domain requires understanding that can only be built through extensive personal exploration of the domain’s conceptual space. The practitioner who has explored a space, navigating its resistances, encountering its failure modes, and developing intuitions about its topology through friction-rich engagement, can recognize both excellent and hollow outputs. The practitioner who has only received the outputs of a machine’s search of the space cannot recognize them in the same way, because the understanding required for evaluation is built through exploration, not through the receipt of results. This is how evaluative understanding accumulates, much as geological strata do: layer by layer, through the specific resistance of a domain that does not do what was expected.

The gap compounds. The evaluation gap is self-reinforcing. As practitioners delegate more of their exploratory work to machines, the deposition of evaluative capacity slows. As evaluative capacity weakens, practitioners become more dependent on machine evaluation at the aesthetic level, accepting what the machine produces because they lack the depth to distinguish excellent from merely adequate. As this dependence deepens, the incentive to maintain personal exploratory practice diminishes further. The machine’s generative range continues to expand while the human’s evaluative reach atrophies. Ascending friction, the claim that AI relocates difficulty to a higher cognitive floor where the evaluative demands are more consequential, offers a partial antidote. The antidote works only for those who actually engage with the higher-level friction.

Directional evaluation is irreducibly human. The highest level of the evaluation hierarchy—whether the problem is the right problem, whether the space should be transformed—requires having purposes of one’s own, caring about whether those purposes are adequate, and being able to imagine alternatives that do not yet exist. This is the evaluative analogue of transformational creativity, and it presupposes a kind of agency that current machines do not possess. The builder who asks “Should this be built at all?” is performing directional evaluation. The machine will help build whatever it is asked to build. The asking, and the quality of the asking, is the distinctly human contribution.
