CONCEPT

Wordsmiths in the Dark

Fei-Fei Li's characterization of large language models as eloquent but ungrounded—systems that have ingested enormous quantities of human text and can recombine it with startling facility, but that have never perceived a physical world or closed the perception-action loop that living creatures use to ground meaning.

The phrase is Fei-Fei Li's and its sharpness comes from precision, not provocation. A wordsmith is a maker of language, someone with genuine skill over the medium; in the dark means without the embodied experience of a world to which the words refer. Large language models are, by Li's analysis, exactly this: extraordinarily capable at the surface of language—at producing, recombining, and completing text in ways that match or exceed human fluency—while fundamentally lacking the experiential grounding that gives language its connection to reality in living creatures. A child learning the word "staircase" learns it alongside the falls, the effort of climbing, the proprioceptive memory of where the next step is; she learns it in a world that pushes back. A language model learns the word from its co-occurrences with other words in text: it knows everything that has been written about staircases and cannot, from that knowledge alone, derive the embodied understanding that makes a toddler, after a few falls, reliable on stairs. The wordsmith has the names. The dark is the absence of the things the names name—the physical, three-dimensional, causally structured world that spatial intelligence navigates and that language, in humans, has always been grounded in. Li's entire research program at World Labs can be read as the project of turning the lights on—of building systems that do not just predict text but perceive, model, and act within physical space.

In the [YOU] on AI Field Guide

The cycle that [YOU] on AI opens is centrally concerned with the gap between what AI systems appear to do and what they actually do—the decorrelation of fluency from authority that makes contemporary AI so easy to overtrust. Wordsmiths in the dark is Li's formulation of that gap from the perspective of embodiment. The systems that produce such compelling language have done so by learning from text, and text is a shadow of the world, not the world itself. The shadow can be extraordinarily detailed. It remains a shadow. A system trained on shadows cannot, from shadows alone, derive what cast them.

The concept connects Li's technical work to a philosophical tradition that reaches back to the symbol grounding problem: the question of how symbols, which are purely formal objects, acquire their connection to the things they refer to. For living creatures the grounding is experiential and embodied: the word "red" is anchored, in the end, by the experience of seeing red, which no amount of text can substitute for. For a language model, the anchoring is purely statistical: "red" is defined by its co-occurrences with "fire," "blood," "stop," and a million other words. The system can navigate the statistical structure with remarkable skill. Whether that navigation constitutes anything like understanding is exactly the question the wordsmith-in-the-dark image is designed to hold open.
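
The statistical anchoring described above can be made concrete with a toy sketch. It is an illustration only, not how any real language model is trained: the corpus, the function name `cooccurrences`, and the sentence-level window are all assumptions introduced here. It shows that the "meaning" of "red" available from text alone is a profile of neighboring words.

```python
from collections import Counter

# Hypothetical toy corpus; the sentences are illustrative, not from any dataset.
CORPUS = [
    "the red fire burned",
    "red blood on the floor",
    "stop at the red light",
    "the green light means go",
]

def cooccurrences(target, sentences):
    """Count the words that appear in the same sentence as `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            # Every other word in the sentence counts as a neighbor.
            counts.update(w for w in words if w != target)
    return counts

# The profile of "red" is nothing but counts of other words.
red_profile = cooccurrences("red", CORPUS)
print(red_profile.most_common(3))
```

Nothing in `red_profile` refers to color as seen. The profile relates "red" to "fire," "blood," and "stop," which is the kind of structure the entry calls a shadow of the world.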

Origin

The image crystallizes from Li's lifelong attention to the difference between what machines can do and what they understand. After the 2012 ImageNet breakthrough showed that deep networks could recognize objects in images far more accurately than earlier methods, and successor systems went on to exceed reported human accuracy on certain benchmarks, Li pressed the question of what recognition actually meant. A system that labels a photograph's contents is not understanding the scene in the sense a human viewer does: it is not grasping the causal structure, the affordances, or the narrative that a human brings to the same image. The machines had become superb at the surface and left the interior untouched.

The wordsmith-in-the-dark formulation arrives in Li's mature work as a diagnosis of why language model fluency, however impressive, does not settle the question of machine understanding. It draws on an evolutionary argument: intelligence, in her account, did not begin with language but with sensing and acting—with the perception-action loop that drove biological cognitive evolution for hundreds of millions of years before anything like language appeared. A system that enters the cognitive picture at the level of language, without passing through the evolutionary substrate of embodied perception and action, is starting very late in the story and missing the foundation that makes language meaningful in the first place.

Key Ideas

Fluency without grounding. The wordsmith-in-the-dark image captures a specific and consequential structure: a system can be arbitrarily fluent, producing grammatically perfect, contextually appropriate, stylistically sophisticated text, without being grounded in the world the text refers to. On Li's account, this is not a limitation of current systems that more training data will fix; it is a structural feature of learning from text alone. The symbol grounding problem does not dissolve when the symbol-manipulating system becomes very large.

The dark as absence of the perception-action loop. What is missing in the dark is not merely sensory input—some multimodal systems receive images and audio—but the full perception-action loop: the ongoing cycle of perceiving a physical situation, acting on it, perceiving the consequences of the action, and updating a model of the world accordingly. This loop is, in Li's analysis, the evolutionary engine of intelligence and the source of embodied understanding. A system that receives perceptual inputs without taking actions in a physical world is not closing the loop; it is receiving shadows of a causal structure it never directly contacts.
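
The cycle described above (perceive, act, perceive the consequences, update) can be sketched as a minimal loop. This is a toy under stated assumptions, not Li's method or any World Labs system: the `World` class, the `run_loop` function, the one-dimensional block, and the simple error-correction update are all hypothetical and chosen only to make the loop visible.

```python
class World:
    """Hypothetical 1-D world: each push moves a block by a fixed, hidden distance."""

    def __init__(self, step=2.0):
        self._step = step      # hidden from the agent
        self.position = 0.0    # observable state

    def push(self):
        self.position += self._step


def run_loop(world, rounds=5, learning_rate=0.5):
    """Learn the displacement per push by acting and observing consequences."""
    estimate = 0.0  # the agent's world model: believed displacement per push
    for _ in range(rounds):
        before = world.position              # perceive the situation
        predicted = before + estimate        # predict the outcome of acting
        world.push()                         # act on the world
        after = world.position               # perceive the consequence
        error = after - predicted            # compare outcome with prediction
        estimate += learning_rate * error    # update the model
    return estimate


print(run_loop(World(step=2.0)))
```

The estimate moves toward the hidden step size only because the agent's own action produces the observation it learns from. A system that is merely shown recorded positions, without acting, never runs the `world.push()` line, which is the sense in which the loop stays open.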

Spatial intelligence as the light. Li's proposed solution is not to make language models bigger but to develop spatial intelligence—generative systems that can produce and reason about consistent three-dimensional worlds and that learn, through the perception-action loop, the causal structure of physical space. A system that can genuinely navigate, manipulate, and predict the behavior of a physical world has turned the light on in the way that a wordsmith alone cannot. Whether spatial grounding is sufficient for the kind of understanding that language models lack, or whether further dimensions of biological embodiment are required, remains an open research question that Li's work at World Labs is designed to press.

Debates & Critiques

The wordsmith-in-the-dark diagnosis provokes two main counterarguments. The first holds that text is a richer encoding of the world than Li's image suggests: language, accumulated over millennia of human experience, contains an enormous quantity of implicit structural knowledge about physical causality, spatial relationships, and affordances. A system trained on sufficient text may derive a functional model of the world that is adequate for many purposes, even without direct embodied experience. The second, stronger counterargument holds that emergent capabilities in very large models may include forms of grounding that were not explicitly trained—that scale itself turns on some of the lights. Li's response to both is empirical: the characteristic failures of current systems—brittleness about physical plausibility, inability to reason reliably about spatial relationships, the confabulation of physical details that any embodied creature would get right—suggest that the dark remains substantially dark, and that the path to illuminating it runs through the embodied perception-action loop rather than through further scaling of text prediction.
