
The cycle that [YOU] on AI opens is centrally concerned with the gap between what AI systems appear to do and what they actually do: the decorrelation of fluency from authority that makes contemporary AI so easy to overtrust. "Wordsmiths in the dark" is Li's formulation of that gap from the perspective of embodiment. The systems that produce such compelling language have done so by learning from text, and text is a shadow of the world, not the world itself. The shadow can be extraordinarily detailed. It remains a shadow. A system trained on shadows cannot, from shadows alone, derive what cast them.
The concept connects Li's technical work to a philosophical tradition that reaches back to the symbol grounding problem: the question of how symbols, which are purely formal objects, acquire their connection to the things they refer to. For living creatures the grounding is experiential and embodied. The word "red" is anchored, in the end, by the experience of seeing red, for which no amount of text can substitute. For a language model, the anchoring is purely statistical: "red" is defined by its co-occurrences with "fire," "blood," "stop," and a million other words. The system can navigate the statistical structure with remarkable skill. Whether that navigation constitutes anything like understanding is exactly the question the wordsmith-in-the-dark image is designed to hold open.
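The statistical anchoring described above can be made concrete with a minimal sketch. It assumes a tiny invented corpus and a deliberately crude representation in which each word is defined only by counts of the words that share a sentence with it; the names `CORPUS`, `cooccurrence_vector`, and `cosine` are hypothetical and chosen for illustration. Real language models learn far richer representations, so this illustrates the principle and does not describe how any particular system works.

```python
from collections import Counter
from math import sqrt

# Toy corpus, invented for illustration only.
CORPUS = [
    "red fire burns hot",
    "red blood flows",
    "red light means stop",
    "crimson blood flows",
    "crimson fire burns",
    "green light means go",
]

def cooccurrence_vector(word, corpus):
    """Count the words that share a sentence with `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

red = cooccurrence_vector("red", CORPUS)
crimson = cooccurrence_vector("crimson", CORPUS)
green = cooccurrence_vector("green", CORPUS)

# "crimson" lands closer to "red" than "green" does, purely because
# the two words appear in similar textual contexts.
print(cosine(red, crimson) > cosine(red, green))
```

In this sketch "crimson" ends up near "red" without the program ever having perceived a colour. The similarity is real and useful, but it is a fact about the text, which is the point the image of the shadow is meant to convey.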
The image grew out of Li's long-standing attention to the difference between what machines can do and what they understand. After the 2012 ImageNet breakthrough, and the later results in which machines matched or exceeded reported human accuracy on certain recognition benchmarks, Li pressed the question of what recognition actually meant. A system that labels a photograph's contents is not understanding the scene in the sense a human viewer does: it is not grasping the causal structure, the affordances, or the narrative that a human brings to the same image. The machines had become superb at the surface and left the interior untouched.
The wordsmith-in-the-dark formulation arrives in Li's mature work as a diagnosis of why language model fluency, however impressive, does not settle the question of machine understanding. It draws on an evolutionary argument: intelligence, in her account, did not begin with language but with sensing and acting—with the perception-action loop that drove biological cognitive evolution for hundreds of millions of years before anything like language appeared. A system that enters the cognitive picture at the level of language, without passing through the evolutionary substrate of embodied perception and action, is starting very late in the story and missing the foundation that makes language meaningful in the first place.
Fluency without grounding. The wordsmith-in-the-dark image captures a specific and consequential structure: a system can be arbitrarily fluent, producing grammatically perfect, contextually appropriate, stylistically sophisticated text, without being grounded in the world the text refers to. On this view, the gap is not a limitation of current systems that more training data will fix; it is a structural feature of learning from text alone. The symbol grounding problem does not dissolve when the symbol-manipulating system becomes very large.
The dark as absence of the perception-action loop. What is missing in the dark is not merely sensory input—some multimodal systems receive images and audio—but the full perception-action loop: the ongoing cycle of perceiving a physical situation, acting on it, perceiving the consequences of the action, and updating a model of the world accordingly. This loop is, in Li's analysis, the evolutionary engine of intelligence and the source of embodied understanding. A system that receives perceptual inputs without taking actions in a physical world is not closing the loop; it is receiving shadows of a causal structure it never directly contacts.
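The cycle named above (perceive, act, perceive the consequence, update) can be sketched in a few lines. This is a toy illustration under stated assumptions: `ToyWorld` and `Agent` are hypothetical names, the "world" is a single block on a line, and the "world model" is one number estimating how far a push moves the block. It is not drawn from Li's work and is far simpler than anything that would count as embodied learning.

```python
class ToyWorld:
    """Hypothetical 1-D world: a block that moves when pushed."""

    def __init__(self, step=3):
        self._step = step   # hidden causal parameter; the agent never reads it
        self.position = 0

    def observe(self):
        return self.position

    def push(self, direction):
        # direction is +1 (right) or -1 (left)
        self.position += direction * self._step


class Agent:
    """Learns the effect of its own pushes by closing the loop."""

    def __init__(self):
        self.estimated_step = 0.0   # world model: displacement per push
        self.trials = 0
        self.errors = []            # prediction error on each trial

    def run_loop(self, world, directions):
        for direction in directions:
            before = world.observe()                       # perceive
            predicted = before + direction * self.estimated_step
            world.push(direction)                          # act
            after = world.observe()                        # perceive the consequence
            self.errors.append(abs(after - predicted))
            observed_step = (after - before) * direction   # direction is +1 or -1
            self.trials += 1
            # update: running average of the observed displacement per push
            self.estimated_step += (observed_step - self.estimated_step) / self.trials
        return self.estimated_step


world = ToyWorld(step=3)
agent = Agent()
estimate = agent.run_loop(world, [1, -1, 1, 1])
print(estimate, agent.errors)
```

The agent recovers the hidden parameter only because it acts and then observes what its action did; its prediction error falls to zero after the first push. A system that was merely shown a list of block positions, without knowing which pushes produced them, would have the observations but not the causal link between action and outcome.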
Spatial intelligence as the light. Li's proposed solution is not to make language models bigger but to develop spatial intelligence—generative systems that can produce and reason about consistent three-dimensional worlds and that learn, through the perception-action loop, the causal structure of physical space. A system that can genuinely navigate, manipulate, and predict the behavior of a physical world has turned the light on in the way that a wordsmith alone cannot. Whether spatial grounding is sufficient for the kind of understanding that language models lack, or whether further dimensions of biological embodiment are required, remains an open research question that Li's work at World Labs is designed to press.