CONCEPT

World Model Architecture

Yann LeCun’s proposed foundation for genuine machine intelligence: an internal simulation of how reality behaves that would allow a system to predict, plan, and reason about the consequences of its actions rather than merely completing patterns in text.

In Yann LeCun’s diagnosis of where current AI systems fall short, the world model is the missing component that separates pattern-matching from understanding. A system with a world model carries a working simulation of how its environment behaves: given a current state, the model predicts what will happen next, and given an intended action, it predicts how the state would change. This capacity for internal prediction enables planning: the ability to imagine the consequences of an action before taking it, to search through possible futures for the one that best serves a goal, and to reason about cause and effect rather than retrieve text that sounds like causal reasoning. Large language models can describe how a dropped glass breaks without modeling it. They have absorbed vast quantities of text about falling objects and can generate statistically plausible sentences about them, but they do not simulate the falling glass, and the difference is not academic. On LeCun’s account, a system with a world model would be far less prone to hallucinating physical impossibilities, would cope better when a task departs from its training distribution, and would have some sense of the shape of its own ignorance.

LeCun’s 2022 position paper, A Path Towards Autonomous Machine Intelligence, proposes an architecture organized around a world model with complementary modules for perception, cost (motivation), memory, and action, tied together by a configurator that focuses the system on whatever task is at hand. The proposal is explicitly a research program rather than a finished design: LeCun is candid that training a world model capable of predicting at multiple levels of abstraction from high-dimensional sensory data remains largely unsolved. But the concept is his central claim about what intelligence requires, and it provides the framework against which he judges every other approach in the field.

In the You On AI Field Guide

The cycle’s account of the amplifier framework takes a specific form when read through LeCun’s world-model lens: what current AI systems amplify is not understanding but the retrieval of patterns that resemble understanding in the training data. The writer using a language model is amplified by a system that has extracted statistical structure from descriptions of the world without having a model of the world; the scientist using an AI research tool is working with a system that can surface correlations without understanding mechanisms. The world model concept names what is absent from the collaboration—the internal simulation that would make the tool’s outputs reliably grounded in how things actually work rather than how they are typically described.

The cycle also finds in the world model concept a frame for the fluency-authority decorrelation: a system without a world model has no internal reference against which to check its outputs. It generates text that sounds like what a system with a world model would generate, without the model that would make the text reliable. Within this framework, hallucination is not a bug; it is the direct expression of a system that optimizes for plausibility rather than for truth, because it has no representation of truth independent of its training data.

Origin

The concept of an internal world model is not LeCun’s invention—it traces to Kenneth Craik’s 1943 The Nature of Explanation, which proposed that intelligent creatures carry “small-scale models” of reality in their heads, and to the tradition of model-based reinforcement learning and model predictive control in robotics. LeCun’s contribution is to synthesize this tradition with his energy-based modeling framework and his program of self-supervised learning from high-bandwidth sensory data, and to argue that the world model is the specific piece that the dominant AI paradigm has systematically failed to build.

His 2022 position paper, A Path Towards Autonomous Machine Intelligence, is the most complete statement of the argument. It proposes a hierarchical world model in which higher levels predict abstract, long-range outcomes while lower levels fill in concrete, short-range details—allowing the system to plan at the level of goals and intentions rather than individual motor commands. The paper is explicitly tentative, describing unsolved problems at every level of the architecture, which LeCun himself regards as evidence of intellectual honesty rather than incompleteness.

Key Ideas

Prediction before action. The defining capacity of a world model is the ability to run consequences forward before committing to them. This is what distinguishes planning from reaction: a reactive system maps an input to an output in a single forward pass; a planning system uses its world model to search through possible action sequences for the one that best achieves its goal. Autoregressive language models are reactive by construction: each token is emitted from a single forward pass, with no explicit search over alternative futures. The world model is the architectural requirement for a system that can deliberate.
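
The contrast between reacting and planning can be made concrete with a minimal sketch, assuming a toy one-dimensional world in which the state is an integer position. The names `world_model`, `cost`, `plan`, and `ACTIONS` are hypothetical and chosen only for illustration, and the hand-written transition function stands in for what would be a learned predictor. The planner imagines every action sequence up to a short horizon and keeps the cheapest one, without acting on any of them.

```python
from itertools import product

# Hypothetical action set: step left, stay, step right.
ACTIONS = (-1, 0, 1)

def world_model(state, action):
    """Toy stand-in for a learned predictor: next state given state and action."""
    return state + action

def cost(state, goal):
    """Distance to the goal; lower is better."""
    return abs(goal - state)

def plan(state, goal, horizon=3):
    """Exhaustively search action sequences of length `horizon`.

    Each sequence is rolled out inside the world model only, so
    consequences are imagined before any action is taken.
    """
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        imagined = state
        total = 0
        for action in seq:
            imagined = world_model(imagined, action)
            total += cost(imagined, goal)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost

print(plan(0, 2))  # ((1, 1, 0), 1)
```

In a loop in the style of model predictive control, the system would execute only the first action of the returned sequence, observe the resulting state, and plan again. Exhaustive search is used here purely for clarity; it grows exponentially with the horizon.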

Hierarchical abstraction. The world operates at many scales of time and abstraction; a single flat predictor cannot handle all of them. LeCun proposes a hierarchy of world models in which each level predicts over a longer time horizon and coarser representation than the one below it. The lowest level handles immediate sensorimotor details; higher levels reason about goals and intentions. This hierarchy is what allows a system to plan a multi-step task without needing to specify every physical detail of every step—analogous to how a person decides to walk to the kitchen without consciously planning each muscle contraction.
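
A rough sketch of the two-level idea, again in a toy one-dimensional world and under strong simplifying assumptions: the upper level reasons only in coarse subgoals spaced `stride` units apart, and the lower level fills in unit steps toward the nearest subgoal. The function names and the `stride` parameter are hypothetical, and both levels are hand-written here rather than learned.

```python
def high_level_plan(start, goal, stride=4):
    """Coarse level: propose subgoals at most `stride` units apart, ending at the goal."""
    subgoals = []
    position = start
    while position != goal:
        # Clamp the remaining distance to one coarse step in either direction.
        step = max(-stride, min(stride, goal - position))
        position += step
        subgoals.append(position)
    return subgoals

def low_level_plan(state, subgoal):
    """Fine level: unit steps that carry the state to a nearby subgoal."""
    direction = 1 if subgoal > state else -1
    return [direction] * abs(subgoal - state)

def hierarchical_plan(start, goal, stride=4):
    """Expand each coarse subgoal into primitive actions, one subgoal at a time."""
    actions = []
    state = start
    for subgoal in high_level_plan(start, goal, stride):
        actions.extend(low_level_plan(state, subgoal))
        state = subgoal
    return actions

print(high_level_plan(0, 10))    # [4, 8, 10]
print(hierarchical_plan(0, 10))  # ten unit steps of +1
```

The point of the split is that the upper level never touches individual steps: it commits to three subgoals, and the detail of how each one is reached is deferred to the level below.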

The Joint Embedding Predictive Architecture as implementation. JEPA is LeCun’s proposed mechanism for training a world model from raw sensory data. Rather than predicting in observation space (every pixel of the next video frame), JEPA predicts in abstract representation space—passing both seen and unseen portions through encoders that distill them into compact representations, then predicting the unseen representation from the seen. This frees the system from modeling unpredictable detail and allows it to focus on the structural regularities that matter for prediction and planning. The central technical challenge is preventing representational collapse: the system discovering that it can minimize prediction error by mapping everything to the same uninformative constant.
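
The collapse problem can be illustrated with a small sketch in plain Python. It assumes linear encoders and a linear predictor, a single encoder shared by the seen and unseen parts, and hand-picked weights instead of training; all of these are simplifications for illustration, and every name below is hypothetical. The variance penalty is a hinge in the style of VICReg, shown as one possible remedy; the text above does not say which anti-collapse method is intended.

```python
import math

def linear(x, weights):
    """Apply a weight matrix (given as a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def jepa_prediction_loss(batch, enc_w, pred_w):
    """Mean squared error between predicted and actual target representations."""
    total = 0.0
    for context, target in batch:
        s_x = linear(context, enc_w)  # representation of the seen part
        s_y = linear(target, enc_w)   # representation of the unseen part
        s_hat = linear(s_x, pred_w)   # prediction made in representation space
        total += sum((p - t) ** 2 for p, t in zip(s_hat, s_y))
    return total / len(batch)

def variance_penalty(reprs, min_std=1.0, eps=1e-4):
    """Hinge on per-dimension standard deviation across a batch.

    The penalty is large when every input maps to nearly the same vector.
    """
    n, dims = len(reprs), len(reprs[0])
    penalty = 0.0
    for d in range(dims):
        values = [r[d] for r in reprs]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        penalty += max(0.0, min_std - math.sqrt(var + eps))
    return penalty / dims

# Toy data: the unseen part is the seen part with its two coordinates swapped.
contexts = [[2, 0], [0, 2], [-2, 0], [0, -2]]
batch = [(c, [c[1], c[0]]) for c in contexts]
swap = [[0, 1], [1, 0]]       # predictor that matches the toy dynamics
identity = [[1, 0], [0, 1]]   # informative encoder
collapsed = [[0, 0], [0, 0]]  # encoder that maps everything to zero

for name, enc in (("identity", identity), ("collapsed", collapsed)):
    reprs = [linear(c, enc) for c, _ in batch]
    print(name, jepa_prediction_loss(batch, enc, swap), variance_penalty(reprs))
```

Both encoders reach zero prediction error, so prediction error alone cannot tell them apart. Only the variance penalty separates them: it is zero for the informative encoder and roughly 0.99 for the collapsed one.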

Grounding in perception rather than language. LeCun argues that world models must be trained from high-bandwidth sensory data—primarily video—because text is a compressed, discretized, already-interpreted shadow of reality. A child acquires a functional model of physical reality from months of watching and interacting with the world, without language instruction. A system that has only been trained on text has learned the record that language makes of reality, not the reality itself. The world model concept implies a research agenda centered on video understanding and sensorimotor learning rather than on scaling text prediction.
