The surprise discovery that predicting the next token in a sequence, scaled enormously, produces behavior indistinguishable from reasoning — the unexpected channel through which general intelligence arrived.
Large language models are pretrained on a single objective: given a sequence of tokens, predict the next one. There is no reasoning term in the loss, no planning term, no understanding term. Everything the model does at inference time (chain-of-thought math, coding, conversation, explanation, creative writing) is downstream of that one objective, applied to trillions of tokens over months of compute. The fact that next-token prediction, scaled sufficiently, produces behavior that passes for intelligent reasoning is the most consequential empirical surprise of the last decade of AI research. The field discovered the phenomenon first and is still building the theory to explain it.
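To make the objective concrete, here is a minimal sketch of a next-token loss in plain Python. It assumes the model has already produced one list of unnormalized scores (logits) per position; the function name next_token_loss, the three-token vocabulary, and the logit values are all hypothetical, chosen only for illustration. Real training code computes the same quantity over large batches on accelerators, but the point carries over: the only thing being scored is the probability assigned to the token that actually came next.

```python
import math

def next_token_loss(logits_per_position, tokens):
    """Average cross-entropy of predicting tokens[t + 1] from position t.

    logits_per_position[t] holds one unnormalized score per vocabulary
    entry, assumed to be produced after the model has read tokens[0..t].
    """
    total = 0.0
    for t in range(len(tokens) - 1):
        logits = logits_per_position[t]
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        target = tokens[t + 1]
        # -log p(target | prefix); nothing here scores reasoning or planning.
        total += log_z - logits[target]
    return total / (len(tokens) - 1)

# Hypothetical toy data: a 3-token vocabulary and a 3-token sequence.
tokens = [0, 2, 1]

# A model with no preferences spreads probability evenly over the vocabulary.
uniform = [[0.0, 0.0, 0.0] for _ in tokens]
loss_uniform = next_token_loss(uniform, tokens)

# A model that puts most of its probability on the true next token.
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
loss_confident = next_token_loss(confident, tokens)

print(loss_uniform, loss_confident)
```

The uniform model's loss equals the natural log of the vocabulary size, and the loss falls as more probability lands on the token that actually follows. Training consists of adjusting parameters to push this one number down across the corpus.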
Text Prediction as Architecture
In The You On AI Field Guide
The prevailing expectation, in the 2010s and earlier, was that intelligence would require explicit representations: rules, frames, ontologies, structured knowledge bases. The classical AI program built these by hand (Cyc, Soar, SHRDLU); the statistical-AI program built them from data (decision trees, Bayes nets, SVMs). Neither produced anything like general capability. The transformer-plus-scale