CONCEPT

The Markov Chain

A mathematical structure in which the future depends only on the present—each state’s probability conditioned solely on the state immediately preceding it—the structure Andrey Markov forged in 1906 to win a mathematical argument about free will, and the structure that, extended to sequences of words and tokens, underlies every language model ever built.

The Markov chain is a controlled act of forgetting. It says: to predict what comes next, you need to know only where you are now. Everything that led to the present state—the full trailing history of how the system arrived—is irrelevant once the present state is known. This disciplined amnesia is what makes the mathematics tractable and what makes the chain so widely useful: it captures a genuine class of real-world processes while remaining computable. Andrey Markov built the structure in 1906 to demonstrate that the law of large numbers applies to dependent events as well as independent ones—a proof aimed at defeating a colleague who had inferred free will from the stability of social statistics. The weapon he built for that fight became the first formal model of language: when he applied his chains to the letter sequences of Pushkin’s Eugene Onegin in 1913, he produced the first statistical analysis of language in history. The chain did not stay narrow. Every n-gram model, every spell-corrector, every speech recogniser, Google’s PageRank algorithm—all are Markov chains over their respective state spaces. And every large language model’s core operation—next-token prediction, computing the probability distribution over the vocabulary given the context and sampling one token from it—is a walk through an implicit Markov chain of enormous scale. The architecture has grown unrecognisably more powerful. The question remains Markov’s: given what came before, what is likely to come next?

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI asks its readers to see the machinery beneath the magic. The Markov chain is that machinery, named and made exact. When a language model produces a paragraph of fluent, coherent text, the mechanical description is: at each step, the model computed a probability distribution over possible next tokens given the context and drew one from that distribution; it then advanced to the new context and repeated the operation until done. This is a walk through a Markov chain of near-infinite state space, conditioned on a context window that may span thousands of tokens. The walk is local at every step—the model reaches only as far back as its context window—and the coherence it produces is an emergent property of how richly the context conditions the local choices, not of any global plan or intention.

The chain also enters the cycle as the source of the most precise account available of why these systems hallucinate structurally rather than incidentally. The chain generates the probable. A probable continuation is not necessarily a true one: truth and frequency are correlated in training data, but imperfectly so. When a model asserts a false claim with fluent confidence, it is producing the continuation that the transition probabilities—estimated from a corpus that contained fluent, confident false claims—make most probable. The hallucination is not a malfunction. It is the chain operating correctly on a probability distribution that conflates surface plausibility with factual accuracy. Fixing this requires something the chain cannot provide from inside: a representation of the distinction between the probable and the true, which is not a statistical property of sequences.

Origin

Markov introduced the chain in his 1906 paper as a technical counterexample to a theological argument. He proved that even sequences of dependent events—where each event’s probability depends on the preceding event—converge to stable long-run averages, the same behaviour that his opponent Nekrasov had attributed exclusively to independent events and inferred from it a mathematical proof of free will. Markov’s chains showed that you could have the statistical regularity without the independence, and therefore without any inference to the metaphysics of the soul.

In 1913 he applied the chains to 20,000 letters of Eugene Onegin, computing the conditional probability that a vowel or consonant would follow a vowel or consonant—the first bigram model in history. Claude Shannon extended this in 1948, using Markov models of English to generate text that grew eerily language-like as the model’s memory deepened. The direct chain of intellectual descent from Markov’s 1906 paper to the transformer architectures of present-day language models is unbroken.

Key Ideas

The Markov Property. The future is conditionally independent of the past given the present. This is the defining rule of the chain: once you know the current state, the history of how you got there adds no predictive information. The property is simultaneously a mathematical convenience and a tension with human meaning, which lives in long-range dependencies that the property throws away—reference, narrative continuity, the kept promise. The entire arc of language model development is a struggle to stretch the “present” far enough to capture the dependencies that matter.

Transition Probabilities. The chain’s content is a matrix of numbers: for each possible current state, the probability of transitioning to each possible next state. These numbers must be non-negative and sum to one for each current state. In a language model, the “transition matrix” is not a stored table but a function computed by a vast neural network from the full context. The function is incomparably more powerful than a table, but the fundamental object it computes is the same: the probability of each next token given the current position.

The Stationary Distribution. Under mild conditions, a Markov chain eventually settles to a fixed distribution—one that is self-reinforcing under the chain’s own transition rule. This is the result Markov proved against Nekrasov: dependent events, run long enough, converge. It is also the mathematical foundation of Google’s PageRank algorithm, which defines the importance of a webpage as its probability under the stationary distribution of a random walk over the link graph.

Generation as a Walk. To generate a sequence, run the chain forward: from any state, sample the next state according to the transition probabilities, step there, and repeat. Shannon used this method in 1948 to generate letter-by-letter approximations to English. A modern language model generates token by token in exactly the same way, except that the state is the entire context window and the transition probabilities are computed by attention-based neural networks. The output is the record of the walk: a path through a vast space of possible continuations, shaped by the transition structure learned from training data.

In the [YOU] on AI Field Guide

Origin

Key Ideas

Related Entries

Further Reading