You On AI Field Guide · The Markov Chain The You On AI Field Guide Home
TxtLowMedHigh
CONCEPT

The Markov Chain

A mathematical structure in which the future depends only on the present—each state’s probability conditioned solely on the state immediately preceding it—the structure Andrey Markov forged in 1906 to win a mathematical argument about free will, and the structure that, extended to sequences of words and tokens, underlies every language model ever built.
The Markov chain is a controlled act of forgetting. It says: to predict what comes next, you need to know only where you are now. Everything that led to the present state—the full trailing history of how the system arrived—is irrelevant once the present state is known. This disciplined amnesia is what makes the mathematics tractable and what makes the chain so widely useful: it captures a genuine class of real-world processes while remaining computable. Andrey Markov built the structure in 1906 to demonstrate that the law of large numbers applies to dependent events as well as independent ones—a proof aimed at defeating a colleague who had inferred free will from the stability of social statistics. The weapon he built for that fight became the first formal model of language: when he applied his chains to the letter sequences of Pushkin’s Eugene Onegin in 1913, he produced the first statistical analysis of language in history. The chain did not stay narrow. Every n-gram model, every spell-corrector, every speech recogniser, Google’s PageRank algorithm—all are Markov chains over their respective state spaces. And every large language model’s core operation—next-token prediction, computing the probability distribution over the vocabulary given the context and sampling one token from it—is a walk through an implicit Markov chain of enormous scale. The architecture has grown unrecognisably more powerful. The question remains Markov’s: given what came before, what is likely to come next?
The Markov Chain
The Markov Chain

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI asks its readers to see the machinery beneath the magic. The Markov chain is that machinery, named and made exact. When a language model produces a paragraph of fluent, coherent text, the mechanical description is: at each step, the model computed a probability distribution over possible next tokens given the context and drew one from that distribution; it then advanced to the new context and repeated the operation until done. This is a walk through a Markov chain of near-infinite state space, conditioned on a context window that may span thousands of tokens. The walk is local at every step—the model reaches only as far back as its context window—and the coherence it produces is an emergent property of how richly the context conditions the local choices, not of any global plan or intention.

The chain also enters the cycle as the source of the most precise account available of why these systems hallucinate structurally rather than incidentally. The chain generates the probable. A probable continuation is not necessarily a true one: truth and frequency are correlated in training data, but imperfectly so. When a model asserts a false claim with fluent confidence, it is producing the continuation that the transition probabilities—estimated from a corpus that contained fluent, confident false claims—make most probable. The hallucination is not a malfunction. It is the chain operating correctly on a probability distribution that conflates surface plausibility with factual accuracy. Fixing this requires something the chain cannot provide from inside: a representation of the distinction between the probable and the true, which is not a statistical property of sequences.

Neural Networks
Neural Networks

Origin

Markov introduced the chain in his 1906 paper as a technical counterexample to a theological argument. He proved that even sequences of dependent events—where each event’s probability depends on the preceding event—converge to stable long-run averages, the same behaviour that his opponent Nekrasov had attributed exclusively to independent events and inferred from it a mathematical proof of free will. Markov’s chains showed that you could have the statistical regularity without the independence, and therefore without any inference to the metaphysics of the soul.

The Chain Structure
The Chain Structure

In 1913 he applied the chains to 20,000 letters of Eugene Onegin, computing the conditional probability that a vowel or consonant would follow a vowel or consonant—the first bigram model in history. Claude Shannon extended this in 1948, using Markov models of English to generate text that grew eerily language-like as the model’s memory deepened. The direct chain of intellectual descent from Markov’s 1906 paper to the transformer architectures of present-day language models is unbroken.

Large Language Models
Large Language Models

Key Ideas

The Markov Property. The future is conditionally independent of the past given the present. This is the defining rule of the chain: once you know the current state, the history of how you got there adds no predictive information. The property is simultaneously a mathematical convenience and a tension with human meaning, which lives in long-range dependencies that the property throws away—reference, narrative continuity, the kept promise. The entire arc of language model development is a struggle to stretch the “present” far enough to capture the dependencies that matter.

Text Prediction as Architecture
Text Prediction as Architecture

Transition Probabilities. The chain’s content is a matrix of numbers: for each possible current state, the probability of transitioning to each possible next state. These numbers must be non-negative and sum to one for each current state. In a language model, the “transition matrix” is not a stored table but a function computed by a vast neural network from the full context. The function is incomparably more powerful than a table, but the fundamental object it computes is the same: the probability of each next token given the current position.

Transformer Architecture
Transformer Architecture

The Stationary Distribution. Under mild conditions, a Markov chain eventually settles to a fixed distribution—one that is self-reinforcing under the chain’s own transition rule. This is the result Markov proved against Nekrasov: dependent events, run long enough, converge. It is also the mathematical foundation of Google’s PageRank algorithm, which defines the importance of a webpage as its probability under the stationary distribution of a random walk over the link graph.

Claude Shannon
Claude Shannon

Generation as a Walk. To generate a sequence, run the chain forward: from any state, sample the next state according to the transition probabilities, step there, and repeat. Shannon used this method in 1948 to generate letter-by-letter approximations to English. A modern language model generates token by token in exactly the same way, except that the state is the entire context window and the transition probabilities are computed by attention-based neural networks. The output is the record of the walk: a path through a vast space of possible continuations, shaped by the transition structure learned from training data.

Next-Token Prediction
Next-Token Prediction

Debates & Critiques

The central debate about the Markov chain in the context of modern AI is whether the transformer has transcended it or merely extended it. The sceptical view holds that the transformer is still fundamentally walking a Markov chain—the context window is a finite “present state,” the attention mechanism is a sophisticated way of computing transition probabilities conditioned on that state, and everything beyond the window’s edge is forgotten as completely as a bigram forgets everything but the last letter. The optimistic view holds that a chain whose “state” encodes thousands of tokens through deep attention is qualitatively different from any chain Markov envisioned—that the richness of the conditioning enables emergent properties, including something that functions like understanding, that no simple chain possesses. Markov’s own lesson does not resolve this debate but sharpens it: the question is not whether the mechanism is Markovian (it is) but whether the richness of the conditioning has produced something whose description requires a different vocabulary than “transition probabilities over a context window.” He would have wanted the answer to be mathematical, not impressionistic.

Further Reading

  1. Andrey Markov, “Investigation of a remarkable case of dependent trials” (1906); English translation in Brian Hayes, “First Links in the Markov Chain,” American Scientist 101(2), 2013
  2. Claude Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal 27(3&4), 1948
  3. David Link, “Traces of the Mouth: Andrei Andreyevich Markov’s Mathematization of Writing,” History and Philosophy of Logic 27(4), 2006
  4. J. R. Norris, Markov Chains (Cambridge University Press, 1997) — the standard graduate text
Explore more
Browse the full You On AI Field Guide — over 8,500 entries
← Home0%
CONCEPTBook →