
The cycle that began with [YOU] on AI asks its readers to see the machinery beneath the magic. The Markov chain is that machinery, named and made exact. When a language model produces a paragraph of fluent, coherent text, the mechanical description is: at each step, the model computed a probability distribution over possible next tokens given the context and drew one from that distribution; it then advanced to the new context and repeated the operation until done. This is a walk through a Markov chain of near-infinite state space, conditioned on a context window that may span thousands of tokens. The walk is local at every step—the model reaches only as far back as its context window—and the coherence it produces is an emergent property of how richly the context conditions the local choices, not of any global plan or intention.
The chain also enters the cycle as the source of the most precise account available of why these systems hallucinate structurally rather than incidentally. The chain generates the probable. A probable continuation is not necessarily a true one: truth and frequency are correlated in training data, but imperfectly so. When a model asserts a false claim with fluent confidence, it is producing the continuation that the transition probabilities—estimated from a corpus that contained fluent, confident false claims—make most probable. The hallucination is not a malfunction. It is the chain operating correctly on a probability distribution that conflates surface plausibility with factual accuracy. Fixing this requires something the chain cannot provide from inside: a representation of the distinction between the probable and the true, which is not a statistical property of sequences.
Markov introduced the chain in his 1906 paper as a technical counterexample to a theological argument. He proved that even sequences of dependent events—where each event’s probability depends on the preceding event—converge to stable long-run averages, the same behaviour that his opponent Nekrasov had attributed exclusively to independent events and inferred from it a mathematical proof of free will. Markov’s chains showed that you could have the statistical regularity without the independence, and therefore without any inference to the metaphysics of the soul.
In 1913 he applied the chains to 20,000 letters of Eugene Onegin, computing the conditional probability that a vowel or consonant would follow a vowel or consonant—the first bigram model in history. Claude Shannon extended this in 1948, using Markov models of English to generate text that grew eerily language-like as the model’s memory deepened. The direct chain of intellectual descent from Markov’s 1906 paper to the transformer architectures of present-day language models is unbroken.
The Markov Property. The future is conditionally independent of the past given the present. This is the defining rule of the chain: once you know the current state, the history of how you got there adds no predictive information. The property is simultaneously a mathematical convenience and a tension with human meaning, which lives in long-range dependencies that the property throws away—reference, narrative continuity, the kept promise. The entire arc of language model development is a struggle to stretch the “present” far enough to capture the dependencies that matter.
Transition Probabilities. The chain’s content is a matrix of numbers: for each possible current state, the probability of transitioning to each possible next state. These numbers must be non-negative and sum to one for each current state. In a language model, the “transition matrix” is not a stored table but a function computed by a vast neural network from the full context. The function is incomparably more powerful than a table, but the fundamental object it computes is the same: the probability of each next token given the current position.
The Stationary Distribution. Under mild conditions, a Markov chain eventually settles to a fixed distribution—one that is self-reinforcing under the chain’s own transition rule. This is the result Markov proved against Nekrasov: dependent events, run long enough, converge. It is also the mathematical foundation of Google’s PageRank algorithm, which defines the importance of a webpage as its probability under the stationary distribution of a random walk over the link graph.
Generation as a Walk. To generate a sequence, run the chain forward: from any state, sample the next state according to the transition probabilities, step there, and repeat. Shannon used this method in 1948 to generate letter-by-letter approximations to English. A modern language model generates token by token in exactly the same way, except that the state is the entire context window and the transition probabilities are computed by attention-based neural networks. The output is the record of the walk: a path through a vast space of possible continuations, shaped by the transition structure learned from training data.