PERSON

Andrey Markov

The Russian mathematician who, in 1913, stripped Pushkin’s Eugene Onegin to a ribbon of vowels and consonants to win an argument about free will, and in doing so produced one of the first statistical analyses of a literary text and the foundational structure from which n-gram language models, PageRank, and next-token prediction all descend.

Andrey Markov did not invent the language model. He invented something stranger and, in the end, more powerful: a proof that statistical regularity implies nothing about what lies beneath it, built from the observation that dependent events can produce the same stable averages as independent ones. He built this proof by counting vowels in Russia’s most beloved poem, not because he loved poetry but because he needed to win a mathematical argument about the soul. His weapon in that fight—the Markov chain, a sequence in which the future depends only on the present—became the mathematical ancestor of every system that now completes your sentence. The direct line from his grids of Cyrillic letters runs through Claude Shannon’s information theory, through the n-gram models of speech recognition, through Google’s PageRank algorithm, and into the transformer architectures of large language models: all are machines for estimating the probability of what comes next given what came before, all are walking his chain. The most important contribution he made to the present moment, however, is not the chain itself but the lesson he drew from it: that you cannot read the richness of the interior from the regularity of the surface, that statistical convergence proves nothing about meaning or freedom or understanding, and that a mechanism with no inner life whatsoever can produce, at scale, output indistinguishable from the products of thought. He aimed this lesson at a colleague who wanted to find the soul in a statistic. We are now finding it useful aimed at ourselves.

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI returns repeatedly to the question of what AI systems are, underneath their fluent surfaces. Markov’s framework is among the most deflationary available, and it is deflationary in a precise, mathematically grounded way that goes beyond vague scepticism. The chain computes likelihood: given what came before, which continuation is probable? It does not compute truth, or meaning, or value. It computes the probable, and the probable is what is frequent in the training data, and the frequent is not the same as the true, the right, or the just. Markov’s chain is the honest mechanical description of what these systems do, and Markov himself—who famously declared that he regarded the applicability of his mathematics to the world “with indifference”—described the indifference perfectly. The chain is indifferent: it generates the probable with no mechanism for caring whether the probable is true.

Markov’s quarrel with Pavel Nekrasov enters the cycle as a precise analogue of the present debate about machine understanding. Nekrasov saw stable statistical aggregates and inferred a rich interior reality—free souls, independent choices. Markov proved that an entirely mechanical system of pure dependence could produce identical aggregates, thereby showing that the surface implied nothing about the depth. The AI enthusiast who sees fluent, coherent output and infers understanding commits the Nekrasov error in modern dress: reading metaphysics off a statistical surface. But Markov’s framework also warns the sceptic: statistical behaviour does not prove the absence of understanding any more than it proves its presence. The surface is silent about what is behind it, in both directions.

The cycle also takes from Markov the sharpest account available of the distributional harms of systems that generate the probable. A language model trained on human text learns what is frequent. The frequent encodes every asymmetry, every bias, every stereotype casually reproduced across millions of documents. When the model generates the probable continuation, it quietly normalises the frequent—which is to say, it treats the historical record as the template for the present and the future. Markov’s proof that statistical regularity carries no normative weight was aimed at Nekrasov’s inference from statistics to freedom. The same proof applies, now, to the inference from statistics to what ought to be produced. The probable is not the just, and the machine that generates the probable is making no claim about justice. We must supply the claim from outside the chain.

His account of the Markov property—the rule that the future depends only on the present, that the entire prior history is compressed into the current state and everything else discarded—frames the entire arc of language model development as a struggle to recover what the property throws away. Human meaning lives in long-range dependencies: the pronoun that refers to a noun ten sentences back, the plot point established in chapter one and paid off in chapter twenty, the promise made in an early conversation and remembered in a later one. The transformer’s attention mechanism is an engineering attempt to escape the Markov property’s amnesia without paying the combinatorial cost of enumerating all possible joint states. It succeeds brilliantly—but the context window is still finite, and everything beyond its edge falls off the back of the world, forgotten as completely as Markov’s bigram forgot everything but the last letter.

Origin

Andrey Andreyevich Markov was born in Ryazan in 1856, trained at St. Petersburg University under Pafnuty Chebyshev, and inherited Chebyshev’s chair—and, with it, the tradition of Russian mathematics that cared more for rigorous application than for pure abstraction. He was a combative figure: he protested the Tsar’s veto of Maxim Gorky’s election to the Academy of Sciences, and he asked to be formally excommunicated from the Russian Orthodox Church in solidarity with Leo Tolstoy when Tolstoy was excommunicated. His intellectual combativeness found its purest expression in a decade-long feud with Pavel Nekrasov of Moscow University.

Nekrasov, a mathematician with a theological education, had argued in 1902 that the law of large numbers—the statistical theorem that averages stabilise over many trials—required independent events, and that since social statistics obeyed this law, individual human acts must be independent, which he took to constitute mathematical evidence for free will. Markov found this inference intolerable, not primarily for political reasons but because it rested on a mathematical error. The law of large numbers did not require independence; he would prove this by constructing sequences of emphatically dependent events that still converged. He needed a large body of naturally dependent data, and he found it in Alexander Pushkin’s verse novel Eugene Onegin. In January 1913 he presented to the Academy his analysis of the first twenty thousand letters of the poem, showing that the vowel-consonant sequences—emphatically dependent on each other—converged to stable statistical averages. The mathematical weapon he forged for this fight was the chain that carries his name.
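The shape of the argument can be sketched in a few lines of Python. This is a minimal simulation, not a reconstruction of Markov’s own calculation: the two transition probabilities are illustrative assumptions, and the function names are hypothetical. Each letter class depends strongly on the one before it, yet the running share of vowels still settles near a fixed value.

```python
import random


def simulate_vowel_fraction(n_steps, p_v_after_v, p_v_after_c, seed=0):
    """Walk a two-state chain (V = vowel, C = consonant) and return the share of vowels."""
    rng = random.Random(seed)
    state = "C"  # arbitrary starting state
    vowels = 0
    for _ in range(n_steps):
        # The chance of a vowel depends on the previous letter class: the events are dependent.
        p_vowel = p_v_after_v if state == "V" else p_v_after_c
        state = "V" if rng.random() < p_vowel else "C"
        if state == "V":
            vowels += 1
    return vowels / n_steps


def stationary_vowel_share(p_v_after_v, p_v_after_c):
    """Long-run vowel share of the two-state chain, from pi = pi * a + (1 - pi) * b."""
    return p_v_after_c / (1 - p_v_after_v + p_v_after_c)


# Illustrative values only: a vowel rarely follows a vowel and often follows a consonant.
P_V_AFTER_V = 0.13
P_V_AFTER_C = 0.66

for n in (100, 10_000, 200_000):
    print(n, round(simulate_vowel_fraction(n, P_V_AFTER_V, P_V_AFTER_C), 4))
print("predicted", round(stationary_vowel_share(P_V_AFTER_V, P_V_AFTER_C), 4))
```

The longer the walk, the closer the observed share should sit to the predicted one, even though no two adjacent letters are independent. That is the whole of the point against Nekrasov: a stable average is not evidence of independence.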

Markov died in 1922, five years after the Russian Revolution and roughly two decades before the first electronic computers. He could not have foreseen that the mathematical structure he built to refute a claim about the soul would become the architecture of systems now reshaping the economy of every knowledge-work profession. The chain was not prophecy. It was a key that fitted a door not yet built.

Key Ideas

The Markov Chain. A sequence of states in which the probability of each state depends only on the state immediately preceding it, not on the full history of how the system arrived. This controlled forgetting is the chain’s genius: it makes the mathematics tractable, permits exact theorems about convergence and equilibrium, and captures a surprising range of real-world processes. An n-gram language model, PageRank, and a transformer’s next-token prediction are each, at core, a version of this structure: states, transitions, the probability that one thing follows another. What differs is the state: the previous n−1 tokens for an n-gram model, the current page for PageRank, the whole context window for a transformer.
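A chain of this kind can be estimated from text by counting adjacent pairs. The sketch below is a loose, hedged analogue of the vowel-and-consonant exercise, not Markov’s procedure: it assumes English vowels and an English sample sentence purely for illustration, and the helper names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels, for illustration only


def to_classes(text):
    """Reduce text to a string of 'V' and 'C', dropping anything that is not a letter."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()
    )


def transition_probs(seq):
    """Estimate P(next | current) from the adjacent pairs in seq."""
    pair_counts = Counter(zip(seq, seq[1:]))
    from_counts = Counter(seq[:-1])  # how often each state has a successor
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}


sample = "the future depends only on the present"
classes = to_classes(sample)
for (current, nxt), p in sorted(transition_probs(classes).items()):
    print(f"P({nxt} | {current}) = {p:.3f}")
```

The resulting table is the entire model. Nothing about the text survives except how often one class of letter follows another.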

The Markov Property. The future depends only on the present. Everything before the present is irrelevant once the current state is known. This is simultaneously the chain’s tractability and its deepest tension with human meaning, which lives in long-range dependency—in reference, in argument, in narrative continuity. The property is the design choice that made the chain computable and the limitation that every subsequent language technology has struggled to overcome.
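Stated formally, in standard textbook notation that the entry itself does not use (the symbols X_n for the state at step n and x for a particular value are assumptions introduced here), the property says that conditioning on the whole history gives the same answer as conditioning on the present state alone:

```latex
P(X_{n+1} = x \mid X_n = x_n,\, X_{n-1} = x_{n-1},\, \dots,\, X_0 = x_0)
  \;=\; P(X_{n+1} = x \mid X_n = x_n)
```

Everything to the left of X_n in the conditioning list is what the property discards.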

Probability Without Metaphysics. Markov’s central lesson, aimed at Nekrasov and available to any age: statistical regularity implies nothing about what lies beneath it. A chain converges to stable averages; this tells you nothing about whether the process that generated those averages has freedom, intention, or understanding. Fluent output, coherent output, output that passes every statistical test of language—none of this establishes an interior. The surface and the depth are, as Markov proved, completely decoupled. This is the most important epistemological contribution his framework makes to the present debate about AI.

The Probable Is Not the True. The chain generates likelihood and only likelihood. It has no mechanism for representing whether a transition leads toward the true or the false, the real or the imagined. A system built on this foundation generates the probable, and the probable is correlated with the true—because true things tend to appear in training data—but the correlation is imperfect. Hallucination is the structural signature of this imperfection: a system that generates the probable will sometimes generate a probable falsehood, dressed in the same syntax of certainty as a probable truth, with no internal signal distinguishing the two.
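A toy illustration of the gap, under loudly stated assumptions: the four-sentence corpus below is invented for the purpose, the function names are hypothetical, and a word-level bigram counter stands in for a far richer model. The false sentence is simply more frequent than the true one, so the most probable continuation is the false one.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def most_probable_next(follows, word):
    """Return the most frequent successor of word in the training data."""
    return follows[word].most_common(1)[0][0]


# Invented corpus: the common error outnumbers the correct statement three to one.
corpus = ["the capital of australia is sydney"] * 3 + [
    "the capital of australia is canberra"
]

model = train_bigrams(corpus)
print(most_probable_next(model, "is"))
```

The counts contain nothing that marks one continuation as correct and the other as mistaken. Frequency is the only signal available, and here frequency points the wrong way.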

The Stationary Distribution. A finite Markov chain in which every state can be reached from every other, and which is not locked into a fixed cycle, eventually settles to a fixed probability distribution over its states, the stationary distribution, regardless of where it began. This convergence is the mathematical fact that makes PageRank possible (the fraction of time a random web-walker spends on each page is its stationary probability under the link graph) and the fact that connects the chain to Markov’s original proof against Nekrasov. Dependent events can converge. Convergence tells you nothing about independence.
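The convergence can be watched directly on a small example. This is a minimal sketch assuming a hypothetical three-page link graph and hypothetical function names. It is a plain random walk, not PageRank proper, which also adds a damping factor that is omitted here for brevity.

```python
def step(dist, links):
    """One step of the walk: split each page's probability evenly across its outgoing links."""
    nxt = {page: 0.0 for page in links}
    for page, outs in links.items():
        share = dist[page] / len(outs)
        for target in outs:
            nxt[target] += share
    return nxt


def stationary(links, start, iterations=200):
    """Apply the step repeatedly and return the distribution reached."""
    dist = dict(start)
    for _ in range(iterations):
        dist = step(dist, links)
    return dist


# Hypothetical link graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

from_a = stationary(links, {"A": 1.0, "B": 0.0, "C": 0.0})
from_b = stationary(links, {"A": 0.0, "B": 1.0, "C": 0.0})

print({page: round(p, 4) for page, p in from_a.items()})
print({page: round(p, 4) for page, p in from_b.items()})
```

Both starting points should arrive at the same distribution, which for this graph works out by hand to 0.4 for A, 0.2 for B, and 0.4 for C. Where the walker began is forgotten.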

Debates & Critiques

The debate Markov’s framework generates for AI runs in two directions. The first concerns whether the Markovian description is the whole story or only the mechanical substrate. Optimists argue that when the conditioning context is rich enough, when the “state” of the model encodes thousands of tokens through deep attention, the walk through the chain produces something that deserves a different name than “merely statistical.” The coherence, the contextual sensitivity, the capacity to track an argument across paragraphs: these are not what a naive bigram produces, and the difference between the two is not trivial. Emergent capabilities may constitute a qualitative change, not just a quantitative one. Sceptics, following Markov’s own lesson, counter that the surface, however rich, is silent about the depth. More sophisticated transition probabilities, computed over longer contexts, are still transition probabilities. The mechanism is the same; the behaviour is better; and Markov proved that better behaviour does not upgrade the mechanism’s metaphysical status.

The second debate concerns the distributional harms of systems that generate the frequent. If the probable encodes historical injustice, and the system generates the probable at scale, then the system amplifies injustice without malice, as a structural feature of probability-based generation. Markov’s proof that statistical regularity carries no normative content is both the diagnosis and the limit of the remedy: the chain cannot be fixed from inside, because values are not a statistical property of sequences. They must be imposed from without, by the humans and institutions that deploy the chain.

The Chain and Its Limits

What Markov’s structure can and cannot reach

The Chain Can

Generate the Probable

From transition probabilities over states, the chain produces sequences that have the same statistical regularities as its training data. Coherence, fluency, and local plausibility are exactly what the chain is designed to produce, and it produces them with remarkable reliability.

The Chain Cannot

Reach Truth or Meaning

Transition probabilities encode likelihood. They contain no representation of whether a continuation corresponds to anything real, whether a claim is true, whether a choice is just. The chain generates the probable; truth, meaning, and value must be supplied from outside it.

The Chain’s Lesson

Statistics Prove Nothing

Markov’s foundational result: statistical regularity implies nothing about the richness of the interior. Fluent output does not establish understanding any more than stable social statistics establish free will. The surface does not settle the question.
