CONCEPT

The Markov Property

The rule that the future depends only on the present—that to predict what comes next, the entire history is irrelevant once the current state is known—the disciplined amnesia at the heart of Markov’s chains and the deepest tension in the architecture of every language model that has ever tried to hold a conversation.

The Markov property is a bargain struck with complexity. The world’s true dependencies stretch backward without limit: for want of a nail the kingdom was lost, and meaning often lives in references established ten chapters before the point of use. To model all of that is computationally hopeless. Markov’s bargain was to compress the entire relevant past into the present state and then discard everything else. If the current state genuinely captures everything that matters for what comes next, the forgetting costs you nothing and buys you a mathematics you can actually compute. The property is simultaneously the chain’s genius and its deepest limitation: it produces tractable, exact theorems about convergence and stationary distributions, and it discards precisely the long-range dependencies in which human meaning lives. A first-order chain over words cannot know that a pronoun refers to a noun established three paragraphs earlier, because its state is the last word alone and has forgotten everything before the last step. The entire arc of language model development, from Markov’s bigrams through n-grams to the transformer’s attention mechanism, is an engineering struggle to recover what the property threw away: the reach of the past, the dependencies that span documents rather than sentences, the coherence that requires remembering what was said before the context window’s edge. The property has been stretched enormously by each generation of that struggle. It has never been abolished. Every language model that exists today has a context window of finite length, and everything beyond its left edge is forgotten as completely as a bigram forgets everything but the last letter.

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI documents, in its accounts of human-AI collaboration, the specific ways in which these systems are powerful and the specific ways in which they fail. The Markov property is the structural root of the most characteristic class of failure: the model that holds a conversation beautifully within a session but loses the thread across sessions, that maintains coherence within its context window and loses it at the boundary, that cannot track a commitment made before the window’s edge or honour a reference established before the point at which its view begins. These are not engineering failures awaiting better design. They are expressions of the property’s amnesia at the scale of the window’s limit.

The property also enters the cycle as the structural explanation for why these systems cannot, without external scaffolding, develop the kind of accumulated relationship with a person that characterises meaningful collaboration. A human colleague remembers the conversation of six months ago; a Markov process knows only its context window. The difference is not merely quantitative. It is the difference between a relationship, which is constituted by accumulated shared history, and a service, which is constituted by the quality of the present interaction. Every mechanism for extending AI memory—retrieval-augmented generation, explicit memory stores, session summaries fed back into the context—is an attempt to restore the reach of the past that the property discards.

Origin

Markov introduced the property as the defining condition of his chains in his 1906 paper. Formally, a stochastic process has the Markov property if the conditional probability distribution of future states given the present state is the same as the conditional probability distribution of future states given the present state and all past states. Put informally, the present state screens off the past from the future. The property was designed to be restrictive (Markov knew it did not hold for all processes) but to hold for a broad and practically important class, and to enable exact theorems wherever it held.
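
In symbols, the definition can be written as follows for the discrete-time case. The notation is an assumption of this sketch, not Markov's own: X_n denotes the state at step n, and x, x_0, ..., x_n are arbitrary values for which the conditioning event has positive probability.

```latex
\[
P\left(X_{n+1} = x \mid X_n = x_n,\, X_{n-1} = x_{n-1},\, \dots,\, X_0 = x_0\right)
  = P\left(X_{n+1} = x \mid X_n = x_n\right)
\]
```

The left-hand side conditions on the whole history and the right-hand side only on the present state; the property asserts that the two agree.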

Higher-order Markov processes, in which the current state is defined as a window of the last k observations rather than the single last observation, allow the property to hold while encoding more of the recent past. This is precisely the move from bigrams to n-grams: increase k, and the model can honour dependencies reaching further back, at the cost of an exponential explosion in the number of possible states. The transformer architecture escapes this explosion by replacing the enumeration of joint states with a learned function—the attention mechanism—that computes, on the fly, how much each position in the context contributes to the prediction at the current position.
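
To make "computes, on the fly, how much each position contributes" concrete, here is a minimal sketch of the weighting step in scaled dot-product attention, written in plain Python. The vectors and the function name `attention_weights` are invented for illustration. A real transformer learns the query and key projections, uses many heads and layers, and applies masking, none of which is shown here.

```python
import math

def attention_weights(query, keys):
    """Return one weight per context position (softmax of scaled dot products)."""
    d = len(query)
    # Similarity between the current position's query and each position's key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-dimensional keys for three context positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]

weights = attention_weights(query, keys)
print(weights)  # positions 0 and 2 receive equal weight, larger than position 1
```

Nothing here is looked up in a table indexed by the joint state. The weights are recomputed from the vectors at every step, which is why the cost does not grow as V^k.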

Key Ideas

Controlled forgetting. The property is a deliberate choice to compress all relevant history into the present state. If the compression is lossless for prediction—if the present state genuinely captures everything that matters for the future—the forgetting is without cost. If the compression is lossy—if relevant information is discarded—the forgetting produces systematic errors: the chain will make predictions that would be corrected by information it has thrown away.
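
The difference between lossless and lossy compression can be seen in a toy case. The sketch below uses an invented sequence and a hypothetical helper `fit`, and estimates transition probabilities by simple counting. The sequence is built so that the symbol after "A" depends on the symbol before it.

```python
from collections import Counter, defaultdict

def fit(tokens, k):
    """Estimate P(next | last k tokens) by counting; assumes k >= 1."""
    counts = defaultdict(Counter)
    for i in range(k, len(tokens)):
        state = tuple(tokens[i - k:i])
        counts[state][tokens[i]] += 1
    model = {}
    for state, nexts in counts.items():
        total = sum(nexts.values())
        model[state] = {tok: n / total for tok, n in nexts.items()}
    return model

# Toy sequence: after "A" comes "B" or "C", depending on what preceded the "A".
sequence = list("ABACABACABAC")

first = fit(sequence, 1)   # state = last symbol only
second = fit(sequence, 2)  # state = last two symbols

print(first[("A",)])        # {'B': 0.5, 'C': 0.5}
print(second[("B", "A")])   # {'C': 1.0}
print(second[("C", "A")])   # {'B': 1.0}
```

For this sequence the one-symbol state is a lossy summary: the chain can do no better than a coin flip after "A". The two-symbol state is lossless, and the prediction becomes certain.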

The exponential cost of memory. The most direct fix for the property’s amnesia, enlarging the state to include more history, pays an exponential price. A vocabulary of V words yields V possible one-word states, V² possible two-word states, and V^k possible k-word states. Each state needs its own distribution over the V possible next words, so the number of parameters required to specify the transition distribution over k-word states is on the order of V^(k+1), exponential in k. This exponential wall, which blocked n-gram models from extending their memory beyond a handful of words, is what the transformer’s attention mechanism circumvents, by computing rather than storing the relevant history at each step.
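
The arithmetic is easy to check. The sketch below assumes a vocabulary of 50,000 words, a figure chosen only for illustration, and counts a full, unsmoothed transition table with one entry per (state, next word) pair. The helper name `transition_table_size` is hypothetical.

```python
def transition_table_size(vocab_size, k):
    """Return (number of k-word states, number of transition table entries)."""
    states = vocab_size ** k       # V^k possible k-word states
    entries = states * vocab_size  # one probability per (state, next word) pair
    return states, entries

VOCAB_SIZE = 50_000  # assumed vocabulary size, for illustration only

for k in (1, 2, 3, 5):
    states, entries = transition_table_size(VOCAB_SIZE, k)
    print(f"k={k}: {states:.3e} states, {entries:.3e} table entries")
```

Each extra word of memory multiplies the table by V, so even modest values of k require far more entries than any training corpus could populate.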

The window’s edge. Even the largest transformers have a finite context window. Beyond its left edge, the Markov property’s amnesia applies in full: the model has no access to what was said or established before the window began. As context windows have grown from hundreds to millions of tokens, the amnesia line has moved. But moving a line is not eliminating it. The dependencies that matter most are often the ones that reach farthest: the thesis established in the introduction of a long document, the promise made in an early session, the character revealed in a first conversation. Any of these can fall outside a finite window, however large. The Markov property is not defeated by a large context window; it is merely pushed to a more distant horizon.
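
The effect of the left edge can be sketched in a few lines. The example assumes the simplest possible policy, keeping only the most recent tokens, and uses an invented token list and a hypothetical helper `visible_context`. Production systems may truncate, summarise, or retrieve in more elaborate ways.

```python
def visible_context(tokens, window):
    """Return the most recent `window` tokens; everything earlier is dropped."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return tokens[-window:]

# Toy conversation: the name is stated early and asked about later.
conversation = ["my", "name", "is", "Ada", ".", "what", "is", "my", "name", "?"]

wide = visible_context(conversation, 10)
narrow = visible_context(conversation, 5)

print("Ada" in wide)    # True: the name is still inside the window
print("Ada" in narrow)  # False: the name lies beyond the left edge
```

Enlarging `window` moves the point at which the name is lost. It does not remove it: for any fixed window, a long enough conversation pushes the name out of view.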

Debates & Critiques

The live debate concerns whether the transformer’s attention mechanism represents a genuine departure from the Markov property or its most sophisticated instantiation. The argument that it departs: attention allows every position to directly consult every other position in the context window, with no restriction to a fixed preceding window; in this sense the model has access to the full context, not just a local neighbourhood. The argument that it does not depart: the context window is still finite; the model processes the window as its “present state” and has no access to anything outside it; what attention does is compute a richer encoding of that bounded present, not extend the present to infinity. The practical implications turn on which view is correct. If the transformer genuinely escapes the property, then the remaining failures attributable to context-length limits are engineering problems awaiting bigger windows. If the transformer is the property’s most elaborate implementation, then those failures are expressions of a structural feature that will move with the window but never be abolished. Markov’s lesson is that the honest question is mathematical, and the mathematical answer currently favours the second view.
