The cycle that began with [YOU] on AI documents, in its accounts of human-AI collaboration, the specific ways in which these systems are powerful and the specific ways in which they fail. The Markov property is the structural root of the most characteristic class of failure: the model that holds a conversation beautifully within a session but loses the thread across sessions, that maintains coherence within its context window and loses it at the boundary, that cannot track a commitment made before the window’s edge or honour a reference established before the point at which its view begins. These are not engineering failures awaiting better design. They are expressions of the property’s amnesia at the scale of the window’s limit.
The property also enters the cycle as the structural explanation for why these systems cannot, without external scaffolding, develop the kind of accumulated relationship with a person that characterises meaningful collaboration. A human colleague remembers the conversation of six months ago; a Markov process knows only its context window. The difference is not merely quantitative. It is the difference between a relationship, which is constituted by accumulated shared history, and a service, which is constituted by the quality of the present interaction. Every mechanism for extending AI memory—retrieval-augmented generation, explicit memory stores, session summaries fed back into the context—is an attempt to restore the reach of the past that the property discards.
Markov introduced the property as the defining condition of his chains in his 1906 paper. Formally, a stochastic process has the Markov property if the conditional probability distribution of future states given the present state alone is the same as the conditional distribution given the present state together with all past states. In notation, for a discrete-time process X_0, X_1, X_2, ...: P(X_{n+1} | X_n, X_{n-1}, ..., X_0) = P(X_{n+1} | X_n) for every n. Put informally, the present state screens off the past from the future. The property was designed to be restrictive (Markov knew it did not hold for all processes) but to hold for a broad and practically important class, and to enable exact theorems wherever it held.
Higher-order Markov processes, in which the current state is defined as a window of the last k observations rather than the single last observation, allow the property to hold while encoding more of the recent past. This is precisely the move from bigrams to n-grams: increase k, and the model can honour dependencies reaching further back, at the cost of an exponential explosion in the number of possible states. The transformer architecture escapes this explosion by replacing the enumeration of joint states with a learned function—the attention mechanism—that computes, on the fly, how much each position in the context contributes to the prediction at the current position.
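The move from a single observation to a window of k observations can be made concrete with a minimal sketch. The function names and the toy token sequence below are illustrative assumptions, not anything drawn from a particular library: the state is simply the tuple of the last k tokens, and the model is a table of what followed each such tuple.

```python
from collections import Counter, defaultdict

def fit_order_k(tokens, k):
    """Count next-token frequencies for each window of k consecutive tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - k):
        state = tuple(tokens[i:i + k])  # the "present state" is the last k tokens
        counts[state][tokens[i + k]] += 1
    return counts

def next_token_options(counts, history, k):
    """Next tokens observed after the last k tokens of history, with counts."""
    return dict(counts.get(tuple(history[-k:]), {}))

# Toy sequence (hypothetical): what follows "b" depends on what came two steps earlier.
tokens = "a b c a b d a b c a b d".split()
order1 = fit_order_k(tokens, 1)
order3 = fit_order_k(tokens, 3)

print(next_token_options(order1, ["c", "a", "b"], 1))  # state is ("b",)
print(next_token_options(order3, ["c", "a", "b"], 3))  # state is ("c", "a", "b")
```

With k = 1 the state "b" has been followed by both "c" and "d", so the chain cannot choose between them; with k = 3 the state ("c", "a", "b") has only ever been followed by "d". The price is that the table is now keyed by every distinct triple rather than every distinct token.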
Controlled forgetting. The property is a deliberate choice to compress all relevant history into the present state. If the compression is lossless for prediction—if the present state genuinely captures everything that matters for the future—the forgetting is without cost. If the compression is lossy—if relevant information is discarded—the forgetting produces systematic errors: the chain will make predictions that would be corrected by information it has thrown away.
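The distinction between lossless and lossy compression can be illustrated on a toy process, offered here as a sketch under stated assumptions rather than a general result. In the hypothetical sequence below each bit is the XOR of the two bits before it, so a state holding two bits is lossless for prediction and a state holding one bit is not. Accuracy is measured in-sample, on the same sequence used to fit the counts, which is enough to show the effect.

```python
from collections import Counter, defaultdict

def in_sample_accuracy(seq, k):
    """Fit an order-k chain on seq, then score its best guess at every step."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[tuple(seq[i:i + k])][seq[i + k]] += 1
    hits = 0
    for i in range(len(seq) - k):
        guess = counts[tuple(seq[i:i + k])].most_common(1)[0][0]
        hits += guess == seq[i + k]
    return hits / (len(seq) - k)

# Toy process (an assumption for illustration): each bit is the XOR
# of the two bits before it, starting from 0, 1.
seq = [0, 1]
for _ in range(298):
    seq.append(seq[-1] ^ seq[-2])

acc1 = in_sample_accuracy(seq, 1)  # state = last bit only (lossy here)
acc2 = in_sample_accuracy(seq, 2)  # state = last two bits (lossless here)
print(f"order 1: {acc1:.3f}, order 2: {acc2:.3f}")
```

The two-bit chain predicts every step correctly, while the one-bit chain is right only about two thirds of the time. Its errors are systematic in exactly the sense described above: each one would be corrected by the bit it has thrown away.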
The exponential cost of memory. The most direct fix for the property’s amnesia, enlarging the state to include more history, pays an exponential price. A vocabulary of V words yields V possible one-word states, V² possible two-word states, and V^k possible k-word states. Each state needs its own distribution over the V possible next words, so a full transition table over k-word states has V^k(V − 1) free parameters, a number that grows exponentially in k. This exponential wall, which blocked n-gram models from extending their memory beyond a handful of words, is what the transformer’s attention mechanism was designed to circumvent, by computing, rather than storing, the relevant history at each step.
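The arithmetic of the wall is easy to check directly. The sketch below assumes a full, unsmoothed transition table, in which every k-word state carries its own distribution over the next word; the vocabulary size of 50,000 is a hypothetical figure chosen only for illustration.

```python
def ngram_table_size(vocab_size, k):
    """States and free parameters of a full order-k transition table."""
    states = vocab_size ** k                 # V^k possible k-word states
    free_params = states * (vocab_size - 1)  # each row is a distribution over V words
    return states, free_params

V = 50_000  # hypothetical vocabulary size, chosen only for illustration
for k in (1, 2, 3, 5):
    states, params = ngram_table_size(V, k)
    print(f"k={k}: states={states:.2e}, free parameters={params:.2e}")
```

Each additional word of memory multiplies the table by a further factor of V, so under this assumption even a five-word state is far beyond anything that could be stored or estimated from data.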
The window’s edge. Even the largest transformers have a finite context window. Beyond its left edge, the Markov property’s amnesia applies in full: the model has no access to what was said or established before the window began. As context windows have grown from hundreds to millions of tokens, the amnesia line has moved. But moving a line is not eliminating it. The dependencies that matter most are often the ones that reach farthest: the thesis established in the introduction of a long document, the promise made in an early session, the character revealed in a first conversation. Given enough intervening text, any of these eventually falls outside a finite window, however large. The Markov property is not defeated by a large context window; it is merely pushed to a more distant horizon.
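A deliberately simplified sketch shows the line moving without disappearing. It treats the context window as nothing more than the most recent tokens of a list; the function name, the marker token and the window sizes are all hypothetical, and real systems tokenise and truncate in more elaborate ways.

```python
def visible_context(tokens, window):
    """Return the most recent `window` tokens; earlier ones are out of reach."""
    return tokens[-window:] if window > 0 else []

# A commitment made early (hypothetical marker token), followed by later talk.
conversation = ["PROMISE:", "review", "draft", "Friday"] + ["chat"] * 20

for window in (8, 16, 24):
    seen = visible_context(conversation, window)
    print(window, "PROMISE:" in seen)

# One more token of conversation pushes the promise past the largest window too.
longer = conversation + ["chat"]
print(24, "PROMISE:" in visible_context(longer, 24))
```

The promise is visible only while the window happens to cover the whole history. Enlarging the window postpones the moment at which it drops out of view, but a single further token of conversation is enough to bring that moment back.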