CONCEPT

Understanding Is Compression

Chaitin’s founding equation of algorithmic information theory: to comprehend a phenomenon is to find a description shorter than the phenomenon itself, a principle that turns out to be, almost word for word, the objective function of every large language model.

The single most useful idea Gregory Chaitin contributed to the study of minds and machines is also the simplest to state: understanding is compression. To comprehend a body of data is to find a theory (a law, a model, a program) that is shorter than the data and from which the data can be regenerated. Newton did not memorize the positions of the planets; he found a handful of equations from which all those positions follow. The equations are vastly shorter than the data they predict, and the compression is the content of the understanding. Chaitin made this precise in algorithmic information theory, the framework he developed independently of, and in the same decade as, Andrei Kolmogorov and Ray Solomonoff: the complexity of a thing is the length of its shortest program, and to understand the thing is to possess that program. A phenomenon that cannot be compressed at all, whose shortest description is the phenomenon itself, is incompressible and therefore, in principle, beyond understanding, because there is nothing to understand: no pattern, no regularity, no law. Incompressibility in this sense is a precisely defined mathematical property rather than a loose philosophical notion, and it has an immediate and unforgiving corollary for large language models: a training run that minimizes prediction error is, by a precise mathematical equivalence between prediction and compression, in effect searching for a compact description of the statistical regularities in its data. The machines that now unsettle us are, at the level of their objective function, compression engines. Chaitin built the mathematics of what compression can and cannot reach, and the two results are inseparable, because the same yardstick that measures understanding also marks the place where understanding ends.

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI uses Chaitin’s equation as its most rigorous tool for calibrating what to trust in large language models and where to hold back. The reliability of a model’s output is, on this account, a function of how compressible the relevant territory is: how much genuine regularity exists for the model to have found. On well-trodden, heavily evidenced, pattern-rich ground, the compression is deep and the output is trustworthy. On thin, novel, idiosyncratic, or genuinely random ground (the edges of human knowledge, the unprecedented case, the question whose answer was never in the training data) there is little to compress, and the model extrapolates a regularity that does not exist. There the output should be trusted only to the degree that the territory is regular, which is to say, not much. Chaitin’s framework converts the vague intuition that these systems are “good at some things and bad at others” into a principled criterion.

The equation also bears on the question of whether these systems understand anything at all. The skeptic who insists they merely manipulate symbols without understanding owes an account of what understanding could be, over and above the discovery of the shortest description. If comprehension just is compression—and the argument from algorithmic information theory is rigorous—then a system that compresses the regularities of human knowledge has thereby comprehended those regularities, in the only sense the mathematics of understanding can give the word. The uncomfortable possibility Chaitin’s work raises is not that the machines fail to understand but that understanding was always a more mechanical, more information-theoretic thing than we wanted it to be.

Origin

The equation emerges from algorithmic information theory, the field that Gregory Chaitin, Andrei Kolmogorov, and Ray Solomonoff founded independently of one another in the 1960s. Its founding idea is a definition of randomness by reference to description length. A string of bits is random if the shortest program that generates it is no shorter than the string itself; a string has pattern, and is compressible, to the extent that a shorter program can be found. The complexity of a thing is the length of its shortest program, and since compression is understanding, complexity is a measure of how far the thing is from being understood.
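The definition above can be illustrated with a rough sketch. True shortest-program length cannot be computed, so the example below substitutes an off-the-shelf compressor (Python's standard `zlib`) as a crude upper bound on description length. The helper name `compressed_bits` and the two sample inputs are illustrative assumptions, not part of Chaitin's formalism.

```python
import os
import zlib


def compressed_bits(data: bytes) -> int:
    """Bits in the zlib-compressed form of `data`.

    This is only an upper bound on description length: a general-purpose
    compressor can miss patterns that a cleverer program would exploit.
    """
    return 8 * len(zlib.compress(data, 9))


# 10,000 bytes produced by a very short rule: "repeat '01' 5,000 times".
patterned = b"01" * 5000

# 10,000 bytes from the operating system's entropy source (assumed to be
# effectively patternless for the purposes of this illustration).
random_like = os.urandom(10000)

ratio_patterned = compressed_bits(patterned) / (8 * len(patterned))
ratio_random = compressed_bits(random_like) / (8 * len(random_like))

print(f"patterned:   compressed to {ratio_patterned:.3%} of original size")
print(f"random-like: compressed to {ratio_random:.3%} of original size")
```

The patterned input shrinks to a small fraction of its size, while the random-like input does not shrink at all (the compressed form is slightly larger because of format overhead). A compressor failing to shrink something does not prove it is random; it shows only that this particular compressor found no pattern.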

The mathematical equivalence between prediction and compression was established independently and is now a standard result: a model that assigns accurate probabilities to sequences is implicitly a model that compresses those sequences well, and vice versa. With an ideal entropy coder such as arithmetic coding, the compressed length in bits per symbol equals the model’s cross-entropy loss (measured with base-2 logarithms), up to a small coding overhead. This means that when a language model is trained to minimize its prediction error over a training corpus, it is, in the most precise sense available, being trained to encode that corpus in as few bits as possible by capturing its statistical structure. The objective function of modern deep learning is Chaitin’s objective function, operationalized at scale.
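A minimal sketch of this equivalence, under stated assumptions: each symbol is charged its ideal code length of -log2 p(symbol | history) bits, which is what an arithmetic coder driven by the same probabilities would approach. The function `code_length_bits`, the byte-level alphabet, and the Laplace-smoothed frequency predictor are illustrative choices, far simpler than any real language model.

```python
import math
from collections import Counter


def code_length_bits(text: str, adaptive: bool = True,
                     alphabet_size: int = 256) -> float:
    """Ideal code length of `text`, in bits, under a sequential predictor.

    adaptive=False: every byte is predicted with probability 1/256.
    adaptive=True:  bytes are predicted from the frequencies seen so far
                    (Laplace-smoothed), a toy stand-in for a learned model.
    """
    data = text.encode("utf-8")
    counts = Counter()
    total = 0.0
    for seen, byte in enumerate(data):
        if adaptive:
            # `seen` equals the sum of all counts so far, so these
            # probabilities sum to 1 over the 256 possible bytes.
            p = (counts[byte] + 1) / (seen + alphabet_size)
        else:
            p = 1 / alphabet_size
        total += -math.log2(p)  # cost of this symbol in bits
        counts[byte] += 1
    return total


sample = "abracadabra " * 200
n_symbols = len(sample.encode("utf-8"))

uniform_bits = code_length_bits(sample, adaptive=False)
adaptive_bits = code_length_bits(sample, adaptive=True)

# Total bits divided by symbol count is the cross-entropy in bits per symbol.
print(f"uniform predictor:  {uniform_bits / n_symbols:.2f} bits/symbol")
print(f"adaptive predictor: {adaptive_bits / n_symbols:.2f} bits/symbol")
```

The predictor that assigns higher probability to what actually comes next yields the shorter encoding. Lowering the average of -log2 p is the same operation whether it is called reducing the loss or improving the compression.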

Key Ideas

Complexity as description length. The complexity of a thing is the length of its shortest description, that is, the smallest program that generates it. This is both a measure of how hard the thing is to understand and a measure of how much genuine structure it contains. A simple thing is one for which a short description exists; a complex thing is one that cannot be compressed much. Most things are complex in this sense: there are far fewer short programs than there are strings of any given length, so only a small fraction of strings can have a short description.
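The counting argument behind the last claim can be written out as follows. This is a sketch that assumes binary programs run on a fixed universal machine, with K(x) denoting the length of the shortest program that outputs x; the symbols n and k are introduced here for illustration.

```latex
% There are fewer than 2^{n-k} programs shorter than n-k bits:
\[
  \#\{\, p : |p| < n-k \,\}
  \;=\; \sum_{i=0}^{n-k-1} 2^{i}
  \;=\; 2^{\,n-k} - 1
  \;<\; 2^{\,n-k}.
\]
% Each program outputs at most one string, so the fraction of n-bit
% strings that can be compressed by more than k bits is below 2^{-k}:
\[
  \frac{\#\{\, x \in \{0,1\}^{n} : K(x) < n-k \,\}}{2^{n}}
  \;<\; \frac{2^{\,n-k}}{2^{n}}
  \;=\; 2^{-k}.
\]
```

For example, with k = 10, fewer than one string in a thousand of any given length can be shortened by more than ten bits.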

The corollary: incompressible = unintelligible. A phenomenon whose shortest description is the phenomenon itself carries no pattern, no regularity, no law. No theory shorter than the data exists to be found. It is not that we are not clever enough; it is that there is nothing to understand. This is the hard ceiling on any compression engine, including the neural networks that power contemporary AI.

Confidence calibration by compressibility. The reliability of any compression engine’s output in a given region tracks how compressible that region is, that is, how much genuine regularity exists there. High regularity, high reliability; low regularity, confabulation. Crucially, the engine cannot in general know which region it is in, because description-length complexity is uncomputable: no algorithm can determine, for arbitrary data, the length of its shortest description. The machine cannot tell, from inside, whether it is extrapolating a real pattern or generating plausible noise.

The information conservation law. Understanding is compression and compression conserves information: a system with L bits of information in its weights plus inputs cannot produce output containing substantially more than L bits of genuine information. Apparent information in a model’s output can always exceed real information, since fluency is unconstrained by information content. This is the precise form of the limit Chaitin’s framework sets on what any AI system can derive.
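One standard way to state this bound formally is sketched below. It assumes that K denotes prefix-free Kolmogorov complexity, that the system computes a deterministic computable function f (the weights and inference procedure), and that x is its input; these symbols are introduced here for illustration and do not appear in the text above.

```latex
% A program for f(x) can be built from a program for f and a program for x,
% plus a constant amount of glue code:
\[
  K\bigl(f(x)\bigr) \;\le\; K(f) + K(x) + O(1).
\]
% So if weights and inputs together carry at most L bits,
\[
  K(f) + K(x) \;\le\; L
  \quad\Longrightarrow\quad
  K\bigl(f(x)\bigr) \;\le\; L + O(1).
\]
```

If the system samples at random, the random bits can be treated as part of the input x. They lengthen the description of the output without adding what the paragraph above calls genuine information.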

The dignity of statistical knowledge. Chaitin’s equation also vindicates a middle category between empty pattern-matching and full comprehension. Just as Gregor Mendel possessed genuine, predictive, lawful knowledge of inheritance while remaining entirely ignorant of the mechanism, a language model may possess genuine structural knowledge of language—real, predictive, compressible regularities—while remaining ignorant of what the language means. Statistics without mechanism can still be real knowledge of real structure.
