CONCEPT

Minimum Description Length

Jorma Rissanen’s operationalisation of Kolmogorov complexity into a computable model-selection criterion: the best model of a dataset is the one that minimises the total number of bits needed to describe the model plus the data encoded with the model’s help—a principle that dissolves overfitting at its root and identifies the right amount of complexity as the amount that minimises total description length.

The minimum description length principle is the practical machine you can build from Kolmogorov’s uncomputable ideal. Kolmogorov complexity defines the best possible model as the one producing the shortest program for the data, but that shortest program cannot in general be found. Jorma Rissanen, working in the 1970s from Kolmogorov’s and Solomonoff’s foundations, replaced the uncomputable ideal with a computable proxy. Instead of the shortest program, use two codes: one for the model itself, one for the data given the model. Minimise the sum. The result is a criterion that simultaneously penalises model complexity (you pay bits to describe your hypothesis) and model error (you pay bits to correct the model’s remaining mistakes). A model that is too simple cannot describe the data compactly: high error cost. A model that is too complex can describe the data exactly but at exorbitant cost: high description cost for the elaborate hypothesis. The minimum of the combined cost is the model that has captured the real structure and stopped there: it has compressed the signal and refused to compress the noise. Overfitting and underfitting are revealed as the two slopes of a single valley; minimum description length finds the floor. The principle is not merely an elegant re-framing of model selection. It is a formal argument that the right amount of complexity is determined by the data, not guessed by the practitioner, and that this amount is the point at which compression and generalisation coincide, which is to say, the point at which learning has actually occurred.
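In symbols, the two-part criterion described above is commonly written as follows. The notation is introduced here for illustration and is not taken from the entry itself: \mathcal{H} is the set of candidate models, L(H) is the number of bits needed to describe model H, and L(D \mid H) is the number of bits needed to describe the data D with the help of H.

```latex
\hat{H} \;=\; \arg\min_{H \in \mathcal{H}} \bigl[\, L(H) + L(D \mid H) \,\bigr]
```

When H is a probabilistic model, the second term is usually taken to be the Shannon code length, L(D \mid H) = -\log_2 P(D \mid H), up to rounding.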

In the [YOU] on AI Field Guide

The cycle that began with [YOU] on AI asks what it means for a model to have genuinely learned, rather than merely memorised. Minimum description length gives the sharpest available answer: a model has learned when it compresses its training data, when its description plus the data it explains is shorter than the data alone. A model has merely memorised when its description is as long as the data it stores. Compression and generalisation are, under this principle, the same achievement measured two ways: a model that has compressed the data has captured its regularities, and regularities are exactly what carry over to unseen data drawn from the same source.
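The claim that structure is what makes data compressible can be checked with any off-the-shelf compressor. The sketch below is only an illustration, not an MDL procedure: it uses Python's standard zlib compressor as a stand-in for "a model", and the two 1000-byte inputs and the helper name compressed_size are assumptions made for the example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


# Highly regular data: one short pattern repeated 500 times (1000 bytes).
structured = b"ab" * 500

# Patternless data: 1000 pseudo-random bytes from a fixed seed.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(1000))

structured_size = compressed_size(structured)
noise_size = compressed_size(noise)

print(f"structured: {len(structured)} -> {structured_size} bytes")
print(f"noise:      {len(noise)} -> {noise_size} bytes")
```

The regular input shrinks to a small fraction of its size because the compressor has found its one regularity; the pseudo-random input barely shrinks at all, because there is nothing to find and the only option left is to store it.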

The principle also enters the cycle as a precise account of why the discipline of compression, of being unable to store everything and therefore forced to understand something, is the source of generalisation. A model with unlimited capacity, free to store its training data verbatim, has no reason to find structure; it can simply memorise, and memorisation is the failure of compression. The bottleneck is the teacher. It is the constraint of limited description length that drives the discovery of pattern, because pattern is the only way to fit a large world through a small channel. This inverts the naive intuition that more capacity is always better: what MDL predicts, and what empirical machine learning has borne out in ways designers did not always anticipate, is that being forced to find a shorter description is what produces the structure that generalises.

Origin

Jorma Rissanen introduced the principle in his 1978 paper “Modeling by shortest data description,” drawing on Kolmogorov complexity and Solomonoff’s theory of inductive inference. The central idea, that the best model minimises the two-part code of model plus data-given-model, had been developed independently and earlier in a related form by Wallace and Boulton (minimum message length, 1968), and the information criteria of Akaike and Schwarz penalise model complexity in a related spirit. The family of related approaches shares a core intuition: simpler hypotheses should be preferred. MDL and minimum message length make that intuition explicit by measuring simplicity as description length.

Figure: MDL: The Valley Between Overfit and Underfit

The relationship to modern machine learning is most direct through the close correspondence between MDL and the Bayesian marginal likelihood, and through the minimum description length interpretation of training objectives. When a language model minimises its training loss, the average number of bits needed to predict each next token, it is, on this reading, performing an approximate form of MDL inference: searching for the model under which the data has the shortest description.
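The reading of a prediction loss as a code length can be made concrete with a toy predictor. The sketch below is an illustration under stated assumptions, not a description of how any language model is trained: a simple count-based predictor with add-one smoothing stands in for the model, each symbol costs -log2 of the probability the predictor assigned to it, and the function name prequential_bits and the two test strings are hypothetical.

```python
import math


def prequential_bits(sequence: str, alphabet: str) -> float:
    """Total code length, in bits, of `sequence` under an adaptive predictor.

    Each symbol is predicted from the counts of the symbols seen so far
    (add-one smoothing) and costs -log2(predicted probability) bits.
    """
    counts = {symbol: 1 for symbol in alphabet}  # add-one prior counts
    total_bits = 0.0
    for symbol in sequence:
        probability = counts[symbol] / sum(counts.values())
        total_bits += -math.log2(probability)
        counts[symbol] += 1  # learn from the symbol only after paying for it
    return total_bits


skewed = "a" * 90 + "b" * 10  # mostly one symbol: easy to predict
balanced = "ab" * 50          # equal frequencies: counts alone cannot exploit the alternation

uniform_bits = 100 * math.log2(2)  # a fixed 50/50 model pays 1 bit per symbol

skewed_bits = prequential_bits(skewed, "ab")
balanced_bits = prequential_bits(balanced, "ab")

print(f"uniform model:   {uniform_bits:.1f} bits")
print(f"skewed string:   {skewed_bits:.1f} bits")
print(f"balanced string: {balanced_bits:.1f} bits")
```

Dividing each total by the sequence length gives an average loss in bits per symbol, which is the quantity the paragraph above describes. The balanced string is perfectly regular, yet this predictor cannot shorten it, because symbol counts alone are blind to alternation.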

Key Ideas

Two-part code. The model is described in one code; the data is encoded using the model in a second code. Minimise the sum. A model too simple leaves high residual error—many bits to correct its mistakes. A model too complex costs many bits to describe the hypothesis. The minimum of the sum is the right model: the one that has compressed the signal without compressing the noise.

Overfitting dissolved at the root. The MDL framework reveals overfitting and underfitting not as separate problems requiring separate fixes but as the two slopes of a single valley, with the correct model at the bottom. Overfitting is buying a low data-encoding cost with an exorbitant hypothesis-description cost; underfitting is buying a cheap hypothesis with an exorbitant data-encoding cost. Either way the sum is large. The right model minimises the sum, and the sum is minimised at the point where real structure is captured and random variation is refused.

Compression is generalisation. A model that has minimised its MDL cost has compressed the training data, and compression implies the capture of regularities that hold beyond the training sample. The bits saved are the bits of real structure. Real structure is what the test set shares with the training set, because both are drawn from the same source. The equivalence of compression and generalisation is not a metaphor but a mathematical relationship, visible most clearly in the MDL framework.

The code is a bias. MDL in practice uses a chosen coding scheme rather than the uncomputable Kolmogorov ideal, and the choice of code is an inductive bias. Different codes define different notions of simplicity, and the code that a neural network’s architecture implicitly employs—its built-in vocabulary of short descriptions—determines what it can learn easily and what it cannot learn at all within any feasible training budget. The search for better architectures is, under the MDL interpretation, the search for codes in which the truths we care about have shorter descriptions.
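The two-part code and the valley between underfitting and overfitting can be sketched numerically. The example below is a minimal illustration under stated assumptions, not Rissanen's own formulation: the data is a synthetic binary sequence from a first-order Markov chain, the candidate models are Markov models of order 0 to 5, and each model parameter is charged (1/2) log2 n bits, a common crude convention. The function names data_bits and model_bits are hypothetical.

```python
import math
import random
from collections import Counter


def data_bits(sequence: str, order: int) -> float:
    """Bits to encode `sequence` with a Markov model of the given order,
    with probabilities fitted to the sequence itself (maximum likelihood)."""
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        context_counts[context] += 1
        transition_counts[context, sequence[i]] += 1
    bits = float(order)  # the first `order` symbols are sent raw, 1 bit each
    for (context, _symbol), count in transition_counts.items():
        bits += -count * math.log2(count / context_counts[context])
    return bits


def model_bits(sequence: str, order: int) -> float:
    """Bits to state the model: one probability per binary context,
    each charged (1/2) * log2(n) bits (an assumed, crude precision)."""
    return (2 ** order) * 0.5 * math.log2(len(sequence))


# Synthetic data: each symbol flips the previous one with probability 0.9.
rng = random.Random(0)
symbols = ["0"]
for _ in range(1999):
    previous = symbols[-1]
    flipped = "1" if previous == "0" else "0"
    symbols.append(flipped if rng.random() < 0.9 else previous)
sequence = "".join(symbols)

totals = {
    order: model_bits(sequence, order) + data_bits(sequence, order)
    for order in range(6)
}
best_order = min(totals, key=totals.get)

for order, total in totals.items():
    print(f"order {order}: {total:8.1f} bits")
print("selected order:", best_order)
```

Order 0 is cheap to describe but ignores the dependence between neighbouring symbols, so its data cost is close to the raw 2000 bits. Orders above 1 pay for contexts that capture no additional structure, so their model cost grows faster than their data cost can fall. The choice of Markov models as the candidate family is itself a choice of code: a different family would define a different notion of simplicity.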

Debates & Critiques

The deepest debate about MDL in the context of AI concerns whether the compression interpretation of training is illuminating or whether it is a post-hoc theoretical gloss on an empirical practice that proceeds without it. Proponents argue that the compression view makes legible why certain architectural choices work, why regularisation helps, and why the generalisation gap is smaller for models that minimise their training loss on large, diverse datasets. The compression view predicts that the better the model compresses its training data, the better it generalises, a prediction that has found empirical support in several modalities. Sceptics argue that the identified quantities (training loss as description length, model complexity as code length) are only approximately the MDL quantities, and that the gap between the approximation and the ideal may matter enough to limit the theoretical account’s explanatory power. A further debate concerns whether the MDL perspective adds to or replaces the statistical learning theory perspective (VC dimension, PAC bounds): most practitioners would say the two are complementary, giving different angles on the same phenomenon of generalisation.

