
The cycle that began with [YOU] on AI asks what it means for a model to have genuinely learned, rather than merely memorised. Minimum description length gives the sharpest available answer: a model has learned when it compresses its training data, that is, when the model's description plus the data encoded with the model's help is shorter than the data encoded without it. A model has merely memorised when its description is as long as the data it stores. Compression and generalisation are, under this principle, the same achievement measured two ways: a model that has compressed the data has captured its regularities, and regularities are exactly what carry over to unseen data drawn from the same source.
The principle also enters the cycle as a precise account of why the discipline of compression, of being unable to store everything and therefore forced to understand something, is the source of generalisation. A model with unlimited capacity, free to store its training data verbatim, has no reason to find structure; it can simply memorise, and memorisation is the failure of compression. The bottleneck is the teacher. It is the constraint of limited description length that drives the discovery of pattern, because pattern is the only way to fit a large world through a small channel. This inverts the naive intuition that more capacity is always better. What MDL predicts, and what empirical machine learning has borne out in ways designers did not always anticipate, is that a model forced to find a shorter description is the one that finds the structure that generalises.
Jorma Rissanen introduced the principle in his 1978 paper “Modeling by shortest data description,” drawing on Kolmogorov complexity and Solomonoff’s theory of inductive inference. The central idea is that the best model minimises the two-part code of model plus data-given-model. Closely related ideas were developed independently: Wallace and Boulton’s minimum message length (1968), and the information criteria of Akaike and Schwarz, which likewise penalise model complexity, although Akaike’s criterion is derived from a predictive argument rather than from description length. The family of related approaches shares the core Kolmogorovian intuition: simpler hypotheses should be preferred, and simplicity is measurable as description length.
The relationship to modern machine learning is most direct through two links: the close correspondence between MDL and the Bayesian marginal likelihood, whose negative logarithm is itself a description length, and the description-length reading of training objectives. A language model’s training loss, taken in base-2 logarithms, is the average number of bits needed to encode each next token given the model’s prediction. Minimising that loss therefore minimises the data-given-model part of the code. On a widely used theoretical account this amounts to MDL inference, provided the cost of describing the model itself is also counted.
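The reading of prediction loss as code length can be made concrete with a minimal sketch. It assumes an ideal code that spends -log2 p bits on a symbol predicted with probability p, and compares a predictor that has learned nothing with one that adapts as it reads. The toy sequence, the two predictors, and all function names are hypothetical illustrations, not part of any particular library.

```python
import math
from collections import Counter

def code_length_bits(probabilities):
    """Total ideal code length in bits: the sum of -log2 p over the symbols."""
    return sum(-math.log2(p) for p in probabilities)

def uniform_predictions(sequence, alphabet):
    """A predictor that has learned nothing: every symbol is equally likely."""
    return [1.0 / len(alphabet) for _ in sequence]

def adaptive_predictions(sequence, alphabet):
    """Laplace-smoothed predictor that updates its counts after each symbol."""
    counts = Counter()
    predictions = []
    for position, symbol in enumerate(sequence):
        # Probability assigned to the symbol *before* it is seen.
        predictions.append((counts[symbol] + 1) / (position + len(alphabet)))
        counts[symbol] += 1
    return predictions

sequence = "a" * 28 + "b" * 4   # hypothetical toy data, heavily skewed towards "a"
alphabet = "ab"

uniform_bits = code_length_bits(uniform_predictions(sequence, alphabet))
adaptive_bits = code_length_bits(adaptive_predictions(sequence, alphabet))

print(f"uniform:  {uniform_bits / len(sequence):.3f} bits per symbol")
print(f"adaptive: {adaptive_bits / len(sequence):.3f} bits per symbol")
```

The uniform predictor pays exactly one bit per symbol, while the adaptive predictor pays less because it picks up the skew in the data. Because the adaptive predictor only ever uses symbols it has already seen, a decoder could reproduce its predictions, so its total is a genuine description length for the sequence.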
Two-part code. The model is described in one code; the data is encoded using the model in a second code. Minimise the sum. A model too simple leaves high residual error—many bits to correct its mistakes. A model too complex costs many bits to describe the hypothesis. The minimum of the sum is the right model: the one that has compressed the signal without compressing the noise.
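A minimal sketch of two-part model selection, under stated assumptions: the candidate models are binary Markov chains of increasing order, each parameter is charged (1/2) log2 n bits (a common asymptotic convention, assumed here rather than taken from the text), and the first few symbols that have no full context are sent raw at one bit each. The data is a hypothetical noise-free periodic string, chosen so the result is easy to check by hand.

```python
import math
from collections import Counter

def two_part_cost(bits, order):
    """Return (model_bits, data_bits) for a binary Markov model of the given order."""
    n = len(bits)
    # Count how often each context of length `order` is followed by each symbol.
    transitions = Counter((bits[i - order:i], bits[i]) for i in range(order, n))
    context_totals = Counter()
    for (context, _symbol), count in transitions.items():
        context_totals[context] += count

    # L(data | model): the first `order` symbols cost 1 bit each (assumption),
    # every later symbol costs -log2 of its maximum-likelihood probability.
    data_bits = float(order)
    for (context, _symbol), count in transitions.items():
        p = count / context_totals[context]
        data_bits += -count * math.log2(p)

    # L(model): one probability per possible context, at (1/2) log2 n bits each.
    model_bits = (2 ** order) * 0.5 * math.log2(n)
    return model_bits, data_bits

sequence = "001" * 40   # hypothetical toy data with period 3
costs = {k: two_part_cost(sequence, k) for k in range(5)}
best_order = min(costs, key=lambda k: sum(costs[k]))

for k, (model_bits, data_bits) in costs.items():
    total = model_bits + data_bits
    print(f"order {k}: model {model_bits:5.1f} + data {data_bits:5.1f} = {total:5.1f} bits")
print("selected order:", best_order)
```

Orders 0 and 1 are cheap to describe but leave many bits of residual error. Order 2 predicts every symbol from its two predecessors, so the data cost collapses. Orders 3 and 4 predict no better and only add parameters, so the total rises again. The minimum of the sum falls at order 2.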
Overfitting dissolved at the root. The MDL framework reveals overfitting and underfitting not as separate problems requiring separate fixes but as the two slopes of a single valley, with the correct model at the bottom. Overfitting is paying a low data-encoding cost by paying an exorbitant hypothesis-description cost; the sum is large. Underfitting is the reverse: a cheap hypothesis bought at the price of an exorbitant data-encoding cost, and the sum is again large. The right model minimises the sum, and the sum is minimised at the point where real structure is captured and random variation is refused.
Compression is generalisation. A model that has minimised its MDL cost has compressed the training data, and compression implies the capture of regularities that hold beyond the training sample. The bits saved are the bits of real structure. Real structure is what the test set shares with the training set, because both are drawn from the same source. The equivalence of compression and generalisation is not a metaphor but a mathematical relationship, visible most clearly in the MDL framework.
The code is a bias. MDL in practice uses a chosen coding scheme rather than the uncomputable Kolmogorov ideal, and the choice of code is an inductive bias. Different codes define different notions of simplicity, and the code that a neural network’s architecture implicitly employs—its built-in vocabulary of short descriptions—determines what it can learn easily and what it cannot learn at all within any feasible training budget. The search for better architectures is, under the MDL interpretation, the search for codes in which the truths we care about have shorter descriptions.
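The dependence of simplicity on the chosen code can be illustrated with a minimal sketch. It compares two hypothetical codes for binary strings: a literal code that spends one bit per symbol, and a run-length code that spends one bit on the symbol plus a fixed-width count for each run. Both codes, the 6-bit count width, and the two example strings are assumptions made for illustration.

```python
from itertools import groupby

def literal_code_bits(s):
    """Literal code: one bit per binary symbol, whatever the string looks like."""
    return len(s)

def run_length_code_bits(s, count_bits=6):
    """Run-length code: each run costs 1 bit (symbol) + count_bits (run length)."""
    max_run = 2 ** count_bits          # a count field encodes lengths 1..max_run
    total = 0
    for _symbol, run in groupby(s):
        length = sum(1 for _ in run)
        chunks = -(-length // max_run)  # ceiling division: split over-long runs
        total += chunks * (1 + count_bits)
    return total

constant = "0" * 64       # one long run
alternating = "01" * 32   # sixty-four runs of length one

for name, s in [("constant", constant), ("alternating", alternating)]:
    print(f"{name:12s} literal {literal_code_bits(s):4d} bits, "
          f"run-length {run_length_code_bits(s):4d} bits")
```

Under the run-length code the constant string is very simple and the alternating string is more expensive than sending it literally, even though a code built around short repeating periods would find both strings simple. The regularity is in the data in both cases; whether it yields a short description depends on the vocabulary the code provides.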