CONCEPT

Curve Fitting

Judea Pearl's classification, not insult—the claim that all of contemporary machine learning, however dazzling, lives on the first rung of the ladder, finding patterns in data while understanding nothing.

Judea Pearl famously dismissed the achievements of deep learning as "just curve fitting," and the phrase has been read as a sneer. It is not. For Pearl it is a classification. To fit a curve is to find a function that passes through your data points and generalizes to nearby ones: a magnificent thing to be able to do, which modern systems perform across spaces of staggering dimensionality. But a curve, however many dimensions it inhabits, is a creature of the first rung of the Ladder of Causation, the rung of association. It describes how observed variables move together; it says nothing about what moves what. You can fit a curve to the relationship between barometer readings and storms with exquisite accuracy and predict storms from barometers all day long, and you still cannot use it to decide whether smashing the barometer will prevent the storm, because that question is about mechanism, and the curve is about pattern. This is why Pearl holds that scaling makes a system better at the first rung and nothing else, and why the large language models that write and reason in fluent prose remain, in the strict sense he gives the word, on a rung below understanding. The wall they approach is built not of insufficient data or compute but of the difference between seeing and doing.
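The barometer case can be made concrete with a small simulation. What follows is a minimal sketch, not anything drawn from Pearl's own work: the data-generating process, the coefficients, and every variable name are assumptions chosen for illustration. Pressure drives both the barometer and the storm; a line fitted to barometer and storm predicts well, then gives the wrong answer once the barometer is set by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Assumed toy world: atmospheric pressure drives BOTH the barometer
# reading and the storm intensity. The barometer has no effect on the storm.
pressure = rng.normal(0.0, 1.0, n)
barometer = pressure + rng.normal(0.0, 0.1, n)
storm = -2.0 * pressure + rng.normal(0.0, 0.5, n)

# Seeing: fit a straight line that predicts storm from barometer.
slope, intercept = np.polyfit(barometer, storm, 1)

# Doing: force every barometer to read 3.0 (smash it, pin the needle).
# Pressure is untouched, so storms are generated exactly as before.
forced_reading = np.full(n, 3.0)
storm_after_intervention = -2.0 * pressure + rng.normal(0.0, 0.5, n)

# What the fitted curve claims will happen versus what actually happens.
predicted = slope * forced_reading + intercept

print(f"fitted slope: {slope:.2f}")
print(f"curve's predicted mean storm: {predicted.mean():.2f}")
print(f"actual mean storm after intervention: {storm_after_intervention.mean():.2f}")
```

Under these assumptions the fitted slope should come out close to -2, so the curve expects a large change in storms when the reading is forced to 3.0, while the simulated storms stay centered near zero. Nothing in the fit itself signals which of the two numbers to believe.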

In the [YOU] on AI Field Guide

[YOU] on AI diagnoses a pattern that has puzzled many observers: the coexistence of superhuman fluency and absurd, basic error. The same system that drafts a competent legal brief will invent a case that does not exist and cite it with perfect confidence. "Curve fitting" is the concept that dissolves the paradox. To a first-rung machine, citing the real case and citing the fabricated one are the same act, the generation of text that fits the patterns of its training data, because telling them apart would require a model of what is real, and a curve-fitter has no model of what is real. It has only a model of what is typical.

The concept gives the cycle its most rigorous instrument for resisting the seduction of the surface—the same illusion ELIZA first exposed sixty years ago, when a crude pattern-matcher convinced users it understood them. Human beings have only ever encountered fluent causal language as the product of causal understanding, so we infer the understanding from the fluency—a heuristic that was reliable for our entire history right up until the moment a machine learned to produce the talk without the model. "Just curve fitting" is the standing rebuke to that inference, the reminder that fluency on the first rung is not understanding on the third, however similar they appear.

And it sharpens the cycle's account of the human difference. If the machines are curve-fitters—superb at pattern, blind to mechanism—then what they conspicuously lack is exactly what the cycle prizes: the capacity to ask why, to model a world that might have been otherwise, to imagine the road not taken. The curve-fitter is not a rival for this. It is a foil that shows us, by the difference, what our own intelligence is.

Origin

The phrase entered the AI conversation through a 2018 Quanta Magazine interview in which Pearl, asked about the deep-learning boom, said that all its impressive achievements amount to just curve fitting. He was not denying the achievements; he was locating them. A system that has mastered the first rung and mistaken it for the summit is, to Pearl, a system that does not know what it does not know—the most dangerous kind of ignorance there is.

The classification follows from his mathematics. Pearl's theorems establish that the information required to answer a rung-two question is, in the general case, simply not present in rung-one data, no matter how much of it you gather. You can know the joint distribution of every observable variable in the universe with perfect precision and still not know what would happen if you intervened on one of them, because intervention changes the distribution in ways the observed distribution cannot encode. Curve fitting, however vast, stays on the rung where it operates.
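A two-variable toy case shows why more observational data does not help. The sketch below is an illustration under stated assumptions rather than a rendering of any particular theorem: Model A and Model B, their coefficients, and the function names are all hypothetical. The two models produce the same joint distribution over X and Y, yet they disagree about what happens to Y when X is set by intervention.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
noise_sd = np.sqrt(0.75)  # chosen so both variables end up with unit variance


def sample_a(do_x=None):
    """Model A (hypothetical): X causes Y."""
    if do_x is None:
        x = rng.normal(0.0, 1.0, n)
    else:
        x = np.full(n, do_x)  # intervention: X is set from outside
    y = 0.5 * x + rng.normal(0.0, noise_sd, n)
    return x, y


def sample_b(do_x=None):
    """Model B (hypothetical): Y causes X."""
    y = rng.normal(0.0, 1.0, n)
    if do_x is None:
        x = 0.5 * y + rng.normal(0.0, noise_sd, n)
    else:
        x = np.full(n, do_x)  # intervention cuts the arrow from Y into X
    return x, y


# Observation: both models are zero-mean Gaussian with matching covariance,
# so no amount of observed data can tell them apart.
cov_a = np.cov(*sample_a())
cov_b = np.cov(*sample_b())

# Intervention: set X to 2.0 in each model and look at the mean of Y.
_, y_a = sample_a(do_x=2.0)
_, y_b = sample_b(do_x=2.0)
mean_y_a = y_a.mean()
mean_y_b = y_b.mean()

print("observed covariance, model A:\n", cov_a.round(2))
print("observed covariance, model B:\n", cov_b.round(2))
print(f"mean of Y under do(X=2), model A: {mean_y_a:.2f}")
print(f"mean of Y under do(X=2), model B: {mean_y_b:.2f}")
```

In this construction the observed covariances agree up to sampling noise, while the mean of Y under the intervention is about 1 in Model A and about 0 in Model B. The difference lives in the direction of the arrow, which the joint distribution does not record.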

The seductiveness of scale, in Pearl's analysis, is that it keeps paying off on that rung, and the payoffs are visible and dramatic, while the ceiling it approaches is invisible. There is no warning light that flashes when a system has extracted all the causal-seeming behavior that association can supply. The model just keeps getting better at the only thing it was ever doing, and observers keep extrapolating that the trend will carry it into territory the trend cannot reach. Pearl's contribution is to mark the ceiling explicitly.

Key Ideas

A curve describes pattern, not mechanism. Fitting a function to data captures how observed variables move together. It is silent on what produces what—and that silence is the entire limitation, because the questions we most care about (what to do, what would have happened, what works) are questions about mechanism.

Scaling climbs higher on one rung. A larger model detects subtler patterns, captures longer-range dependencies, and interpolates more gracefully between examples. These are real gains, confined to a single kind of operation. No quantity of this operation, performed however well, becomes an operation of a different kind. Scaling is not a staircase to understanding.

Brittleness is the signature, not a bug. A curve-fitter is a creature of the distribution it was trained on. When the world shifts, it has no representation of why its knowledge held, so it cannot know that the knowledge has expired—it fails confidently and without insight. This is the same brittleness under distribution shift that Gary Marcus documents from cognitive science; Pearl gives it a causal name.

Mimicry can pass for understanding when conditions are stable. A model that has seen enough cause-and-effect described in text can produce sentences that sound like causal reasoning—ask what happens if you drop a glass, and it says the glass breaks, not because it has a model of glass and gravity but because the words co-occur in that pattern. The mimicry is nearly perfect within the training distribution and fails at the edges, where there was never any understanding beneath it.

Necessary but not sufficient. Pearl is not dismissing the systems. He holds that rung-one mastery is necessary—no causal agent can function without a powerful capacity to detect pattern—and that current systems have achieved something genuinely valuable on the rung they occupy. His warning is about category error: we have built powerful pattern detectors, called them minds, and set ourselves up to trust them with questions they are constitutionally unequipped to answer.
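The brittleness point among the ideas above can be sketched in the most literal way, with an actual fitted curve. This is a minimal illustration under assumptions of our own: the sine function stands in for "the world," and the polynomial degree, noise level, and input ranges are arbitrary choices. The curve is fitted on one range of inputs and then queried on a shifted range it never saw.

```python
import numpy as np

rng = np.random.default_rng(2)


def world(x):
    # Assumed "true" process; the fitted curve never sees this formula.
    return np.sin(x)


# Training distribution: inputs between 0 and 3, with a little noise.
x_train = rng.uniform(0.0, 3.0, 200)
y_train = world(x_train) + rng.normal(0.0, 0.05, 200)

# The curve-fitter: a degree-5 polynomial fitted by least squares.
coeffs = np.polyfit(x_train, y_train, 5)


def rmse(x):
    """Root-mean-square error of the fitted curve against the true process."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - world(x)) ** 2)))


# Fresh inputs from the same range, and inputs from a shifted range.
x_same = rng.uniform(0.0, 3.0, 200)
x_shifted = rng.uniform(5.0, 8.0, 200)

rmse_same = rmse(x_same)
rmse_shifted = rmse(x_shifted)

print(f"error inside the training range: {rmse_same:.3f}")
print(f"error after the inputs shift:    {rmse_shifted:.3f}")
```

The polynomial returns a number for every shifted input with no indication that it has left the region where the fit was earned. The error can only be measured here because the simulation has access to the true process, which a deployed system does not.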
