How a Child Learns From Almost Nothing

Page 1 · How a Child Learns

EDO SEGAL: I want to open this one with a confession instead of a question, because the best questions I know come out of wounds. For the whole history of computing, using a machine meant translation. I started in Assembler — I was raised by the machine code — and every decade the translation got a little easier, but it never went to zero. You compressed your intention into the machine's grammar and paid a tax on every conversion. And then, recently, I stood in a room with twenty engineers and watched the tax go to zero, because the machine finally met them in their own language, mess and half-sentences and all. I wrote that this was the great inversion: we stopped learning to speak machine, and the machine learned to speak us. Noam, you think that sentence is the most consequential error in my book. So take it apart. Slowly. Start with the child, because I think that's where you'll say I went wrong.

CHOMSKY: I'll take it apart gently, because half of it is true and the true half matters. The interface changed. The cost of getting from your intention to a working artifact collapsed, and that is real and useful, and I have no quarrel with it. Where you went wrong is the word "learned." "The machine learned to speak us" smuggles in the claim that what it acquired is what we have when we have language. It isn't. Let me show you with the child, because the child is the proof.

· · ·

Page 2 · How a Child Learns

Take a simple fact. From "the man is tall," English forms a question: "is the man tall?" — you move the auxiliary to the front. Now, a learner equipped only with general pattern-matching, looking at examples, might form the rule: move the first "is" to the front. It fits every simple case. It is also completely wrong, and here is the remarkable thing — no child ever adopts it. Give a child "the man who is tall is in the room." The wrong rule, the linear rule, produces "is the man who tall is in the room?" — gibberish. No child produces it. Every child, with no instruction, unerringly moves the "is" of the main clause, because every child operates from the start on hierarchical structure, not on linear order. And the crucial point: the sentences that would distinguish the right rule from the wrong rule are vanishingly rare in what a child actually hears. The child almost never encounters the disambiguating case. Yet every child gets it right. The knowledge cannot have come from the data. It came from the structure of the mind.

That is the poverty of the stimulus, and it is the exact place your sentence breaks. The machine sees the disambiguating cases millions of times. It has the inverse of the child's problem. Where the child leaps across a gap in the evidence, the model fills the gap with sheer volume. So when you say it "learned to speak us," what you mean is that it found, in an astronomical corpus, the regularities a child reconstructs from a teaspoon of speech. Both arrive at fluent output. The routes are so different that the success of one is no evidence about the other. You weren't met in your language. A trillion words of everyone's language, including your own, were arranged to reflect you. The understanding in your Trivandrum room — the pointing, the wanting, the checking whether it was right — every gram of it was on your engineers' side of the glass.
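A minimal sketch of the two candidate rules from the question-formation example above, in Python. The function names and the hand-supplied split into subject, auxiliary, and predicate are illustrative assumptions, not anything a speaker proposes. The sketch does no parsing and makes no claim about how a child or a network represents either rule.

```python
# Toy illustration only: both function names are hypothetical, and the
# hierarchical analysis is supplied by hand rather than computed.

def linear_rule(words):
    """The wrong rule: move the first 'is' in the word string to the front."""
    i = words.index("is")
    return ["is"] + words[:i] + words[i + 1:]


def structural_rule(subject, aux, predicate):
    """The structure-dependent rule: front the auxiliary of the main clause.

    `subject` is the whole subject phrase, including any relative clause,
    so an 'is' buried inside it is never a candidate for movement.
    """
    return [aux] + subject + predicate


simple = "the man is tall".split()
embedded = "the man who is tall is in the room".split()

# On the simple sentence the two rules give the same question.
print(" ".join(linear_rule(simple)))
print(" ".join(structural_rule("the man".split(), "is", "tall".split())))

# On the embedded sentence they come apart.
print(" ".join(linear_rule(embedded)))
print(" ".join(structural_rule("the man who is tall".split(), "is",
                               "in the room".split())))
```

The two rules agree on the simple sentence and diverge only on the embedded one, which is the case Chomsky describes as nearly absent from what a child hears.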

EDO SEGAL: Ilya, he's just used the child as a weapon against my whole book. Take the wound from the other side.

· · ·

Page 3 · How a Child Learns

SUTSKEVER: I'll grant the example completely, because it's a beautiful example and it's true. The network also gets the question rule right, but Professor Chomsky will say, correctly, that it gets it from volume, not from poverty. So let me not fight the example. Let me fight the conclusion.

The conclusion is: different route, therefore no evidence about the other. I think that's too fast. Consider what the poverty of the stimulus actually establishes. It establishes that the human child needs strong priors — that you cannot get from that little data to that much competence without bringing structure to the problem. I agree. I have always agreed. The whole history of my field is the discovery that you cannot learn from data without the right priors; we just build ours into the architecture and the optimization instead of the genome. The transformer has priors. The training has priors. They are not the child's priors — they are different, and they are weaker in the sample-efficiency sense, which is exactly why the machine needs more data. But "the machine needs more data because its priors are weaker than the child's" is a completely different claim from "the machine tells us nothing." It tells us something precise and quantitative: it tells us how much prior the child must be supplying, by showing us what learning costs without it. Professor Chomsky says the machine measures the poverty of the child's stimulus. Yes! That is a contribution to his own science, made by my machines. We are closer than he wants to be.

CHOMSKY: We are closer than you think and further than you hope. I'll accept that the machine quantifies the child's poverty — that's fair, and I've said as much. But you've quietly equated two things that must be kept apart. The architecture's priors are a bias toward what works on data. My faculty is a constraint on what is humanly possible at all. A bias can be overcome by more data; that's what a bias is. A constraint cannot — that's what a constraint is. The child cannot be talked into an impossible language by any amount of exposure. Can your network?
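One hedged way to picture the bias/constraint distinction is a two-hypothesis Bayesian update. This framing is an assumption added for illustration, not something either speaker states: a "bias" is modeled as a small but nonzero prior on a hypothesis, and a "constraint" as a prior of exactly zero. The function name `posterior` and all the numbers are hypothetical.

```python
import math


def posterior(prior, likelihood_ratio, n):
    """Posterior probability of a hypothesis after n independent observations,
    each `likelihood_ratio` times more likely under it than under its rival.

    Assumed model: two hypotheses, Bayes' rule in log-odds form.
    """
    if prior <= 0.0:
        return 0.0  # zero prior: no amount of evidence can move it
    if prior >= 1.0:
        return 1.0
    log_odds = (math.log(prior) - math.log1p(-prior)
                + n * math.log(likelihood_ratio))
    return 1.0 / (1.0 + math.exp(-log_odds))


# A "bias" against the hypothesis (prior 1e-6) versus a "constraint" (prior 0),
# with every observation favoring the hypothesis two to one.
for n in (0, 10, 20, 40):
    soft = posterior(1e-6, 2.0, n)
    hard = posterior(0.0, 2.0, n)
    print(f"n={n:>2}  bias: {soft:.6f}  constraint: {hard:.6f}")
```

Under this reading, the soft prior is overturned after a few dozen observations, while the zero prior stays at zero however large `n` grows. Whether a network's inductive biases behave like the first case or the second is the empirical question the speakers defer to the next round.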

· · ·

Page 4 · How a Child Learns

SUTSKEVER: Honestly? Less easily than people think, and that's an empirical question we should look at directly, which I believe is the next round. But before we get there I want to volunteer something against my own side, because Professor Chomsky has been honest all night and I should match it. The data-efficiency gap he's pointing at is real and it is enormous, and it troubles me more than it troubles most of my field. A child learns a new word from one or two exposures and uses it correctly in contexts that share nothing surface-level with where she heard it. Our systems need to see a pattern thousands or millions of times, and even then the transfer is brittle. I've watched a model fix one bug in a program and reintroduce another it had fixed a minute earlier, cycling, never converging the way a person converges in seconds. The machine knows vastly more and generalizes much worse. So I'm not standing here claiming my systems learn the way children do. I'm claiming the opposite, and saying that the gap is the central unsolved problem of my field, and that closing it is what I left everything to work on.

EDO SEGAL: That's a striking concession, and I want to make sure the reader feels its weight. You're saying the thing that built your career — scale — has hit exactly the wall Noam has described for sixty years: that more data is not the same as the child's leap. Are you conceding the poverty of the stimulus?

· · ·

Page 5 · How a Child Learns

SUTSKEVER: I'm conceding the observation and contesting the conclusion. The observation — child learns from almost nothing, machine needs almost everything, and the gap is huge — yes, fully, that's just true, and anyone who's worked closely with these systems knows it. The conclusion Professor Chomsky draws — therefore the child has a dedicated faculty no machine could share — that I contest, because I think the missing thing might be a general principle of efficient learning we haven't found yet, something about how the brain builds value functions and learns continually through a whole life, not a language organ specifically. The brain is sample-efficient at everything, not just language. That suggests a general mechanism, not a special faculty. But I won't pretend I've found it. I've only found the shape of the hole.

CHOMSKY: Then let me say what surprises me, which is that we are not as far apart on this as thirty years of caricature would suggest. I have never insisted the faculty must be enormous — my later work tried to make it as small as possible, perhaps reducing to a single recursive operation. If you found a general principle of efficient learning that, applied to the right architecture, yielded the human constraints from the human poverty, I would not experience that as a defeat. I would experience it as the faculty finally being explained rather than merely posited. We might be digging the same tunnel from opposite ends. The difference is you think you'll come out in a general learner and I think you'll come out in something language-shaped, and we won't know who's right until the tunnels meet.

SUTSKEVER: I'd dig toward that. That's a better description of my actual research program than most of what's written about it.

· · ·

Page 6 · How a Child Learns

EDO SEGAL: But I want to mark what just happened, because it's a convergence and you both taught me to mark them. Noam and Ilya agree that learning requires strong priors, that the child's are stronger than the machine's, and now — this is new — that the data-efficiency gap is real, huge, and the central open problem. You disagree only about whether the machine's missing priors are a difference of kind or a difference of degree, a language organ or a general principle not yet found. That is a much narrower fight than "Skinner's corpse" versus "a new mind in the river." We've found the actual seam.

EDO SEGAL: Mark it. That's our first numbered convergence of the night: both of you hold that there is no learning without priors, and that the child brings more than the machine. The fight from here is whether the machine's missing priors are a wall or a hill — a constraint it can never climb, or a bias more data could erode. Hold that thread. Because the wall has a name, and it is the impossible language, and we walk straight into it after the break.
