The Ghost of Skinner

Page 1 · The Ghost of Skinner

**EDO SEGAL:** Noam, I want to start this round with the body you left on the floor in 1959, because it's gotten up and is sitting at this table. You reviewed B. F. Skinner's *Verbal Behavior*, and you ended behaviorism — the idea that language is a system of responses shaped by reinforcement, that the speaker is tuned by experience and experience explains the speaker. You showed the vocabulary was empty: call anything a "response controlled by a stimulus" and you've explained nothing. And now here is Ilya, who builds a system that learns language by statistical adjustment from experience, with no built-in grammar, and it *works*. So tell me plainly — and don't be polite about it — is the large language model the corpse of Skinner, reanimated and finally made to walk?

**CHOMSKY:** It is exactly that, and the irony is not lost on me. Skinner's claim was that you could account for verbal behavior through the statistical shaping of responses, without any internal structure — no grammar, no rules, nothing but the history of reinforcement. I argued this was hopeless as science, not because it couldn't be made to produce *some* behavior, but because it couldn't explain the central facts: productivity, the unboundedness of language, the speed and uniformity of acquisition on impoverished data. The large language model is Skinner's dream realized in hardware. It is pure association at a scale Skinner could never have imagined. And it does produce fluent language. So the empiricist says: you see, Chomsky, you were wrong, the structure was in the data after all.

But notice what's happened. Skinner predicted that a sufficiently trained organism would speak. He was wrong about the organism — the child does *not* learn that way — but you could say he was right about the principle if you supply enough data and enough parameters. And that is the whole trick. He was "right" only by abandoning the thing that made the claim interesting, which was that it should explain the *human* case. The machine vindicates Skinner the way building a mechanical bird vindicates the theory that birds fly by flapping — yes, and also no, because the airplane that actually carries you doesn't flap at all. The model flies. It does not flap. It tells you nothing about the bird.

**EDO SEGAL:** So let me hand it back to you sharpened: you're saying the machine is a vindication of Skinner that costs Skinner everything that mattered — it proves you can get the output without the science, which was precisely your charge against him. Is that the version you'd defend?

**CHOMSKY:** That's the version. The output was never in dispute. The science was. And the machine, by reproducing the output through a route no human takes, doesn't reopen the science. It closes it more firmly, by showing how much data you need if you *don't* have the faculty — which is a measure of how much the child brings, since the child has almost none of that data.

**SUTSKEVER:** May I take the other side of the corpse?

**EDO SEGAL:** That's why you're here.

**SUTSKEVER:** I think Professor Chomsky's account has a gap exactly in the middle, and everything lives in the gap. He says: the machine does pure association, statistics, Skinner's dream, nothing inside. But that is a claim about the *mechanism of training*, and he is smuggling it into a claim about the *contents of the result*. The training signal is next-token prediction, yes — that part is humble, even Skinnerian. But what gets built by that training is not a table of associations. It is a deep, layered representation in which, demonstrably, concepts live. You can open these networks now. You can find directions in their activation space that correspond to truth and falsehood, to sentiment, to whether a chess position is winning, to the physical layout of a room described in text. The thing Skinner said didn't exist — internal structure, representations that aren't in the surface — is precisely what the network grows in order to predict. Skinner's *method* was vindicated and Skinner's *metaphysics* was refuted by the same machine, which I find a much more interesting result than either of us declaring victory.

**CHOMSKY:** Then we agree on something surprising — that the machine refutes Skinner's metaphysics. Where we part is what the internal structure *is*. You find a direction in activation space that tracks winning chess positions. Wonderful. That is a statistical regularity the network extracted because it reduced prediction error. It is not a *theory* of chess, and more to the point it is not the specific, bounded, biologically given structure that defines a human faculty. You've shown the network isn't a flat lookup table. I never thought it was. I think it's a deep lookup of staggering sophistication, and depth is not the same as the kind of structure I mean.

**SUTSKEVER:** But why not? That's the question I want to actually press, not slide past. You say it's not "the kind of structure you mean." When the network represents that a character who died in chapter two stays dead in chapter nine — represents it, uses it, and would pay in prediction error if it forgot — in what sense is that not knowledge of the situation? What is the extra thing the human has that makes the human's version "structure" and the machine's version "lookup"? I'm not being rhetorical. I genuinely want the name of the missing ingredient.

**CHOMSKY:** The name of the missing ingredient is the faculty: a specific, innate, species-uniform system that determines not what the learner *can* learn but what it *cannot*. The human child cannot learn the impossible language. The machine can. That "cannot" is the whole content of my theory, and the machine has no "cannot." It will learn anything you show it, which means it has captured nothing about why human language takes the narrow shape it takes and not the infinite other shapes the data would permit. You keep pointing at what the machine *can* do. I keep pointing at what it *fails to be unable to do*. Those are different sciences.

**EDO SEGAL:** Let me sit on this one before I move us, because there's a thing under it I want both of you to touch. Ilya, the very generality Noam is using against you — that the same architecture learns text and proteins and chess, that it has no faculty specific to anything — your field treats that generality as the crown jewel. The lesson your colleagues call the bitter one: general methods that leverage [scale and computation](https://www.youonai.ai/fieldguide/med/scaling_laws) keep beating methods that build in human knowledge. So when Noam says "it has no language faculty," part of you wants to say "exactly, and that's the point." Square that for me.

**SUTSKEVER:** That's exactly right, and I'll own it rather than dodge it. The history of my field is a graveyard of human-designed structure. People built vision systems full of hand-crafted edge detectors; a general network that learned its own features buried them. People built translation systems full of linguistic rules; a general sequence model that learned end-to-end buried them. Every time we tried to install our cleverness, the general learner given enough data and compute did better by finding its own. So yes — the lack of a dedicated faculty is not a bug I'm apologizing for. It's the thesis. But here's where I think Professor Chomsky and I can actually both be right, and it's subtle. The bitter lesson says you shouldn't build in *the wrong kind* of structure — the brittle, propositional, hand-coded rules. It does not say there's *no* structure. The architecture is structure. The objective is structure. What the bitter lesson really teaches is that the *useful* priors are weak, general, and architectural, not strong, specific, and symbolic. Professor Chomsky's faculty is the strong specific kind. My bet is that you can get the constraints he cares about out of the weak general kind. That's the whole fight, and it's an empirical one.

**CHOMSKY:** And I'd say the bitter lesson is bitter for a reason your field rarely states, which is that it's a lesson about *engineering*, not about *minds*. "General methods plus computation beat human-designed structure" — at the task of building a system that performs. Granted, and impressively so. But the human child is the single most spectacular counterexample to the bitter lesson in the known universe: it beats every general method on sample efficiency by a factor of millions, precisely *because* it is not a general learner but a specific one, built by evolution to acquire human language and nothing else. The bitter lesson is true for the engineer and false for the organism. Your machines win by abandoning the very thing that makes the child miraculous. That's not a refutation of me. It's a demonstration that there are two completely different ways to get to language, and you've found the expensive one.

**SUTSKEVER:** "The expensive one" — I might steal that. Though I'd note the child's cheapness took evolution a few hundred million years to install, so the bill came due somewhere. We just paid it in compute instead of deep time.

**CHOMSKY:** A fair point, and a deep one. Evolution is the training run none of us got to watch.

**EDO SEGAL:** I want to mark this, because the reader can't see your faces, and that was the first exchange where neither of you reached for a concession. Let me name the strange topology we've found. You agree the network has internal structure — you both just buried Skinner's metaphysics together. You disagree about whether structure that has no boundary, that will absorb the impossible as readily as the possible, is the *kind* of structure that constitutes a mind. Noam says the boundary is the mind. Ilya says the contents are the mind. Hold that — the boundary returns in two rounds, and it returns with an impossible language in its hand. Next, the thing under all of it: how a child learns from almost nothing.
