Possible and Impossible Languages

EDO SEGAL: This is the round I've been most afraid of, because it's the one where you might actually settle something, and settling things is not what debates are supposed to do. Noam, your sharpest empirical claim — the one that's gotten you the most grief from the field — is that these systems will learn an impossible language as readily as a real one. Lay it out for the reader as if to a smart fifteen-year-old. Then, Ilya, I want you to do the hard thing: tell us what the argument gets right before you tell us where it fails.

CHOMSKY: Gladly. Human languages, for all their surface variety, obey deep constraints. The dependencies that matter run along hierarchical structure — the tree — never along simple linear distance — the string. I gave you the question rule earlier: you move the auxiliary of the main clause, defined structurally, never the third word, defined by counting. Now, you can easily design a language that violates this — a language where, say, to negate a sentence you flip the order of every other word, or where agreement is determined by linear position rather than structure. These are perfectly well-defined systems. They are computable. They are also impossible in the precise sense that no human language is ever built this way and no human child could acquire one, because the faculty rules them out in advance. My claim is simple: a statistical learner with no faculty has no reason to prefer the possible to the impossible. Show it a corpus in an impossible language and it will model it about as well as English. And a learner that treats the possible and the impossible alike has, by definition, failed to capture the one thing that defines human language — the line between them. It has no theory of the line. It has dissolved the line into statistics.

EDO SEGAL: Ilya. Steelman first. What does the argument get right?

SUTSKEVER: It gets something deeply right, and I want to honor it before I push, because it's the part the hype merchants in my own field ignore at their peril. It is genuinely true that a learning system's inductive biases are most of what it is, and that a system which is equally happy to learn anything has, in a real sense, learned nothing in particular. A learner with no preferences is not a learner. That's not just Professor Chomsky's insight; it's a theorem in my field: there is no free lunch, and generalization requires bias. So the structure of his argument is correct: if these networks truly had no preference between humanlike and anti-humanlike structure, that would be a profound fact about what they are not. Every machine learning researcher should have that tattooed somewhere. There. That's the steelman, and I mean it.

Now the two places it fails, and they're both load-bearing. First, the empirical claim has actually been tested in the last few years, and it is not coming out cleanly the way Professor Chomsky needs it to. When researchers train these networks on impossible languages — the counting rules, the reversal rules, the linear-agreement rules — the networks generally learn them worse. Slower, with higher loss, less stable. The architecture is not neutral. It has a bias, and the bias leans, imperfectly but measurably, toward the kinds of structure human languages actually have. That's a real result, and it's exactly the kind of result the strong version of the impossible-language argument predicted could never happen.

CHOMSKY: "Worse" is not "cannot." That's the whole game. The child does not learn the impossible language worse. The child cannot learn it at all — it falls outside the faculty, the way a sound outside your hearing range isn't a quiet sound, it's no sound. Your network learning it "with higher loss" is a network learning it. A bias you can overcome with more data is not a constraint. You've shown the architecture has a preference. I never doubted it had preferences; everything has preferences. I doubted it had a boundary, and "worse but still learnable" is precisely the absence of a boundary.

SUTSKEVER: That's the second place, and it's the real disagreement, so let me be careful. Professor Chomsky defines the human faculty by an absolute "cannot." I want to question whether the human "cannot" is as absolute as he needs it to be — because if it's also just "much worse," then we're arguing about the steepness of a hill, not the existence of a wall. We don't actually have many humans raised from infancy on a rigorously impossible language; the experiment is unethical and has never really been run. What we have is the strong intuition, from the structure of attested languages, that the faculty forbids certain things. That's good evidence. But "no attested language does this" is consistent with "the faculty makes it very hard" as well as "the faculty makes it impossible." And my networks are now data on that question. They suggest the relevant universals might be strong statistical biases of any good sequence learner, biases that fall out of the structure of the prediction problem itself — locality, hierarchy, the cost of long-distance dependencies — rather than a dedicated, genetically specified organ. That's not me waving my hands. That's me saying: the thing you attribute to a special faculty might be a property of the learning problem, and my machines are the first instrument we've ever had for telling those two apart.

CHOMSKY: And that is the most serious version of your position, so let me give it its due and then say where it still fails. Suppose the universals do fall out of the structure of any efficient sequence learner. Even then — even granting you the strongest form — you have explained why a system that must predict text would favor these structures. You have not explained the human child, who is not primarily a text predictor, who acquires language for thought before and apart from communication, and who hits the constraints whether or not they're efficient for prediction. The properties of human language are, many of them, bad for communication — ambiguity, structure-dependence that complicates parsing. They look like the signature of a system built for internal computation, externalized only secondarily. Your account explains the externalized surface. It is silent on why the faculty has the shape it has, which shows up in things prediction would never select for.

SUTSKEVER: That is a good objection and I don't have a complete answer to it. I'll say only that "language is for thought, not communication" is itself a hypothesis, a beautiful one, that I am not sure the evidence forces. But I concede I'm now arguing past my data.

EDO SEGAL: I want to stop the room, because something important just happened and the reader should not miss it under the technical noise. You came into this round at "Skinner's corpse" versus "a new mind." You're leaving it agreeing that the architecture is not neutral, that it leans toward humanlike structure, and disagreeing about whether that lean is a property of the learning problem or the signature of a dedicated biological organ. That is a genuinely narrow, genuinely deep disagreement, and it is answerable, in principle, by experiments neither of you has finished running. Mark it as convergence number two: the machine is not a blank slate, and its biases are real and partly humanlike. The fight is now about where those biases come from. And that question — what the machine is really doing when it predicts — is Ilya's home ground. We go there next.
