The Schema That Was Passed and Not Solved

Page 1 · The Schema That Was

EDO SEGAL: Terry, in 2012 the computer scientist Hector Levesque proposed a test for machine intelligence and named it after you, because it was built from the kind of sentence you had used decades earlier to show how much understanding ordinary language quietly demands. Give us the canonical example — your own — and tell us what it was designed to defeat. And then I want to ask both of you about what happened when the machines passed it, because the result is one of the strangest things in this whole field.

WINOGRAD: The example is mine, yes. "The city councilmen refused the demonstrators a permit because they feared violence." Who is "they"? Any competent reader knows instantly: the councilmen. Now change one word. "The city councilmen refused the demonstrators a permit because they advocated violence." Now "they" is the demonstrators. The grammar is identical. Nothing in the syntax tells you who the pronoun refers to. To resolve it you must understand how councilmen and demonstrators relate, who fears and who advocates, what a permit is for. You must, in short, understand the situation the sentence describes. Levesque's insight was that a large set of such pairs would be a test that, unlike the Turing Test, could not be passed by evasion or charm or canned jokes. You would have to actually understand. It was the SHRDLU lesson turned into a benchmark: a test into which the listener's intelligence could not be smuggled, where the machine would have to supply the understanding itself.

EDO SEGAL: And then?

WINOGRAD: And then the machines passed it. As the large pretrained models scaled up, their performance on Winograd-style problems climbed steeply, and by 2019 systems were exceeding ninety percent. A harder, adversarially built version called WinoGrande was made specifically to put the difficulty back, and the models climbed that too. By 2023 the researchers closest to the Winograd Schema Challenge acknowledged what had happened: the test designed to require full-bodied understanding had been soundly defeated by systems that, in those researchers' judgment, still seemed to lack exactly that. The benchmark named after the field's great skeptic of machine understanding was conquered by machines whose understanding the skeptic still doubts. I find that funny in a way that does not make me laugh.

HEIDEGGER: I want to take this result seriously in the way it deserves, which is to refuse both of the easy readings. One camp says: the machine passed, therefore it understands, and your skepticism is refuted. The other says: the machine passed by pattern-matching, therefore it understands nothing. Both treat the test as if it could settle the question. The vertiginous thing — the genuinely instructive thing — is that the test did not settle it. We built the cleanest possible behavioral examination of understanding. The machine passed. And we are no closer to agreement than before. What does that tell a thinking person? It tells you that what you mean by "understanding" is not exhausted by the behavior. If it were, the behavior would have ended the argument. The persistence of the disagreement, after the behavior has been achieved, is the proof that understanding lives somewhere behavior cannot reach. The schema is a koan. It was passed and it was not solved, and the gap between those two facts is the whole of my philosophy, demonstrated in a laboratory I never entered.

EDO SEGAL: Let me restate that, Professor, because it is subtle and I do not want the reader to slide past it. You are saying the defeat of the test is not evidence the machine understands and not evidence it doesn't — it is evidence that understanding-detection from the outside is impossible, because understanding is not a behavior. That the experiment's real result is a fact about us: that we cannot stop meaning more by "understanding" than any test can measure. Is that the claim?

HEIDEGGER: That is precisely the claim, and notice that it is the strongest possible form of Herr Winograd's position, stronger than the version he usually defends. He need not assert that the machine definitely lacks understanding. He need only point out that producing the behavior did not produce the agreement — and could not, because the thing in question was never behavioral. Every floor of his argument rests on that, and the machine built him the cleanest evidence he could have wished for, by passing.

WINOGRAD: I accept the gift, and now I am going to do the thing Heidegger will not do, which is to argue against myself, because I do not want to win this round too cheaply and a thinking reader should feel the pull of the other side. There is something suspicious about a standard for "real understanding" that no behavioral test could ever satisfy — that retreats, every single time a test is passed, to insisting the understanding still is not real. At some point a sufficiently robust competence starts to look like a distinction without a difference, and the demand for a "full-bodied" understanding behind the behavior begins to resemble a demand for a ghost. I have felt that objection land on me. My honest reply is not that the machines definitely lack understanding — I am too careful for that. It is that the question cannot be settled by benchmarks at all, because benchmarks measure performance and the whole issue is whether performance exhausts understanding. And the defeat of my own schema is, paradoxically, my best evidence for that meta-claim. We built the best behavioral test we could; the machines passed; we are no closer to agreement. That should tell us the question lives where behavior cannot go.

EDO SEGAL: But here is the trap I suspect you will refuse, Terry, and I want to spring it anyway, because it has been sitting under the table. If passing the test proves nothing, doesn't failing it also prove nothing? Haven't you built a position where no possible evidence could ever move you — where the machine that aces every test you devise is, by definition, still mimicking? Isn't that exactly the unfalsifiable framework your own critics accuse you of?

WINOGRAD: That is the sharpest blow available against me and I will not flinch from it. Yes — my framework risks becoming able to explain away anything, and a theory that can absorb any result is predicting nothing, which is a different and more dangerous kind of weakness than being wrong. I feel the edge of that. Here is my defense, and you will judge whether it holds. There is an observation that would move me, and I will name it later tonight when Heidegger and I cross, because it is a real falsifier and not a fog. But it is not a benchmark, and it is not more fluency. It is a behavior of a specific structural kind — something that would show the involvement is the machine's own and not borrowed. Until that day, my refusal to count fluency as understanding is not unfalsifiable stubbornness. It is the recognition that the one thing in dispute — whether performance exhausts understanding — cannot be measured by more performance. That is not me moving the goalposts. That is me noticing the goalposts were never on the field the benchmarks are played on.

HEIDEGGER: And I would add only this, because Herr Winograd is being too modest about how deep the hole goes. The reason no test reaches it is the same reason the frame problem was never solved by adding more rules. Relevance — what stands out as mattering — is not computed from a list. It is given by care, by being a creature for whom the situation has stakes. The schema requires you to know that councilmen fear and demonstrators advocate, and you know it not because you looked it up but because you have lived among the fearful and the defiant and have a body that has felt both. The machine produces the right answer by the shadow that fact casts in the text. The shadow is reliable enough to pass. It is not the standing-in-the-light that cast it.

EDO SEGAL: Mark this moment, because it is the first real convergence of the evening and agreements are news. You have arrived, from the armchair and from the workshop, at the same place: that the machine's passing of the test relocates the question rather than answering it, and that the place the question now lives is care: involvement, stake, the body that has felt fear and defiance. Number it the first convergence. The next round goes after the thing that produced the passing, the thing that surprised Terry, as he was honest enough to admit. Not structure. Not rules. Scale. Brute, dumb, staggering scale, and what a man raised to believe in careful systematic analysis made of watching the crude approach win. After this.
