But the Turing test was not the right criterion, and Hofstadter had been arguing this for decades. The test fails along three dimensions. First, it conflates fluency with understanding. Fluency is the capacity to produce linguistically appropriate responses. Understanding is the capacity to grasp what those responses refer to, why they are relevant, how they connect to other knowledge. A system can be perfectly fluent without understanding anything it produces.
Second, the test evaluates behavior only under normal conditions. On familiar questions and in well-represented domains, the machine's performance is indistinguishable from a human's. On genuinely novel questions, in poorly represented domains, or under expert scrutiny, the performance degrades in ways a human's would not. The Deleuze failure is the paradigm case: under casual examination it passed any Turing test; under expert scrutiny it collapsed. The Turing test, by design, evaluates only the casual examination.
Third, and most deeply, the test assumes that intelligence is the only possible explanation for intelligent-seeming behavior. If intelligence were the only process capable of producing human-like output, behavioral indistinguishability would be a valid diagnostic. But the assumption is false. LLMs demonstrate that human-like behavior can be produced by a process fundamentally different from human intelligence: statistical pattern-processing rather than structural understanding. The test cannot distinguish these processes because it was not designed to. It evaluates outputs, and the processes are invisible in the outputs.
What is needed, in Hofstadter's view, is not a replacement but a supplement: evaluations that address what the Turing test cannot reach. Can the system identify the boundaries of its own competence? Can it distinguish warranted from unwarranted confidence? Can it recognize when it is operating at the edge of its training distribution? Can it evaluate the depth of its own analogies — tell you not just that a connection is illuminating but why, where it breaks down, how far it can be pushed? These are not behavioral questions in the Turing sense. They are questions about self-knowledge, about the strange loop, about the capacity for recursive self-evaluation that Hofstadter's framework identifies as constitutive of genuine intelligence.
Hofstadter has argued against the sufficiency of the Turing test for decades, most notably in his contributions to The Mind's I (1981). The explicit claim that the test is dead emerges in his engagements with large language models beginning around 2022, as the behavioral criterion began to be routinely satisfied by systems that, in Hofstadter's framework, still lacked understanding at the architectural level.
Behavioral indistinguishability satisfied. Current LLMs routinely pass casual Turing tests.
Three failure modes. Fluency-understanding conflation, normal-conditions bias, one-process assumption.
Process invisibility. The test evaluates outputs; the architectural differences are invisible in outputs.
Self-knowledge as new criterion. What the machine knows about what it produces is the diagnostic the Turing test cannot reach.
Practical urgency. Decisions to deploy these systems are being made without a framework for evaluating them.