The Turing Test, originally the Imitation Game, was proposed by Alan Turing in his 1950 paper "Computing Machinery and Intelligence." A human judge exchanges text-only messages with two entities — one human, one machine — and tries to determine which is which. If the judge cannot reliably tell them apart, the machine is said to have passed. The test sidesteps metaphysical questions about consciousness and locates intelligence in observable behavior.
For decades the Turing Test functioned as a distant north star: a clear operational definition that no system was close to meeting. With the advent of large language models, the test suddenly became practical and, many argue, obsolete. Modern LLMs routinely produce conversational output that passes casual Turing-style inspection, yet the field has largely shifted to more specific measures such as benchmark suites, task performance, and alignment evaluations.
The philosophical value of the test remains. It clarifies that behaviorist definitions of intelligence are more tractable than mentalist ones — at the cost of being satisfied by systems we may not want to call intelligent.
The Turing Test was a defining benchmark precisely because it was clearly out of reach. With that distance gone (a contemporary language model will often pass a casual text-only Turing Test), the question has become: what was the test testing, and was that thing ever a good proxy for intelligence? The standard critical view in 2025 is that the test operationalized fluency rather than understanding. Its historical importance was to move the debate from "can machines think?" to "can a machine's behavior be told apart from a human's?" Whether that was progress or evasion is contested.
Turing, A. M. "Computing Machinery and Intelligence." Mind 59, no. 236 (October 1950): 433-460. The paper explicitly proposes replacing "Can machines think?" with the more tractable imitation-game question.
Behavioral indistinguishability. The test defines intelligence by conversational output, not by internal mechanism.
The Chinese Room objection. John Searle's 1980 argument, from "Minds, Brains, and Programs," that a system could pass the Turing Test by rote symbol manipulation without "understanding." It remains one of the most-cited critiques.
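Variants. Total Turing Test (adds perceptual and motor, i.e. embodied, behavior to the conversational test), Reverse Turing Test (a machine judges whether it is dealing with a human, as in CAPTCHA), Winograd Schema Challenge (pronoun disambiguation as a proxy for language understanding).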
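What a "reliable judge" means. Turing's 1950 paper did not fix a formal pass criterion. Instead it predicted that by about 2000 an "average interrogator" would have no more than a 70 percent chance of making the right identification after five minutes of questioning. The reference to an average interrogator, rather than an expert in detecting AI, is often glossed over, but it matters: expert judges today can still often spot LLMs, while average users often cannot. The test's meaning depends on who is judging.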
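One way to make "reliably tell them apart" concrete is to ask whether a judge's identification accuracy can be statistically distinguished from guessing. The sketch below is an illustration under stated assumptions, not a criterion taken from Turing's paper: the function names, the 50 percent chance baseline, and the 0.05 significance threshold are all hypothetical choices made for this example.

```python
from math import comb


def p_value_above_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value.

    Returns P(X >= correct) for X ~ Binomial(trials, chance), i.e. the
    probability that a judge who is purely guessing would do at least
    this well. The 0.5 baseline is an assumption: it fits a setup where
    each trial is a two-way human/machine choice.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def judge_is_reliable(correct: int, trials: int, alpha: float = 0.05) -> bool:
    """Hypothetical decision rule: the judge "reliably" identifies the
    machine if their accuracy is significantly better than chance."""
    return p_value_above_chance(correct, trials) < alpha


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from any real study.
    for correct, trials in [(30, 50), (40, 50)]:
        p = p_value_above_chance(correct, trials)
        print(correct, trials, round(p, 4), judge_is_reliable(correct, trials))
```

Under these assumptions, a judge who is right 30 times out of 50 cannot be distinguished from a guesser at the 0.05 threshold, while one who is right 40 times out of 50 can. Running the same rule separately for expert and average judges would make the dependence on who is judging explicit.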