The Wharton experiment was designed as a test of whether large language models could perform a paradigmatically human task — the giving of ethical advice by a philosopher with specific training and experience. The result unsettled the discourse. If the output is indistinguishable, what does the philosopher possess?
Cognitive scientist Gary Marcus articulated the objection that Appiah's framework makes philosophically rigorous: the evaluators were rating the product, not the process. They could assess whether the advice sounded moral, trustworthy, and thoughtful. They could not assess whether the advisor occupied a position from which genuine advice can be given — whether the source possessed the practical wisdom, the biographical specificity, the accumulated engagement with real human dilemmas that Appiah has called the conditions of ethical counsel.
Segal's foreword describes sitting with the result for a long time. It was the experiment that cracked a particular part of the fishbowl — not the part about capability, which he had already accepted, but the part about what capability is worth when it can be reproduced at the cost of a subscription. Appiah's body of work, developed across four decades, constitutes the most sophisticated available response.
The Wharton researchers acknowledged they 'did not design this study to put Dr. Appiah out of work.' The reassurance missed the deeper point. The question is not whether Appiah will keep his column. The question is whether a civilization that can generate ethical advice computationally will continue to value the kind of ethical thinking that arises from lived experience.
The Wharton study was conducted in 2023, followed by the UNC Chapel Hill and Allen Institute study later that year. Both studies used the New York Times Magazine 'The Ethicist' column as the human benchmark precisely because Appiah's reputation made the comparison maximally provocative.
Indistinguishable output. Evaluators could not distinguish GPT-4 responses from Appiah's, and in follow-up work preferred the machine's.
Output is not process. What the evaluators rated was the product of ethical reasoning, not the position from which it was produced.
The empirical provocation. The studies function as the empirical ground for the philosophical question the rest of Appiah's framework answers — what does the human philosopher possess that the machine does not?
Marcus's corrective. The crowd workers' evaluative framework may not capture the dimensions along which Appiah's advice is genuinely superior.