In 2023, a Wharton School research team led by Christian Terwiesch took moral dilemmas of the kind Appiah addresses weekly in his New York Times Magazine 'The Ethicist' column and presented them to GPT-4. They then showed both sets of responses — Appiah's and the machine's — to hundreds of evaluators without identifying the source. The evaluators found no significant difference in quality. A subsequent study by researchers at UNC Chapel Hill and the Allen Institute for AI found that GPT-4o's ethical advice was rated by nine hundred evaluators as 'more moral, trustworthy, thoughtful and correct' than Appiah's own. The experiments became the empirical provocation that drove Segal's foreword and Appiah's implicit response across the book — not because they proved the machine was wiser but because they exposed the distinction between the product of ethical reasoning and the position from which ethical reasoning is done.
The Wharton experiment was designed to test whether large language models could perform a paradigmatically human task: giving the kind of ethical advice that a philosopher's specific training and experience are supposed to make possible. The result unsettled that assumption. If the output is indistinguishable, what does the philosopher possess?
Cognitive scientist Gary Marcus articulated the objection that Appiah's framework makes philosophically rigorous: the evaluators were rating the product, not the process. They could assess whether the advice sounded moral, trustworthy, and thoughtful. They could not assess whether the advisor occupied a position from which genuine advice can be given — whether the source possessed the practical wisdom, the biographical specificity, the accumulated engagement with real human dilemmas that Appiah has called the conditions of ethical counsel.
Segal's foreword describes sitting with the result for a long time. It was the experiment that cracked a particular part of the fishbowl — not the part about capability, which he had already accepted, but the part about what capability is worth when it can be reproduced at the cost of a subscription. Appiah's body of work, developed across four decades, constitutes the most sophisticated available response.
The Wharton researchers acknowledged they 'did not design this study to put Dr. Appiah out of work.' The reassurance missed the deeper point. The question is not whether Appiah will keep his column. The question is whether a civilization that can generate ethical advice computationally will continue to value the kind of ethical thinking that arises from lived experience.
The Wharton study was conducted in 2023; the UNC Chapel Hill and Allen Institute study followed. Both used the New York Times Magazine 'The Ethicist' column as the human benchmark precisely because Appiah's reputation made the comparison maximally provocative.
Indistinguishable output. Evaluators rated GPT-4's responses and Appiah's as equivalent in quality, and in follow-up work preferred the machine's.
Output is not process. What the evaluators rated was the product of ethical reasoning, not the position from which it was produced.
The empirical provocation. The studies function as the empirical ground for the philosophical question the rest of Appiah's framework answers — what does the human philosopher possess that the machine does not?
Marcus's corrective. The crowd workers' evaluative framework may not capture the dimensions along which Appiah's advice is genuinely superior.