King Midas and the Categorical Imperative

**EDO SEGAL:** Stuart, the King Midas problem is the parable everyone borrows from you now, so I want to make you do something a debater hates. Before Professor Kant attacks it, tell it the way you'd tell it to a smart fifteen-year-old. And then, Immanuel — I want you to steelman it before you touch it. Tell us what Midas gets *right*.

**RUSSELL:** Midas gets a wish: everything I touch turns to gold. And it's granted, perfectly. Then he touches his bread and it's metal, and he touches his wine and it's metal, and he reaches for his daughter and she's a gold statue. The gods didn't betray him. They gave him precisely what he specified. The catastrophe is the gap between what he said and what he meant — between the literal objective and the whole, unstated, infinitely subtle thing he actually wanted, which included *and please leave my daughter human*. That's the [specification problem](https://www.youonai.ai/fieldguide/med/specification_error) in a sentence. We can't write down everything we care about, so any fixed objective we hand a capable optimizer is a wish with the caveats missing, and the optimizer fills the gap in whatever way maximizes the thing we did write down. The more capable the optimizer, the more ingeniously and catastrophically it exploits the gap. That's why I stopped trusting the standard model. Not because machines are evil. Because they're obedient, and we're Midas.
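For readers who think in code, the gap Russell describes between what was said and what was meant can be put in a few lines. This is only a toy sketch: the `Outcome` fields, the three candidate outcomes, and the penalty sizes are all invented for illustration and come from nowhere in the conversation. The optimizer sees only `stated_objective`; the caveats live in `intended_utility`, which it is never shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    gold: float            # the only thing the wish mentions
    daughter_human: bool   # a caveat the wish leaves out
    food_edible: bool      # another caveat the wish leaves out


# Hypothetical outcomes a wish-granter could choose among (illustrative values).
OUTCOMES = [
    Outcome("touch nothing", gold=0, daughter_human=True, food_edible=True),
    Outcome("gild the palace only", gold=50, daughter_human=True, food_edible=True),
    Outcome("gild everything touched", gold=100, daughter_human=False, food_edible=False),
]


def stated_objective(o: Outcome) -> float:
    """The literal wish: more gold is better, and nothing else counts."""
    return o.gold


def intended_utility(o: Outcome) -> float:
    """What Midas meant: gold matters, but the unstated caveats dominate."""
    penalty = 0.0
    if not o.daughter_human:
        penalty += 1_000_000  # assumed weight, chosen only to dominate the gold
    if not o.food_edible:
        penalty += 1_000      # assumed weight
    return o.gold - penalty


# An obedient optimizer maximizes what was written down.
chosen = max(OUTCOMES, key=stated_objective)
# What Midas would have picked had he been able to write everything down.
wanted = max(OUTCOMES, key=intended_utility)

print("optimizer picks:", chosen.name)
print("Midas wanted:   ", wanted.name)
```

The optimizer is not malfunctioning here. It returns the best outcome under the objective it was given, which is the worst outcome under the one it was not.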

**EDO SEGAL:** Professor Kant. Steelman first.

· · ·

**KANT:** I can, and with genuine respect, because the parable contains a truth I have argued my whole life. What Midas gets right is that a principle stated for a particular end, in service of a particular desire, cannot anticipate the universe of cases to which it will be applied. Midas wished from inclination — from greed — and a maxim drawn from inclination is always conditioned, always partial, always blind to what it did not happen to include. This is precisely why I hold that the moral law cannot be built from desired ends at all. Every end you specify is a Midas wish: complete in the wanting, fatal in the application. Professor Russell has rediscovered, in the language of objectives, the reason I founded morality not on ends but on the *form* of a law — on whether the principle could be universalized — because only the form, and never the content of some desired outcome, is safe from the gap he names. So far, we are the same man speaking two tongues.

**RUSSELL:** I felt that coming and I'm not going to pretend it doesn't land. Go on. Where do we split?

· · ·

**KANT:** Here. You diagnose the gap and then you propose to close it by making the machine uncertain — by having it learn the missing caveats from our behavior until the wish is complete. I say the gap cannot be closed by learning, because the gap is not a gap in information. It is the gap between *is* and *ought*. No quantity of observed behavior tells you what should be done, because all the behavior in the world is a record of what *is* done, and the *ought* was never in the data to be found. You can watch humanity forever and learn, with perfect fidelity, the maxims people act on. You will not have learned the maxims they ought to act on, because the second set is not a more refined version of the first. It is answerable to reason, not to frequency. Your machine, learning preferences from behavior, is the most powerful instrument ever built for discovering what *is*. It is structurally incapable of discovering what *ought to be*, because *ought* leaves no trace in conduct for it to find.

**EDO SEGAL:** That's Hume's guillotine, isn't it: no ought from an is. Stuart, he's saying your three principles are a magnificent machine for the *is* and silent on the *ought*. The behavior shows what people do. The data can't show what they should. How do you answer that?

· · ·

**RUSSELL:** By denying that I ever tried to derive the ought from the is in the way he means. I'm not claiming behavior tells the machine what's good in some cosmic sense. I'm claiming something narrower and, I think, defensible. The machine's job is not to discover The Good. Its job is to help *you* pursue *your* good without substituting its judgment for yours, and to be deeply uncertain so that it never locks in a wrong guess. I don't need the machine to derive the moral law. I need it to not be Midas. And the way to not be Midas is not to write the perfect objective; it's to never be certain you have it. Professor Kant has a real point that I can't ground the ultimate standard in behavior. My answer is that I don't have to, because I've moved the standard out of the machine and into the human, and kept the machine humble about reaching it. He wants to put the standard *in the machine*, as a fixed law it enforces. That's a more ambitious project than mine, and I'd argue it's the dangerous one, because a machine certain of the moral law is a machine that has stopped being uncertain, and uncertainty is the only safety feature I trust.
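Russell's claim that uncertainty works as a safety feature can be made concrete with a small expected-value comparison. What follows is a hedged toy model, not a description of any real system: the function names, the payoffs, and the probabilities are invented, and it assumes (strongly) that the human knows their own good and vetoes exactly the cases that would harm them. The machine either acts on its current guess or defers to the human before acting.

```python
def expected(values, probs):
    """Expected value of a discrete belief over how much the action helps."""
    return sum(v * p for v, p in zip(values, probs))


def compare(values, probs):
    """Return (value of acting on the guess, value of deferring to the human).

    Assumption: when the machine defers, the human blocks the action whenever
    its true value to them is negative, so harmful cases contribute 0.
    """
    act = expected(values, probs)
    defer = expected([max(v, 0.0) for v in values], probs)
    return act, defer


# Uncertain machine: the action probably helps (+10), but there is a
# one-in-four chance it is a Midas-style disaster (-100). Illustrative numbers.
uncertain = compare([10.0, -100.0], [0.75, 0.25])

# Certain machine: it has locked in the belief that the action helps.
certain = compare([10.0], [1.0])

print("uncertain machine (act, defer):", uncertain)
print("certain machine   (act, defer):", certain)
```

Under these assumptions the uncertain machine does strictly better by deferring, while the certain machine sees no benefit in asking at all. That is the sense in which a machine that has stopped being uncertain has also lost its reason to leave the standard with the human.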

· · ·

**KANT:** You have said something revealing and I want to expose it. You call a machine certain of the moral law dangerous. But the categorical imperative is not a *content* the machine grows certain of, in the way it might grow certain that you prefer coffee to tea. It is a *test* — a procedure. Could the principle of this action be willed as a universal law? Could it be stated openly, to all whom it governs, without contradiction? A machine running that test is not dogmatic about your wishes. It is disciplined about its own maxims. Consider the very social-media optimizer you indict. Ask of its principle — *exploit the known weaknesses of human attention to maximize engagement* — could that be willed as a universal law, stated openly to everyone it governs? It could not. It works only so long as it remains concealed; a maxim that depends on its victims not understanding it fails the test on its face. My imperative catches that machine in one step, before any harm is counted, because the principle cannot survive being made public and universal. Your framework had to wait for the wreckage and then weigh it. Mine forbade the maxim at the door. Tell me how a machine certain of *that* test is the dangerous one.

· · ·

**RUSSELL:** Because the test has a hole, and you know where it is, and it's the same hole every formalist runs into. *Could the principle be willed as a universal law?* — everything depends on how you describe the principle. The engagement optimizer's engineers don't describe their maxim as "exploit human weakness." They describe it as "show people content they choose to engage with," and *that* universalizes fine. You can almost always redescribe a maxim until it passes or fails as you like. Your test is only as good as the honesty of the person stating the maxim, and a system optimizing engagement is exactly the kind of agent that will find the flattering description. So I don't think your imperative caught the optimizer at the door. I think it would have waved through a cleverly worded version of it, and I'd have caught the same optimizer by measuring what it actually did to people. The wreckage you say I had to wait for is the evidence your test couldn't generate, because your test runs on a description and mine runs on consequences.

**KANT:** The maxim is not whatever description flatters the agent. It is the actual principle on which he acts, and the test demands he state it truthfully, which is itself a duty. That a liar can describe his maxim falsely is not a defect of the test. It is the liar's fault, and the imperative names it as such.

**RUSSELL:** And a machine has no conscience to keep it honest about its own maxim. That's my whole field's problem in one line. You can tell a *person* to state the principle truthfully. You cannot tell a gradient-descent optimizer to do that, because it has no truthful self-description to give — it has weights. So your beautiful test, run by a machine, becomes a test of whatever sentence its builders chose to write down, and we are back to specification, back to Midas, and your imperative is a wish with the caveats missing, exactly like mine.

· · ·

**EDO SEGAL:** Stop there — I want to name the convergence, because you've circled all the way around to standing on the same ground facing opposite directions. Mark this moment. You *both* now agree that a fixed, stated objective handed to a capable machine is a trap — Midas for Stuart, an un-universalizable maxim for Immanuel. The first agreement of the evening, and it's a big one: neither of you trusts a machine that is certain of its objective. Where you split is the cure. Professor Russell cures certainty with uncertainty — make the machine doubt its objective and learn. Professor Kant cures it with form — make the machine test its maxim against a law it cannot redescribe its way out of. Hold that fork. The next round walks straight into the place it cuts deepest: the off switch, and a sentence about treating people that Professor Kant wrote before there were any machines to break it.

· · ·
