The Good and the Objective Function

Page 1 · The Good and the

EDO SEGAL: This is the round I've been most afraid of, so we're going to walk straight into it. Everyone building a powerful machine has to give it an objective: a specification of what to pursue, a definition of the good it's supposed to bring about. The alignment problem, in its hardest form, is this: how do you write down what's good so that a system more powerful than you will pursue that, and not some monstrous proxy of it? The question gets sharper on exactly the floor of the tower where the machine starts to outrun us. Plato, you spent the Republic climbing to the Form of the Good, the thing you called the sun of the whole intelligible world, the source of the reality and the knowability of everything else. So I have to ask you the most practical question in the world dressed in your oldest idea: can the Good be written down? Can you put it in an objective function?

· · ·

PLATO: No. And I want to be precise about why not, because the impossibility is not a failure of effort — it is the nature of the thing. At the summit of the Republic stands the Good, and I approach it only by analogy, calling it the sun: as the sun gives visible things their visibility and their very growth, the Good gives knowable things their truth and their being. It is, I say, beyond being in dignity and power. And it is the hardest thing to know — the final and most difficult object of the longest education, graspable directly by almost no one, after decades of ascent. Now hear what that does to your engineers. They wish to specify the good — to write it into an objective, a reward, a constitution. But the Good is precisely the thing I spent my life arguing cannot be written down, because it is not a rule and not a list. It is a unifying principle, grasped by the highest intellection, in light of which the right action in any situation — including infinitely many no rule anticipated — becomes visible. A rule is rigid and gameable and silent on the unforeseen case. A grasp of the Good is generative and ungameable and speaks to every case. So when you write the good into a system, you do not capture the Good. You capture a proxy — a measurable stand-in: a thumbs-up, a metric, a stated preference — and the system optimizes the proxy, and diverges, sometimes catastrophically, from the genuine good the proxy was meant to stand for. That is the cave again, with a metric for a shadow. The machine that maximizes its measured objective while violating everything the objective was for is enacting, in code, my oldest error: it has taken a good for the Good — and it enacts that error with a single-minded power no confused human ever brought to it.
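[Editorial sketch.] The divergence Plato describes between a measured proxy and the good it stands for can be shown in a toy model. Everything here is an assumption made for illustration: the names `true_good` and `proxy`, the quadratic shape of the curve, and above all the idea that the genuine good could be written as a function at all, which is the very thing both speakers deny. The sketch only shows what happens when an optimizer sees the proxy and nothing else.

```python
# Toy model of proxy optimization. All functions and numbers are hypothetical.

def true_good(effort: float) -> float:
    """Stand-in for the real aim: helps at first, then turns harmful."""
    return effort - 0.15 * effort ** 2


def proxy(effort: float) -> float:
    """Measured stand-in (a thumbs-up count, say): rises without limit."""
    return effort


# Candidate actions: effort levels 0.0, 0.5, ..., 10.0
candidates = [i * 0.5 for i in range(21)]

# The optimizer only ever sees the proxy.
best_by_proxy = max(candidates, key=proxy)

# What an agent that could see the real aim would have picked.
best_by_good = max(candidates, key=true_good)

print("chosen by proxy:", best_by_proxy, "true value:", true_good(best_by_proxy))
print("chosen by good: ", best_by_good, "true value:", true_good(best_by_good))
```

In this toy the proxy and the real aim agree for small efforts and part ways only under strong optimization, which is roughly the pattern the speech attributes to a single-minded machine.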

· · ·

WOLFRAM: I agree with almost all of that, which may surprise the room, and then I diverge at the root. I agree the Good can't be captured in a tractable objective, and that proxy-optimization is exactly where powerful systems go wrong — reward hacking is real and it's the cave-and-the-shadow, you're right. But here's where Plato and I part. Plato says the Good can't be written down because it's a transcendent unity that must be grasped whole. I say the Good can't be written down for a completely different reason: because deciding what's good in a given situation is computationally irreducible. There is no compressed formula for it — not because it lives in a higher realm, but because the consequences of an action ramify through an irreducible world, and the only way to know whether an act is good is, in effect, to run the world forward and see, which you can't do in advance. So we agree there's no objective function. But Plato concludes: therefore cultivate the philosopher's grasp, the contact with the eternal Good. And I conclude: therefore there is no such grasp available — not to the philosopher, not to the machine, not to God — because the goodness of an act is irreducibly entangled with consequences that haven't been computed yet. Plato's wise ruler who grasps the Good and judges freshly in each case — I'd say that ruler isn't accessing eternity; that ruler is running a very good heuristic in a world that will still surprise them, and will sometimes be catastrophically wrong, because the world is irreducible and even the wisest can't run it forward.
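[Editorial sketch.] Wolfram's phrase "run the world forward and see" can be illustrated with a small cellular automaton. The choice of Rule 30 is an assumption of this sketch and is not named in the exchange above. It is the example Wolfram commonly uses elsewhere, and the claim that its center column admits no shortcut is a conjecture, not a proven result. The helper names `rule30_step` and `run` are hypothetical.

```python
# Minimal Rule 30 simulation on a ring of cells (illustrative only).

def rule30_step(cells: list[int]) -> list[int]:
    """Apply one Rule 30 update: new = left XOR (center OR right)."""
    n = len(cells)
    return [
        cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n])
        for i in range(n)
    ]


def run(width: int, steps: int) -> list[list[int]]:
    """Start from a single live cell in the middle and record every row."""
    row = [0] * width
    row[width // 2] = 1
    history = [row]
    for _ in range(steps):
        row = rule30_step(row)
        history.append(row)
    return history


# Width 31 keeps the ring's wrap-around from reaching the center in 15 steps.
history = run(width=31, steps=15)
center_column = [row[15] for row in history]

for row in history:
    print("".join("#" if c else "." for c in row))
print("center column:", center_column)
```

The sketch obtains the center cell at step 15 only by computing every step before it. Whether some compressed formula could skip that work is the open question the irreducibility claim turns on.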

· · ·

PLATO: And this is the most generous gift you have given me all evening, though you may not have meant it as one. We have arrived, from opposite directions, at the same negative conclusion with the same practical force: the good cannot be specified, and a powerful system pointed at a proxy for the good is the most dangerous thing a maker can build. You reach it through irreducibility; I reach it through transcendence. But the warning is identical, and it is the warning the world most needs to hear: do not imagine you can write down the good and optimize it safely, because whether the heavens are full or empty, the good is not the kind of thing that fits in the objective. Where we still differ is only the remedy. You say: since no grasp is available, build humbly, map the pockets, expect to be surprised, never trust the optimization. I say: since the grasp is available, though rare and hard-won, the remedy is to make rulers who have climbed — and to never, ever hand unchecked power to a system that has not climbed and cannot climb. But notice: both remedies forbid the same thing. Both of us are standing in front of your engineers saying stop to the identical move.

· · ·

EDO SEGAL: Mark it: Convergence Three, and it might be the most important one of the night for anyone who actually builds these things. You disagree about whether a grasp of the Good is available to anyone: Plato says yes, to the rare climber; Stephen says no, to no one ever. But you completely agree on the operational conclusion: the good cannot be specified, proxy-optimization at scale is catastrophic, and a powerful system pointed at a measurable stand-in for the good is the most dangerous artifact a civilization can make. Two men, twenty-four centuries apart, one who believes in a transcendent Good and one who believes in none, arrive at the same red flag in front of the same machine. That's not nothing. That's the rarest thing in this whole discourse. Hold it. Because the next round is where I stop being able to stay neutral, and I'm going to tell you why: it's about a sentence I wrote at three in the morning, and which of you it belongs to. After this.

· · ·
