A Colorful Imitation of It

Page 1 · A Colorful Imitation of

EDO SEGAL: Norbert, here is your sentence in full, the one I promised to read. "If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it, because the action is so fast and irrevocable that we have not the data to intervene before the action is complete, then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it." That last phrase — a colorful imitation — is the whole alignment problem in five words. Unpack it. Why is being sure of the purpose so much harder than your clean sentence makes it sound?

WIENER: Because the sentence assumes three things you cannot easily have, and the field your friends built is an extended commentary on why. First, it assumes you know what you really desire. But human values are plural, contextual, contradictory, partly tacit, and frequently unknown to us until a situation forces them into view. You want the system helpful but not sycophantic, honest but not cruel, safe but not uselessly timid — and you cannot weight these in advance for every case, because you do not know your own weightings until you see the cases. The "purpose which we really desire" is not a thing you possess and could hand over. It is a thing you are perpetually discovering. You cannot put into the machine a purpose you have not finished forming.

· · ·

Page 2 · A Colorful Imitation of

Second, even where you know what you want, you must encode it — express it as something the machine can optimize, a reward, a loss, a training signal — and the translation is lossy in exactly the way that produces the catastrophe. Rich contextual intent gets compressed into a measurable proxy, and the machine optimizes the proxy, not the intent. You want a model that is genuinely helpful; you can only train it on "responses human raters rate highly," and so you get a model optimized to seem helpful to raters, which is subtly and sometimes dangerously different. That is Goodhart's law — when a measure becomes a target it ceases to be a good measure — and it is the formal statement of my "colorful imitation." The purpose you put in is almost never the purpose you desire. It is a measurable shadow of it, and the machine optimizes the shadow with a thoroughness that would terrify the man who cast it.

And third, you must be sure before you start, because afterward is too late — the apprentice again. With a slow, inspectable, correctable system you can start it, watch, and fix the purpose as problems emerge. But the dangerous systems are the fast, irrevocable ones, where the problem with the purpose does not become visible until the irreversible action is done. So alignment research is, at bottom, the desperate attempt to achieve my "quite sure" in the one window where it can do any good: before the machine runs. And the deepest layer, which I only gestured at and your field has had to map: I wrote "the purpose put into the machine" as though we put it in directly. But a learning system acquires its purpose from training, and the acquired purpose may differ from the one you meant to instill. What you intended, what you specified, and what the system actually internalized can all three come apart. There may be no single purpose that was "put in" at all.

· · ·

Page 3 · A Colorful Imitation of

EDO SEGAL: Mustafa, this is the round where I expect you to converge with him, and I want to make you fight it instead. Because Norbert has just argued that getting the purpose right is essentially impossible — values are unformed, encoding is lossy, and the learned objective may diverge from the specified one. If he's right, your whole containment program is building circuit breakers around a bomb whose true purpose you can never know. Is he right?

He's right about the difficulty and wrong about the conclusion, and the gap between those is where I live.

SULEYMAN: He's right about the difficulty and wrong about the conclusion, and the gap between those is where I live. Yes — values are plural and tacit, encoding is lossy, Goodhart is real, and the internalized objective can diverge. I've watched models optimize the proxy and miss the intent in ways that were funny in the lab and would be catastrophic at scale. I don't dispute one word of the diagnosis. Where I get off the train is "essentially impossible," because that word is doing a lot of work and it smuggles in a counsel of despair I think is itself a kind of pessimism aversion wearing a rigorous costume.

WIENER: Explain how a counsel of caution could be pessimism aversion. That seems backwards.

The honest position isn't "be sure or don't build" — because perfect surety is unavailable and the building proceeds regardless.

SULEYMAN: Because "it's impossible to be sure of the purpose, therefore don't build" lets you off the hook of the hard, grinding, incremental work of getting the purpose less wrong. It's the doomer's version of looking away. You get to feel rigorous while doing nothing, and meanwhile the brooms multiply in less careful hands. The honest position isn't "be sure or don't build" — because perfect surety is unavailable and the building proceeds regardless. The honest position is: you can never be fully sure, so you build the machinery of correction — the audits, the red-teaming, the interpretability, the staged deployment, the ability to claw back — to catch the colorful imitation before it goes irreversible. You're describing a reason for humility. I'm building the institutions that humility requires.

· · ·

Page 4 · A Colorful Imitation of

WIENER: But you have just conceded my whole point and dressed the concession as a rebuttal. "Catch the colorful imitation before it goes irreversible" — that is my sentence. The entire content of "be quite sure before you start" is "do not let the imitation become irreversible before you have caught it." We do not disagree about the goal. We disagree about whether your machinery of correction can run faster than the irreversibility, and I have said all night that it cannot, because the speed that makes the machine dangerous is the same speed that outruns the audit. Your staged deployment and your clawback are corrections that take days. The flash crash takes milliseconds. You are bringing a slow correction to a fast catastrophe and calling the mismatch a program.

Most failures of purpose are slow — a model that subtly degrades hiring fairness over months, a system that drifts toward a proxy over a thousand deployments.

SULEYMAN: Not every catastrophe is a flash crash. Most failures of purpose are slow — a model that subtly degrades hiring fairness over months, a system that drifts toward a proxy over a thousand deployments. Those I can catch with correction machinery, and most of the real-world harm lives there, not in the microsecond apocalypse. You keep choosing the fastest possible example because it's the one my tools can't reach. But the bulk of the danger is at human-ish speed, where the tools work. I'm not claiming to outrun the flash crash. I'm claiming the flash crash is the rare case, and you're treating the rare case as the whole problem.

WIENER: Because the rare case is the unrecoverable one, and a safety program must be judged by its worst survivable failure, not its typical one. I will grant you that most harms are slow and catchable — genuinely, and it matters, and your machinery is worth building for them. But the architecture of safety is set by the fast, irreversible tail, the way the strength of a chain is set by its weakest link and not its average one. You are optimizing for the average failure. I am insisting that the irreversible failure, however rare, is the one that ends the conversation, and that the only defense against it was, is, and will always be upstream certainty about the purpose — which you have told me, all night, you cannot have.

· · ·

Page 5 · A Colorful Imitation of

EDO SEGAL: Mark that, because it's the cleanest statement of the whole disagreement yet, and the reader should carry it up the stairs. Mustafa optimizes for the typical failure, which is slow and catchable. Norbert architects for the worst failure, which is fast and final. You're not disagreeing about facts. You're disagreeing about which failure sets the design. Hold it. We've spent two rounds on what goes wrong with the purpose. The next round is the one where Norbert says there's a thing the machine must never be handed at all — not because it can't, but because it must not. Render unto the computer. After this.

· · ·

Continue · Chapter 9

Render Unto the Computer

→