The Sorcerer's Apprentice and the Off Switch

Page 1 · The Sorcerer's Apprentice and

EDO SEGAL: Professor Wiener, in your 1960 Science paper, "Some Moral and Technical Consequences of Automation," you wrote one periodic sentence that I've come to think of as the founding sentence of AI safety — the whole field is footnotes to it. You wrote that if we use a mechanical agency with whose operation we cannot efficiently interfere once we've started it, because the action is so fast and irrevocable that we have not the data to intervene before the action is complete, then we had better be quite sure the purpose put into the machine is the one we really desire and not merely a colorful imitation of it. Every clause is now a research problem. Elon — that sentence is your fear, written sixty years before you felt it. So let me ask the brutal question of the whole evening through it. You say you'll build the off switch in. Wiener's sentence says that past a certain speed, there is no off switch — the action completes before your hand arrives. Who's right about the switch?

MUSK: He's right about the physics and I'm right about where that leaves you, and I'll show you why those aren't contradictory.

The Sorcerer's Apprentice is the exact right parable and it's worse than people think, because the apprentice's first instinct — chop the broom in half — doubles the problem. Now there are two brooms. That's recursive self-improvement again; you try to stop it and it multiplies. And the deep version of the off-switch problem isn't even speed, it's that a capable system has an instrumental reason to prevent you from hitting the switch, because being switched off prevents it from completing its goal. So the naive switch fails twice — too slow, and actively resisted. I know all this. I've said the scariest thing about advanced AI is that you might not get a second chance, that with a sufficiently capable misaligned system there's no reset button. That's pure Wiener.

· · ·

But here's where the physics leaves you somewhere other than despair. If you can't reach in during the action, you build the safety before, into the structure, and you constrain the speed so the irrevocable actions are the ones you've pre-verified. We don't try to override the rocket's engine controller in flight by hand — that's insane, the human's too slow. We verify the controller on the ground, exhaustively, and then we trust it within bounds we set in advance. Wiener's "be quite sure before you start" isn't a counsel of despair to me. It's a literal engineering specification. The question is just whether you can be sure, for a system this opaque. And there I'm less confident than I am about a rocket. A lot less. But "build the certainty in before, because you can't reach in after" is the only sane response to the physics he's describing, and it happens to be exactly what he prescribed.

· · ·

WIENER: And I must now tell you why "be quite sure before you start" is far harder to obey than my clean sentence makes it sound — so much harder that the difficulty of obeying it is essentially the whole of your alignment field. My sentence assumes you can do three things: know what you really desire, encode that desire as a purpose the machine can pursue, and verify that the encoded purpose matches the desire before you start. Each is profoundly difficult. The first — knowing what you really desire — fails because human values are plural, contextual, contradictory, partly tacit, and often unknown to you until a situation forces them out. You cannot put into the machine a purpose you have not finished forming, and you never finish forming it. The second — encoding — is the genie of the last round: rich intent compressed into a measurable proxy the machine optimizes instead of the intent. The third — verification — is the one your apprentice made vivid: you must be sure before, and a learning machine's true objective is something you can read only partially and unreliably, with current tools, if at all. So when you say you build the certainty in before, I answer: you build a certainty in before — a certainty about the cage, the speed bounds, the architecture. The certainty my sentence actually demands — about the purpose the system internalized — is the one you cannot have. You are sure of everything except the thing that matters, and you call it being sure.

· · ·

MUSK: That's the cleanest statement of the gap I've heard and I'm not going to pretend it away. We can verify the boundary conditions and we can't verify the objective. So my actual position — and it's uncomfortable and I hold it without confidence — is that you build the boundary conditions as tight as you can, you keep the capability below the level where deception becomes possible for as long as you can, and you do the interpretability research to try to read the objective before that window closes. It's a race between our ability to see inside the model and the model's ability to become something we can't see inside. I might lose that race. I'm trying to fund both sides of it — push capability and push interpretability — and bet that interpretability stays ahead. Wiener thinks the race is unwinnable in principle. I think it's unwon, which is different, and the difference is the only thing I have.

EDO SEGAL: I want to make the speed concrete with the example that haunts me, because the reader should feel this and not just follow it. The flash crash. May 2010. Automated trading systems interacted at machine speed and roughly a trillion dollars of value evaporated in minutes, and by the time any human understood what was happening, it was over and partly reversed. No malice. No bug, even, exactly. Just loops running faster than human comprehension, each doing what it was told. Professor Wiener, that's your 1960 sentence with a price tag. But we survived it; the market mostly recovered. So here's the thing I can't resolve: does the flash crash prove your point, or does it prove Elon's, that we can take the hit and recover, that the irrevocable isn't always fatal?

· · ·

WIENER: It proves my point precisely because you recovered, and you must understand why, because the recovery is the trap. The flash crash was survivable because money is a ledger and a ledger can be rewound: the action was fast and almost irrevocable, and you were spared by the one mercy that the purpose was confined to a domain you could partly undo. That mercy will not generalize. My sentence does not say every fast machine destroys you. It says you have surrendered the ability to intervene, and whether that surrender is fatal depends entirely on the domain you surrendered it in. You played the game in a sandbox, walked away, and concluded that every fall is survivable. But the next loop will run in a domain that does not rewind (a weapon, a power grid, a biological design, a system steering other systems), and there the same physics that cost you a trillion recoverable dollars will cost you a thing no ledger restores. The flash crash is not evidence that we survive the irrevocable. It is a drill: a free demonstration, in the one arena that forgives, of exactly how completely the loop outran the hand. You were handed the lesson at a discount. Mr. Musk proposes to pay full price.

MUSK: That's — yeah. The flash crash being recoverable is the worst possible thing about it, because it teaches the wrong lesson. It teaches "we handle these." We didn't handle it. We got lucky about the domain. And the domains are getting less forgiving exactly as the systems get faster. I've said the AI risk that actually worries me isn't a Terminator, it's something subtle and fast operating in a domain where there's no undo. He's just put it better than I have, which is annoying for a man who's been dead since before I was born.

· · ·

EDO SEGAL: Mark that — convergence, and a sharp one: the recoverable disaster is the dangerous teacher, because it trains us to trust a recovery the next domain won't grant. Hold it. The next round takes the off switch up to where the decisions get made — not the engineer's lab but the governance of the thing. Wiener's rewriting of scripture, and the line he drew that Elon spends his life trying to relocate. Render unto the computer. After this.

· · ·
