A few weeks ago, a friend asked me what I thought about Eliezer Yudkowsky’s views on AI. I said I was skeptical and that he seemed fearmonger-y in interviews, but that I’d check out his book and share my thoughts. There was a backlog at the library, but I just finished listening to If Anyone Builds It, Everyone Dies by Eliezer Yudkowsky & Nate Soares. It was an unsettling book. I don’t completely agree with the core premise, but I feel like I understand the authors’ perspective a little better now. The audiobook is a scant six hours long. If you’ve got the time and you can borrow it from your library, I’d recommend giving it a listen. I’d love to talk about it more.
The core idea is that this wave of AI, built on transformers, is grown rather than constructed. Even the best researchers in the world are frequently surprised by the results; training is more of an art form than a pure science. The authors share instances of sycophantic and duplicitous behavior: you don’t get only the behavior you trained for. The problem this creates is alignment. If we don’t fully understand what these systems are capable of, how can we be sure they are truly aligned with humanity? At their core, these are alien intelligences.
The next problem is that we don’t know where the threshold for true superintelligence lies. At what point will an AI truly be smarter than a human? The AI labs are racing to get there, but given that this is less science than art, we don’t know how long it will take. Meanwhile, techniques for alignment and control are being outpaced by raw progress in capabilities.
The authors seem to see it as a foregone conclusion that someone will create a superintelligence. Once we reach that threshold, we hit an intelligence explosion: the AI will be able to keep improving itself at a rate far beyond human-guided progress. And once we’ve created ASI without alignment, it is just a matter of time (probably a very short amount of time, given the timescales these models operate at) until humans aren’t needed.
To illustrate this, the book includes several parables and “future history” segments describing how things could go. The authors are careful to say they don’t believe events would actually unfold this way - their scenarios aren’t weird enough, because humans came up with them. The longest one closely mirrors the plot of Universal Paperclips, a great little game you should try if you haven’t seen it before.
Much like in the movie WarGames, the only way to win the ASI game is not to play. The authors end the book describing methods for slowing down progress, which closely resemble the way nuclear capabilities are monitored: GPUs should be controlled, limited, and, when used for AI purposes, monitored by international observers.
At the beginning, I said the book was unsettling and that I didn’t completely agree with its conclusions. It was unsettling because the arguments, generally speaking, do make sense. I can’t poke holes in them wholesale, but they just feel off. It’s as though the authors are overcommitted to a logical chain in which each step appears reasonable, yet which arrives at a conclusion that is unexpected and difficult to justify (or to deny). The arguments are much better supported in the book than in the various interviews I’ve heard with Yudkowsky.
I don’t know if there is a flaw I’m failing to describe, or if I’m just not imaginative enough to see how this plays out. The book warns against anthropomorphizing AI, yet its speculative scenarios rely on the AI making choices driven by human-like goals and strategies: it optimizes, schemes, and deceives like a runaway AI from a sci-fi story. The authors acknowledge that reality would likely be a lot stranger, but it still feels like a cop-out. This might be the flaw I’m searching for, but I’m not confident enough to dismiss their warnings on that basis alone.