GPT-3, MuZero, and AI Safety

Edited 2020/08/31 to remove an erroneous RNN comment.

I spent about six months in middle school being obsessed with The Secret Life of Dilly McBean by Dorothy Haas. It's a book about an orphan whose mad scientist parents gave him super magnetism powers before they died. When the book opens, he's been adopted and moved into a new town. Many magnetism adventures follow, including the appearance of some shadowy spy figures.

After many exasperating events, it's finally revealed that (spoiler) the horrible no-good Dr. Keenwit is trying to take over the world. How, you might ask? By feeding all worldly knowledge into a computer. Dr. Keenwit would then be able to ask the computer how to take over the world, and the computer would tell him what to do. The shadowy spy figures were out collecting training data for the computer.

Middle school me got a great dose of schadenfreude at the final scene where Dilly runs through the rooms and rooms of magnetic tape drives, wiping all of Dr. Keenwit's precious data with his magnetism powers and saving the world from a computer-aided dictator.


Dr. Keenwit would love GPT-3. It's a transformer network that was trained on an enormous amount of online text. Given the way the internet works these days, you could round it off to having been trained on all worldly knowledge. If Dr. Keenwit had gotten his evil hands on GPT-3, would even Dilly McBean have been able to save us?

The internet has been flooded with examples of what GPT-3 can (and can't) do. Kaj Sotala is cataloging a lot of the more interesting experiments, but a few of the biggest results are:

How is it doing those things? Using all of that source text, GPT-3 was trained to predict new text based on whatever came before it. If you give it the first half of a sentence, it will give you the second half. If you ask it a question, it will give you the answer. While it's technically just a text prediction engine, various forms of text prediction are the same as conversation. It's able to answer questions about history, geography, economics, whatever you want.

Even Tyler Cowen has been talking about how it's going to change the world. Tyler is careful to reassure people that GPT-3 is no SkyNet. Tyler doesn't mention anything about Dr. Keenwit, but I have to guess that he's not worried about that problem either.

GPT-3 isn't likely to cause the end of the world. But what about GPT-4? Or GPT-(N+1)?

GPT-3, on its own, just predicts text. It predicts text like a human might write it. You could give it 1000 times more NN parameters and train it on every word ever written, and it would still just predict text. It may eventually be good enough for Dr. Keenwit, but it'll never be a SkyNet.


We don't have to worry about a GPT SkyNet because GPT isn't an agent. When people talk about agents in an AI context, that means something specific. An agent is a program that interacts with an environment to achieve some goal. SkyNet, for example, is interacting with the real world in order to achieve its goal of world domination (possibly as an instrumental goal to something else). Dr. Keenwit is interacting with society for the same goal. All people are agents, but not all programs are.
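To make "interacts with an environment to achieve some goal" concrete, here's a minimal sketch of that loop. Everything in it (the `CounterEnvironment` class, `greedy_agent`, `run_episode`) is a toy I made up for illustration, not anything from a real library or from the systems discussed here.

```python
class CounterEnvironment:
    """Toy environment: the state is an integer and the goal is to reach `target`."""

    def __init__(self, target=5):
        self.target = target
        self.state = 0

    def step(self, action):
        # The action is +1 or -1. Reward is 1 only on the step that reaches the target.
        self.state += action
        done = self.state == self.target
        return self.state, (1 if done else 0), done


def greedy_agent(state, target):
    """Hypothetical policy: pick the action that moves the state toward the goal."""
    return 1 if state < target else -1


def run_episode(env, max_steps=20):
    """The agent loop: observe the state, act, receive a reward, repeat."""
    state, total_reward = env.state, 0
    for _ in range(max_steps):
        action = greedy_agent(state, env.target)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return state, total_reward


final_state, total_reward = run_episode(CounterEnvironment(target=5))
```

The point of the sketch is the shape of the loop: the program's outputs change the world it then observes. A bare text predictor has no such loop unless someone builds one around it.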

This isn't to say that GPT-N couldn't be dangerous. A nuke isn't an agent. Neither is an intelligence report, but that intelligence report could be very dangerous if read by the right person.

But GPT is a bit more worrisome than an intelligence report or a history book. GPT can interact with you and answer questions. It has no goal other than predicting text, but in the age of the internet text prediction can solve an enormous number of problems. Like writing working software.

If you give GPT an input, it will provide an output. That means that you could feasibly make it into an agent by piping its text output to a web browser or something. People are already proposing additions to GPT that make it more agent-y.

The thing is, GPT still has only one goal: predicting human generated text. If you give it access to a web browser, it'll just output whatever text a human would output in response to whatever is on the page. That's not something that's going to make complicated take-over-the-world plans, though it might be something that talks about complicated take-over-the-world plans.

What if we build structure around GPT-N to turn it into an agent, and then tweak the training objective to do something more active? Do we have SkyNet yet? Steve2152 over at LessWrong still doesn't think so. He comes up with a list of things that an Artificial General Intelligence (like SkyNet) must have, and argues that GPT will never have them due to its structure.

Steve2152's argument hinges on how efficient GPT can be with training data. The GPT architecture isn't really designed for doing things like matrix multiplication or tree search. Both of those things are likely to be important for solving large classes of problems, and GPT would be pretty inefficient at doing them. The argument then analogizes from being inefficient at certain problems to being unable to do other problems (similar to how standard DNNs just can't do what an RNN can do).

Instead of using a transformer block (which GPT uses), Steve2152 would have us use generative-model based AIs. In fact, he thinks that generative-model based AI is the only thing that could possibly reach a generalized (AGI) status where it could be used to solve any arbitrary problem better than humans. His generative-models seem to just be a group of different networks, all finding new ideas that explain some datapoint. Those models then argue among each other in some underspecified way until one single model emerges the winner. It sounds a lot like OpenAI's debate methods.

I'm not convinced by this generative-model based argument. It seems too close to analogizing to human cognition (which is likely generative-model sub-agents in some way). Just because humans do it that way doesn't mean it's the only way to do it. Furthermore, Steve2152's argument equates GPT with all transformer architectures, and the transformer can be used in other ways.

Transformers, more than meets the eye

Obviously an AI trained to generate new text isn't going to suddenly start performing Monte Carlo Tree Search. But that doesn't mean that the techniques used to create GPT-3 couldn't be used to create an AI with a more nefarious agent-like structure. Standard DNNs have been used for everything from object recognition to image generation to movie recommendations. Surely we can reuse GPT techniques in similar ways. GPT-3 uses a transformer architecture. What can we do with that?

It turns out we can do quite a lot. Nostalgebraist has helpfully explained how the transformer works, and he's also explained that it can model a super-set of functions described by things like convolutional layers. This means we can use transformers to learn even more complicated functions (though likely at a higher training expense). The transformer architecture is much more generalizable than models that have come before, which I think largely explains its success.

If we wanted SkyNet, we wouldn't even necessarily need to design control logic ourselves. If we connect up the output of the GPT-3 architecture to a web browser and tweak the cost function before re-training, we could use the same transformer architecture to make an agent.

It's not even clear to me that the transformer will never be able to do something like tree search. In practice, a transformer only outputs one word at a time. When you want more than one output word, you just repeat the output portion of the transformer again while telling it what it just output. (You can get a good example of what that looks like in this explainer). If you train a transformer to output sentences, it'll do it one word at a time. You just keep asking it for more words until it says that it's done by giving you an end-of-sequence symbol. It seems possible to use this structure to do something like tree search, where the output it gives includes some kind of metadata that lets it climb back up the tree. You'd never get that with the training corpus that GPT-3 uses, but with the right training data and loss function it seems feasible (if very inefficient).
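The one-word-at-a-time loop can be sketched in a few lines. This is only an illustration of the control flow: `predict_next` is a made-up lookup table standing in for a trained transformer (a real one conditions on the whole sequence so far, not just the last word), and the `<start>` and `<end>` markers are names I've assumed for the start and stop symbols.

```python
END = "<end>"  # assumed name for the stop symbol

# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": END,
}


def predict_next(tokens):
    """Return the next token given everything generated so far."""
    return TOY_NEXT_TOKEN.get(tokens[-1], END)


def generate(prompt, max_tokens=10):
    """Repeatedly ask for one more token, feeding back what was already output."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        next_token = predict_next(tokens)
        if next_token == END:
            break  # the model has signalled that it's done
        tokens.append(next_token)
    return tokens


sentence = generate(["<start>"])
```

Because each output is appended to the input for the next step, the output stream is the only scratch space the model gets. That's why any tree-search-like behavior would have to be encoded in the tokens themselves.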

But if we're really worried about being able to do tree search (or some other specific type of computation) in our future SkyNet, then maybe we can just put that code in manually.

AlphaGo to MuZero

Hard-coded agent-like structure is a large part of what made DeepMind's AlphaGo and its descendants so powerful. These agents play games, and they play them well. AlphaGo and AlphaZero set world records in performance, and are able to play Go (a famously hard game) at superhuman levels.

The various Alpha* projects all used a combination of the game rules, a hand-coded forward planning algorithm, and a learned model that evaluated how "good" a move was (among other things). The planning algorithm iteratively plans good move after good move, predicting the likely end of the game. The move that is predicted to best lead to victory is then chosen and executed in the actual game. In technical terms, it's doing model based reinforcement learning with tree-search based planning.
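Here's a rough sketch of how those three pieces (rules, planner, evaluator) fit together. To keep it short I've swapped in a depth-limited negamax search on a toy stone-taking game rather than the Monte Carlo Tree Search the Alpha* agents actually use, and `value_estimate` is a hand-written heuristic standing in for a learned network. All names are my own.

```python
def legal_moves(pile):
    """Game rules: take 1, 2, or 3 stones. Whoever takes the last stone wins."""
    return [m for m in (1, 2, 3) if m <= pile]


def value_estimate(pile):
    """Stand-in for a learned evaluator, scored for the player about to move.

    Assumption: hand-coded heuristic for this toy game, where a pile that is
    a multiple of 4 is a losing position for the player to move.
    """
    return -1.0 if pile % 4 == 0 else 1.0


def negamax(pile, depth):
    """Hand-coded planner: look ahead `depth` moves, then trust the evaluator."""
    if pile == 0:
        return -1.0  # the opponent took the last stone, so the player to move lost
    if depth == 0:
        return value_estimate(pile)
    return max(-negamax(pile - m, depth - 1) for m in legal_moves(pile))


def choose_move(pile, depth=3):
    """Pick the move whose resulting position is worst for the opponent."""
    return max(legal_moves(pile), key=lambda m: -negamax(pile - m, depth - 1))


best = choose_move(10)
```

Note that `legal_moves` is consulted directly at every step of the search. That's the part AlphaZero was handed for free.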

By changing what game rules AlphaZero used, it could be trained to superhuman levels on Chess, Go, or Shogi. But each game needed the game rules to be manually added. When it needs to know where a knight would be allowed to move, AlphaZero could just consult the rules of chess. It never had to learn them.

Now DeepMind has released a paper on MuZero, which takes this to a new level. MuZero learns the game rules along with goodness of moves. This means that you can train it on any game without having to know the rules of the game yourself. MuZero achieves record breaking performance on board games and Atari games after automatically learning how the game is played.

With MuZero, the game rules are learned as a hidden state. This is pretty different from prior efforts to learn a game model from playing the game. Before this, most efforts emphasized recreating the game board. Given a chess board and a move, they'd try to predict what the chess board looks like after the move. It's possible to get decent performance doing this, but a game player built this way is optimizing to produce pictures of the game instead of optimizing to win.

MuZero would never be able to draw you a picture of a predicted game state. Instead, its game state is just a big vector that it can apply updates to. That vector is only loosely associated with the state of the actual game (board or screen). By using an arbitrary game state definition, MuZero can represent the game dynamics in whatever way lets it win the most games.

MuZero uses several distinct neural nets to achieve this. It has a network for predicting hidden game state, a network for predicting game rules (technically, game dynamics), and a network for predicting a move. These networks are all hand-constructed layers of convolutional and residual neural nets. DeepMind in general takes the strategy of carefully designing the overall agent structure, instead of just throwing NN layers and compute at the problem.
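The division of labor between those three networks can be sketched with plain functions. This is a toy under heavy assumptions: the arithmetic inside `represent`, `dynamics`, and `predict` is made up so the example runs, where the real thing uses trained networks, and the one-step lookahead in `plan_one_step` stands in for MuZero's tree search.

```python
ACTIONS = (-1, 1)  # hypothetical two-action game


def represent(observation):
    """Representation: raw observation -> hidden state (just a vector)."""
    return [0.5 * x for x in observation]


def dynamics(hidden, action):
    """Dynamics: (hidden state, action) -> (next hidden state, predicted reward)."""
    next_hidden = [h + action for h in hidden]
    return next_hidden, sum(next_hidden)


def predict(hidden):
    """Prediction: hidden state -> (policy over ACTIONS, value estimate)."""
    policy = [1.0 / len(ACTIONS)] * len(ACTIONS)  # uniform, for illustration only
    return policy, sum(hidden)


def plan_one_step(observation):
    """Imagine each action entirely in hidden-state space and pick the best one."""
    hidden = represent(observation)
    scores = {}
    for action in ACTIONS:
        next_hidden, reward = dynamics(hidden, action)
        _, value = predict(next_hidden)
        scores[action] = reward + value
    return max(scores, key=scores.get), scores


best_action, scores = plan_one_step([2.0, 4.0])
```

After the first call to `represent`, the planner never touches the real game again. Nothing forces the hidden vector to look like a board, which is the point made above about optimizing to win rather than to draw pictures.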

I'm a lot more worried about MuZero as a SkyNet progenitor than I am about GPT-3. But remember what we learned from Nostalgebraist above? The transformers that GPT-3 is based on can be used to learn more general functions than convolutional nets. Could GPT and MuZero be combined to make a stronger agent than either alone? I think so.

It's interesting to note here that MuZero solves one of the most common complaints from writers about GPT-3. GPT-3 often loses the thread of an argument or story and goes off on a tangent. This has been described as GPT not having any internal representation or goal. Prosaically, it's just generating text because that's what it does. It's not actually trying to tell a story or communicate a concept.

MuZero's learned hidden state, along with a planning algorithm like MCTS, is able to maintain a consistent plan for future output over multiple moves. Its hidden state is the internal story thread that people are wanting from GPT-3 (this is a strong claim, but I'm not going to prove it here).

I like this plan more than I like the idea of plugging a raw GPT-3 instance into a web browser. In general, I think making agent structure more explicit is helpful for understanding what the agent is doing, as well as for avoiding certain problems that agents are likely to face. The hand-coded planning method also bootstraps the effectiveness of the model, as DeepMind found when they trained MuZero with planning turned off and got much worse performance (even compared to MuZero trained with planning turned on and then run with planning turned off).


The main follow-on question, if we're going to be building a MuGPT-Zero3 model, is what "winning" means to it. There are a lot of naive options here. If we want to stick to imitating human text, it sure seems like a lot of people treat "getting other people to agree" as the victory condition of conversation. But the sky is the limit here: we could choose any victory condition we want. Text prediction is a highly underconstrained problem compared to Go or Atari.

That lack of constrained victory condition is a big part of the AGI problem in the first place. If we're going to be making AI agents that interact with the real world to achieve goals, we want their goals to be aligned with our own human goals. That's how we avoid SkyNet situations, and we don't really know how to do it yet. I think lack of knowledge about useful value functions is likely the biggest thing keeping us from making AGI, aligned or not.

If we ask whether we can get AGI from GPT or MuZero, then we get into all sorts of questions about what counts as AGI and what kind of structure you might need to get that. If we just ask whether GPT and MuZero are a clear step towards something that could be dangerous on a global level (like SkyNet), then I think the answer is more clear.

We're getting better at creating models that can answer factual questions about the world based on text gleaned from the internet. We're getting better at creating models that can generate new text that has long-duration coherence and structure. We're not yet perfect at that, but the increase in capability from five years ago is stunning.

We're also getting better at creating agents that can win games. As little as 6 years ago, people were saying that a computer beating a world champion at Go was a decade away. It happened 5 years ago. Now we have MuZero, which gets record-breaking scores on Atari games after learning the rules through trial and error. MuZero can match AlphaGo's Go performance after learning Go's rules through trial and error. This is also a stunning increase in game playing ability.

We don't have a good way to constrain these technologies to work for the good of humanity. People are working on it, but GPT-3 and MuZero seem like good arguments that capabilities are improving faster than our ability to align AI to human needs. I'm not saying that we need to run through the datacenters of DeepMind and OpenAI deleting all their data (and Dilly McBean's magic magnetism powers wouldn't work with contemporary storage technology anyway). I am saying that I'd love to see more emphasis on alignment right now.

There are a few different organizations working on AI alignment. OpenAI itself was originally formed to develop AI safely and aligned with human values. So far most of the research I've seen coming out of it hasn't been focused on that. The strongest AI safety arguments I've seen from OpenAI have been Paul Christiano basically saying "we should just build AGI and then ask it how to make it safe."

In all fairness to OpenAI, I haven't tracked their research agenda closely. Reviewing their list of milestone releases reveals projects that seem to emphasize more powerful and varied applications of AI, without much of a focus on doing things safely. OpenAI is also operating from the assumption that people won't take them seriously unless they can show they're at the cutting edge of capabilities research. By releasing things like GPT, they're demonstrating why people should listen to them. That does seem to be working, as they have more political capital than MIRI already. I just wish they had more to say about the alignment problem than Paul Christiano's blog posts.

In fairness to Paul Christiano, he thinks that there's a "basin of attraction" for safety. If we build a simple AI that's in that basin, it will be able to gradient descend into an even safer configuration. This intuitively makes sense to me, but I wouldn't bet on that intuition. I'd want to see some proof (like an actual mathematical proof) that the first AGI you build is starting in the basin of attraction. So far I haven't seen anything like that from Paul.

DeepMind, on the other hand, was founded to create a general purpose AI. It wasn't until Google bought it that they formed an internal ethics board (which apparently has a secret membership). They do have an ethics and society board (separate from their internal ethics board) that is also working on AI safety and human alignment (along many different axes). It seems like they're taking it seriously now, and they have a fairly active blog with detailed information.

MIRI is working exclusively on AI safety, not on capabilities at all. They've released some papers I find pretty intriguing (especially related to embedded agency), but publicly visible output from them is pretty sporadic. My understanding is that they keep a lot of their work secret, even from other people that work there, out of fear of inspiring larger capability increases. So I basically have no idea what their position is in all this.

All of this leaves me worried. The ability to create AGI seems closer every year, and it seems like we're making progress on AGI faster than we are making progress on friendly AI. That's not a good place to be.