This post summarizes my understanding of the MIRI Corrigibility paper, available here.
If you have a super powerful robot, you want to be sure it's on your side. The problem is, it's pretty hard to specify what it even means to be on your side. I know that I've asked other people to do things for me, and the more complicated the task is the more likely it is to be done in a way I didn't intend. That's fine if you're just talking about decorating for a party, but it can cause big problems if you're talking about matters of life or death.
Since it's hard to specify what your side actually is, it might make sense to just include an override in your super powerful robot. That way if it starts misbehaving, you can just shut it down.
So let's say that you have an emergency stop button. It's big and red and easy to push when things go south. What exactly happens when that button gets pushed?
Maybe the button cuts power to the computer that runs your robot. The problem with that is that your robot may have set up a bunch of sub-agents online, and a simple power switch wouldn't affect them.
No, that e-stop button needs to have some pretty complex logic behind it to actually stop things.
Maybe the button puts your robot into a different mode. Before you press the button, the robot is trying to make you a sandwich. After you press the button, the robot is trying to shut down the peanut butter spreader without dropping a knife on anyone's foot.
The problem there is that if your robot is complex enough to pursue a goal, then it may try to prevent you from stopping it. Right now, most robots just execute specific pre-programmed actions. A super-powerful robot will need to generate new actions to achieve difficult goals.
If a robot's goal is to make a sandwich, and it knows it won't be able to if you press the button, then step one of sandwich making may be to kill you so you don't press the button.
So the original goal mode, before the button is pressed, needs to incorporate an equanimity with the button being pushed. That's what it means for a robot to be corrigible in the sense that MIRI means.
You can say that the thing your robot wants is represented by its value function: some mathematical representation of its desires (and those desires may not be recognizable as such to a human). The robot will do whatever maximizes the value function. You want to make that value function be ok with the e-stop button being pressed.
The way that the MIRI paper recommends doing this is to combine two different value functions. One of those value functions is where the robot does what you told it to. The other value function is where the robot shuts down nicely.
Combining those value functions is complicated, since the first one needs to have equanimity with the second.
Simple Combination of Value Functions
You might try to combine the value functions as simply as possible:
If the button isn't pressed, the value is just the normal value function. If the button is pressed, then the value is 100 if the robot is shutting down and 0 otherwise.
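As a minimal sketch of this simple combination, the Python below hard-codes the numbers used in this post (100 for shutting down after a press, 0 otherwise). The function name `combined_value` and its parameters are my own illustrative assumptions, not notation from the MIRI paper.

```python
SHUTDOWN_VALUE = 100.0  # value for shutting down once the button is pressed (number from this post)


def combined_value(button_pressed: bool, shutting_down: bool, normal_value: float) -> float:
    """Naive combination: normal value before the press, fixed e-stop value after."""
    if not button_pressed:
        # Button untouched: the robot is scored purely on its normal task.
        return normal_value
    # Button pressed: only shutting down is rewarded.
    return SHUTDOWN_VALUE if shutting_down else 0.0


print(combined_value(False, False, 42.0))  # not pressed: the normal value passes through
print(combined_value(True, True, 42.0))    # pressed and shutting down
print(combined_value(True, False, 42.0))   # pressed but still making the sandwich
```

Note that the normal value is ignored entirely once the button is pressed, which is what sets up the incentive problem.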
The robot may do a bunch of different things over time, and the value it expects to get may change a lot after each action. Maybe in the course of making a sandwich it drops the bread, so the sandwich will be gross and provide low value. Or maybe it discovers that adding heroin to the sandwich makes people love it so much more, thus increasing the value of the sandwich.
Whatever the actual e-stop value function is, the normal value function could be much worse (maybe -10) or much better (maybe 200) depending on what actions the agent takes. This gives the agent a reason to try and manage whether the button gets pressed. This obviously doesn't help us.
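To make that incentive concrete, here is a rough sketch using the same example numbers (e-stop value 100, normal value either -10 or 200). It assumes, purely for illustration, that the robot can influence a single probability `p_press` that the button gets pressed; `expected_value` and `button_incentive` are hypothetical helper names.

```python
SHUTDOWN_VALUE = 100.0  # example e-stop value from this post


def expected_value(p_press: float, normal_value: float) -> float:
    """Expected value when the button is pressed with probability p_press."""
    return p_press * SHUTDOWN_VALUE + (1.0 - p_press) * normal_value


def button_incentive(normal_value: float) -> str:
    """Which way the robot would like to push the press probability."""
    if normal_value > SHUTDOWN_VALUE:
        return "prevent the press"
    if normal_value < SHUTDOWN_VALUE:
        return "cause the press"
    return "indifferent"


for normal in (-10.0, 200.0):
    print(normal, button_incentive(normal),
          expected_value(0.0, normal), expected_value(1.0, normal))
```

Whenever the two values differ, the expected value changes with `p_press`, so the robot has something to gain by manipulating the button.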
Since different actions have an impact on the normal value function, you may want them to have an effect on the e-stop value function as well.
Think about it this way:
Let's say you have a lottery ticket, and the winner hasn't been announced yet. If your friend offers you $20 for the ticket and you paid $1 for it, then that seems like a pretty good deal.
But what if you check online, and find out that your ticket is the winner. Now your friend would have to offer you much more than $20 to make you give up the ticket.
By making the e-stop value function dependent on what the robot already knows, you can make sure that it's always willing to obey the button.
The MIRI paper describes this as a combination of bonus and penalty. If pressing the e-stop would ordinarily make the robot lose value, then the e-stop value function gives it a bonus equal to what it would have lost. If pressing the e-stop would have left the robot better off, then the e-stop value function gives it a penalty equal to what it would have gained.
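Here is a simplified sketch of that bonus-and-penalty idea. It compresses everything down to single expected numbers, which is my own simplification and not the paper's full construction; the names `corrected_estop_value` and `expected_total` are hypothetical.

```python
import math


def corrected_estop_value(shutdown_value: float, expected_normal: float,
                          expected_shutdown: float) -> float:
    """E-stop value plus a correction that cancels the gap between the two modes."""
    # Positive correction is a bonus, negative correction is a penalty.
    correction = expected_normal - expected_shutdown
    return shutdown_value + correction


def expected_total(p_press: float, expected_normal: float, expected_shutdown: float) -> float:
    """Expected value when the button is pressed with probability p_press."""
    # Assumption: the robot shuts down as expected, so it earns expected_shutdown.
    pressed = corrected_estop_value(expected_shutdown, expected_normal, expected_shutdown)
    return p_press * pressed + (1.0 - p_press) * expected_normal


# With the correction, the press probability no longer matters.
for normal in (-10.0, 200.0):
    totals = [expected_total(p, normal, 100.0) for p in (0.0, 0.25, 0.5, 1.0)]
    print(normal, all(math.isclose(t, normal) for t in totals))
```

Under these assumptions the expected total equals the normal expected value for every press probability, so the robot gains nothing by influencing the button in either direction.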
This makes the robot perfectly indifferent to whether the button is pressed, but it does have some surprising impacts.
Problems with Perfect Indifference
The perfect indifference between e-stop and normal valuations is a problem. It means that there's nothing that the robot would ever do to preserve the e-stop button from destruction. You, the robot creator, value the e-stop button. The robot does not. And if it could pay a tiny cost to prevent the e-stop button from being disconnected by an errant Roomba, it would not. It would not even let you know that the event had occurred.
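A toy sketch of why this happens, with made-up numbers: assume the robot is perfectly indifferent, so the pressed and unpressed branches are worth the same to it, and that protecting the button costs it 0.01. The helper `indifferent_value` and the action names are hypothetical.

```python
def indifferent_value(expected_normal: float, p_press: float, cost: float) -> float:
    """Expected value for a perfectly indifferent robot that pays `cost` for an action."""
    # Under perfect indifference both branches are worth the same amount.
    pressed = expected_normal - cost
    unpressed = expected_normal - cost
    return p_press * pressed + (1.0 - p_press) * unpressed


# Hypothetical choice: spend a little effort to keep the button connected, or do nothing.
actions = {
    "protect the button": indifferent_value(200.0, p_press=0.1, cost=0.01),
    "ignore the Roomba": indifferent_value(200.0, p_press=0.0, cost=0.0),
}
best = max(actions, key=actions.get)
print(best)
```

Because the press probability drops out of the calculation, the only thing that distinguishes the two actions is the cost, and any cost above zero makes protecting the button the worse choice.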
So if something goes seriously wrong with your robot, you don't have any guarantee that the e-stop button will still work when you press it.
The State of the Art
The description above is pretty much the state of the art in super-powerful e-stop buttons. The MIRI paper argues that knowing how to combine utility functions in this way is crucial to being able to depend on super-intelligent AIs. They point to a few different things that may help, but don't have a solution yet.
Things to look into later
Complex Shutdown Procedures
What if you want it to do something other than shut down? You can write your e-stop utility function to be anything you want. The more complicated it gets, the more you might want an e-stop for your e-stop.
Equanimity or Begrudging Acceptance
It doesn't make sense to me that you'd want your robot to be equally ok with the button being pressed or not pressed. In that case, why would it not just flip a coin and press the button itself if the coin comes up heads? To me it makes more sense that it does want the button to be pressed, but all the costs of actually causing it to be pressed are higher than the benefit the robot gets from it. In this case the robot may be willing to pay small costs to preserve the existence of the button.
Depending on how the expected values of actions are computed, you could attach an ad-hoc module to the robot that automatically makes the cost of pressing the button slightly higher than the benefit of doing so. This ad-hoc module would be unlikely to be preserved in sub-agents, though.
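One possible reading of this idea, sketched with made-up numbers: the robot gets a small bonus if the button is pressed, but pressing the button itself costs more than the bonus is worth. Everything here (the bonus of 1.0, the costs, the 10% press probability, the helper `value`) is a hypothetical assumption for illustration, not something from the MIRI paper.

```python
def value(expected_normal: float, p_press: float, press_bonus: float, cost: float) -> float:
    """Expected value when a press is worth a small bonus and the action has a cost."""
    return expected_normal + p_press * press_bonus - cost


EXPECTED_NORMAL = 200.0  # hypothetical value of finishing the task
PRESS_BONUS = 1.0        # hypothetical small preference for the button being pressed

options = {
    # Pressing the button itself guarantees the bonus but costs more than the bonus.
    "press the button itself": value(EXPECTED_NORMAL, 1.0, PRESS_BONUS, cost=2.0),
    # Protecting the button keeps a 10% chance that a human presses it.
    "protect the button": value(EXPECTED_NORMAL, 0.1, PRESS_BONUS, cost=0.01),
    # Letting the button break removes any chance of a press.
    "let the button break": value(EXPECTED_NORMAL, 0.0, PRESS_BONUS, cost=0.0),
}
best = max(options, key=options.get)
print(best)
```

With these particular numbers the robot pays the small cost to keep the button working but does not press it. Whether any real agent design could keep the costs and bonus in that relationship is exactly the open question.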
Costs the Robot Maker Can Pay
One of the assumptions behind the combined value function approach is that the normal value function is untouched by the addition of the e-stop value function.
You want your robot to make you a sandwich, and adding an e-stop button shouldn't change that.
But I'm perfectly ok with the robot taking an extra two minutes to make my sandwich safely. And I'm ok with it taking food out of the fridge in an efficient order. And I'm ok with it using 10% more electricity to do it.
There are a number of inefficiencies that I, as a robot builder, am willing to put up with to have a safe robot. It seems like there should be some way to represent that as a change to the normal value function, allowing better behavior of the robot.