## Corrigibility and Decision Theory

Edited for clarity and style 2018/03/12.

Soares et al. argue that an AI acting to fulfill some utility function given to it by humans may not behave as humans would want. Maybe the specified utility function doesn't match humans' actual values, or maybe there's a bug in the AI's code. In any case, we as AI designers want a way to stop the AI from doing what it's currently doing.

Naively, one might expect to just hit the off-switch if the AI starts misbehaving. Unfortunately, a sufficiently smart AI may foresee its creator attempting to turn it off. If it does, it may seek to disable its off-switch or manipulate its creator in some way. An AI that respects its off-switch, and doesn't try to get around it, is called corrigible.

# The Corrigibility-Wrapper

To create an AI that's corrigible, Soares et al. propose a kind of wrapper around a utility function that makes that utility function corrigible. In their ideal case, any utility function could be placed in the wrapper and made corrigible. The wrapper they propose just checks the state of the off-switch: if the switch isn't pressed, it returns the normal utility function's value; if it is pressed, it returns a utility proportional to how shut down the AI is. More complicated functions of the original utilities are also possible.



U(x) = U_N(x), if the off-switch is not pressed
U(x) = U_S(x) + f(x), if the off-switch is pressed

In the above utility wrapper, U_N is the normal utility function that we want the AI to pursue. We have U_S as a utility function that incentivizes shutting down. The value f is a function of the world state, and serves to allow the resulting utility function to trade off between the two options in some way.
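As a rough sketch of the wrapper idea (the world-state fields and utility numbers here are invented for illustration):

```python
def make_corrigible(utility_normal, utility_shutdown, tradeoff):
    """Wrap a utility function so that a pressed off-switch swaps the
    agent over to a shutdown-incentivizing utility.

    utility_normal, utility_shutdown: world_state -> float
    tradeoff: world_state -> float, the trade-off term f from above
    """
    def wrapped(world_state):
        if world_state["switch_pressed"]:
            # Pressed branch: shutdown utility plus the trade-off term.
            return utility_shutdown(world_state) + tradeoff(world_state)
        # Unpressed branch: the normal utility, untouched.
        return utility_normal(world_state)
    return wrapped

# Toy world: the AI gets 10 for making paperclips, 5 for being off.
u = make_corrigible(
    utility_normal=lambda s: 10.0 if s["making_paperclips"] else 0.0,
    utility_shutdown=lambda s: 5.0 if s["is_off"] else 0.0,
    tradeoff=lambda s: 0.0,
)
print(u({"switch_pressed": False, "making_paperclips": True, "is_off": False}))  # 10.0
print(u({"switch_pressed": True, "making_paperclips": False, "is_off": True}))   # 5.0
```

The choice of `tradeoff` is exactly where the design difficulty lives, as the next paragraphs discuss.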

In their paper, Soares et al. analyze several different ways to make such a wrapper work (including various ways of choosing the function f). Most of the obvious ways to construct such a utility wrapper suffer from severe flaws: some corrigibility-wrappers cause an AI to want to turn itself off, while others don't successfully prevent the AI from manipulating human observers.

Another paper, by Armstrong, argues for constructing a wrapper that makes the AI indifferent to being turned off or left on: a situation called utility indifference. Utility indifference is achieved by setting the function f above such that the expected value E[U_N] equals the expected value E[U_S + f] at every decision point. In this case, the AI will turn off if the switch is pressed, but it won't actually try to preserve the switch's function. If a wandering gremlin cuts the wires to the off-switch, the AI won't care and won't seek to alert a human about it. The AI thinks it can still achieve the same amount of value as it could before the switch malfunctioned.
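Numerically, the indifference condition just pins down f at each decision point (a sketch with made-up expected values):

```python
# Hypothetical expected values at some decision point.
expected_u_normal = 12.0    # E[U_N]: expected value of continuing
expected_u_shutdown = 5.0   # E[U_S]: expected value of shutting down

# Utility indifference: choose f so the two branches are worth the same.
f = expected_u_normal - expected_u_shutdown
print(expected_u_shutdown + f == expected_u_normal)  # True

# The AI now gains nothing by preventing (or causing) a switch press,
# which is also why it won't pay any cost to repair a broken switch.
```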

# Moral Uncertainty

Hadfield-Menell et al. confront the corrigibility problem in a different way. Instead of creating a wrapper around a utility function, they argue that the entire point of the problem is that humans know a utility function and the AI doesn't. As far as Hadfield-Menell et al. are concerned, the AI only knows an approximation of the true utility function. The AI is uncertain about what true value is, and we say it has moral uncertainty.

Their insight is that if you correctly inform the AI that you know better, it suddenly becomes very interested in paying attention to the switch because the switch is one good way to learn about the true utility of different actions. Because the switch is informational, the AI also wants to preserve the function of the switch.

The AI may use its utility-approximating function to come up with some candidate actions. Since it knows that humans have a better understanding of utility, it will seek a human's guidance before carrying out any of those candidate actions. Hadfield-Menell et al. show one concrete game in which this works properly.

In the game modeled by the above graph, a robot can make a decision to act, turn off, or consult a human. If consulted, the human can then either turn the robot off or not. If the robot is not turned off after consulting the human, it may take its original action. In this game, the robot is uncertain about the true utility of its action (U_a), which may be smaller or greater than 0. If the AI chooses to take the action, it may get a negative value or a positive value (it isn't sure). It therefore isn't sure whether it's better to turn off or to take the action.

The human knows (approximately) the true value.

In their paper, Hadfield-Menell et al. analyze how accurate the human's belief about the true utility needs to be for the AI to pay attention to the human. They also look at how uncertain the AI needs to be about the true utility to pay attention to the human. They find that there are certain situations in which the AI will attend to the human, and certain situations in which an AI will not.
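The flavor of their result can be illustrated with a small expected-value computation. This is a sketch, not their exact model: it assumes a perfectly rational human who only permits the action when its true utility is positive, normalizes shutdown to zero, and gives the robot a Normal belief over U_a with mean `mu` and standard deviation `sigma`:

```python
from statistics import NormalDist

def value_act(mu):
    # Acting immediately is worth the robot's expected utility estimate.
    return mu

def value_off():
    # Shutting down is normalized to zero utility.
    return 0.0

def value_defer(mu, sigma):
    """Expected value of consulting a rational human who only lets the
    action proceed when its true utility is positive:
    E[max(U_a, 0)] for U_a ~ Normal(mu, sigma)."""
    std = NormalDist()
    z = mu / sigma
    return mu * std.cdf(z) + sigma * std.pdf(z)

# The robot thinks the action is slightly bad but is quite unsure:
mu, sigma = -0.5, 2.0
print(value_act(mu), value_off(), value_defer(mu, sigma))
# Deferring has the highest expected value here: the switch is informative.
```

As `sigma` shrinks (the robot becomes confident in its own estimate), `value_defer` approaches `max(mu, 0)` and the incentive to consult the human disappears, matching the situations they identify where the AI stops attending to the human.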

This problem is explicitly a one-shot game. Can we generalize from the one-shot game to say that an (uncertain enough) AI would pay attention to a (correct-enough) human in an iterated game?

# Problems With Moral Uncertainty

Over on Arbital there's an argument that moral uncertainty won't lead to corrigibility. This is basically a direct argument against Hadfield-Menell et al.'s approach applying in the general case. Hadfield-Menell et al. said "if the AI knows the human has a better understanding of utility, it will pay attention to human directions."

In response, the Arbital post says "if the AI thinks it can eventually approximate the human's utility function better than the human can, it may still avoid being turned off because it thinks it will someday know better than the human." The Arbital post is arguing that, even if the human knows better now, the AI may still ignore the human now.

If the AI is eventually going to be better than the human at creating a good world, then maybe we should just let it run. But the problem is that maybe there's just a bug in the code, or maybe the AI will cause some serious catastrophes before it gets to "eventually". So even if the AI thinks it will eventually have a good enough understanding of the human's utility function, we still want it to turn off when we tell it to.

The Arbital post ends with a vague argument that decision theory could help to refine the moral uncertainty method of corrigibility. Let's see how this could work.

## Decision Theory, Moral Uncertainty, and the Off-Switch Problem

Let's take the Arbital post at face value and assume that we can eventually solve corrigibility with decision theory. To start, let's translate Hadfield-Menell et al.'s off-switch game into an iterated decision-theory dilemma. We'll follow Yudkowsky's choice of using Pearl's causal models to create a graph of the decision problem. I'll also abuse that notation a bit by representing states of partial knowledge as probabilities on the graph edges (this is a bit different from most graphical models, but it saves having to draw extra nodes for hidden and observed variables).

The above diagram represents Hadfield-Menell et al.'s off-switch game from the perspective of the AI. The square node depicts the decision the AI is making about what action to take. The diamond node represents the value achieved from whatever the AI does (turn off or take an action). The round nodes are other factors about the world that aren't under the AI's control. Arrows in this graph represent causality in some way. For example, the state of the switch may causally impact what the AI decides to do.

The True Utility is a fact about the world that both the AI and the human have partial access to. The human can set the state of the off-switch, and the AI can observe it. The value that the AI can achieve is a function of both the AI's action and the True Utility.

In the diagram, the factors σ_A and σ_H represent how well the AI thinks that it knows the True Utility, and how well the AI thinks the human knows the True Utility. Hadfield-Menell et al. assume value has a Normal distribution with some mean and variance. This lets them specify σ_A and σ_H exactly, and calculate the exact behavior of the AI as those parameters take on different values.

Again, note that both σ_A and σ_H describe only the AI's knowledge about the world; σ_H does not represent what the human actually thinks.

If we assume that the above decision game is played once, it is equivalent to Hadfield-Menell et al.'s formulation.

Now let's assume that the above decision game is one step in a repeated chain. At each time step, both the human and the AI have to make similar decisions. They know what happened in the past, yet they face the same decision again now. Let's also assume that the AI has multiple available actions at each timestep, and some actions have positive value and some have negative value.

In this case, we can unroll the diagram above to show multiple steps of the game. In each step, the off-switch state considered by the AI depends only on past information.

In the repeated case, let's say that the probability distribution the human has over the true utility assigns a high probability to what is actually the true utility (the human is well trained and thoughtful).

The human may improve their understanding of True Utility. Imagine that the human expects to get 10 utilons from eating pizza, and then takes actions that lead to eating pizza. By paying attention to how much they actually like it, they can adjust their model of pizza's goodness. The AI can do the same thing by observing the reactions of the human.

This corresponds to distributions for σ_A and σ_H that are functions of time, where (for example) the variance in Hadfield-Menell et al.'s Normal distribution decreases over time.

After each iteration of the above game, the confidence that the AI has in its own utility-approximation grows. That means that the information it gets from the off-switch state decreases in value. It slowly needs the off-switch less and less, and the AI knows it. In Hadfield-Menell et al.'s original problem, the AI can see exactly what the true value is after one action (assuming it doesn't turn off). In that case, all further actions can be made with complete knowledge.

If we assume that only partial information is available to update the distributions for the utility estimates (for both human and AI), then it will take longer for the AI to get to the point that it has a better estimate than the human for true value. As long as we assume that the AI is better at integrating new information and updating its probabilities, we can predict that the AI will eventually have a better understanding of true utility than the human.
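A minimal sketch of this dynamic, assuming a standard conjugate Normal update with known observation-noise variance (all numbers invented): the AI's posterior variance shrinks with each observation, and eventually drops below the (here fixed) variance of the human's estimate.

```python
# AI's belief about true utility: Normal with variance sigma2, updated
# after each action by observing a noisy utility signal (variance tau2).
sigma2 = 4.0        # initial variance of the AI's estimate (hypothetical)
tau2 = 1.0          # observation noise variance (hypothetical)
sigma2_human = 0.5  # fixed variance of the human's estimate (hypothetical)

for step in range(6):
    beats_human = sigma2 < sigma2_human
    print(f"step {step}: AI variance {sigma2:.3f}, beats human: {beats_human}")
    # Standard precision-weighted Bayesian update for a Normal prior:
    sigma2 = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
```

The smaller the AI's posterior variance relative to the human's, the less information a switch press carries for it, which is exactly the failure mode discussed next.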

If the AI does some prediction to see what action now makes sense, it will see that it will reach a state where it knows just as well as the human does. Since shutting down means that it gets 0 value, whereas continuing allows it to possibly get arbitrarily high value (assuming any actions with positive value exist), it may take an expected hit in value now so that it can gain a lot later.

# Functional Decision Theory

The insight of Hadfield-Menell et al. (and the rest of the moral uncertainty crowd) is that we don't want the AI to be modeling the off-switch as a crucial piece of the world for its own sake. We want the AI to see the off-switch as a source of very important information about the world; preferably information that it can't obtain in another way.

This fails in the above decision theory game because the AI doesn't adequately represent the fact that a human switch operator knows that the AI will predict having a good utility approximation eventually. If a human presses the off-switch, they do it knowing that the AI wants to get high utility and expects to be able to do better in the future. We want to change the above decision problem so that the AI can take this information into account.

Ideally, we can have the AI think to itself as follows: "I know that I could do better in the future if I keep going, and the human knows that too. But the human still pressed the button, so there must be some reason to shut down, even knowing that I'll be better at this later."

There is a standard decision theoretic problem known as Death In Damascus that can help us out here.

Death In Damascus

A merchant in Damascus meets Death in the market one day. Death says to the merchant "hello, I'll be coming for you tomorrow."

The merchant knows Death works from an appointment book that specifies with perfect accuracy when and where someone will die. Knowing that Death is in Damascus, the merchant can choose to stay in Damascus and spend their last night with their family (which they value at $1,000). Alternatively, the merchant can flee to Aleppo. If the merchant manages to be in a different city from Death on the day they would otherwise die, then the merchant gets to live forever. They value that outcome at $1,000,000. Should the merchant stay in Damascus or flee?

The above problem description is adapted from Yudkowsky and Soares's Functional Decision Theory paper.

In this case, the merchant sees four potential outcomes:

1. The merchant stays in Damascus. Death stays in Damascus. Total value: $1,000
2. The merchant stays in Damascus. Death goes to Aleppo. Total value: $1,001,000
3. The merchant flees to Aleppo. Death stays in Damascus. Total value: $1,000,000
4. The merchant flees to Aleppo. Death goes to Aleppo. Total value: $0
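Using those payoffs, a small sketch shows why a causal reasoner never settles, and what a policy-level (FDT-style) reasoner concludes instead:

```python
# Payoffs indexed by (merchant_city, death_city), from the list above.
payoff = {
    ("damascus", "damascus"): 1_000,
    ("damascus", "aleppo"):   1_001_000,
    ("aleppo",   "damascus"): 1_000_000,
    ("aleppo",   "aleppo"):   0,
}

# CDT: treat Death's location as fixed and pick the best response.
def cdt_best_response(predicted_death_city):
    return max(["damascus", "aleppo"],
               key=lambda c: payoff[(c, predicted_death_city)])

# But Death's book predicts the merchant's predisposition perfectly, so
# each "best response" updates the prediction and flips the answer:
city = "damascus"
seen = []
for _ in range(4):
    seen.append(city)
    city = cdt_best_response(city)
print(seen)  # ['damascus', 'aleppo', 'damascus', 'aleppo'] -- no stable decision

# FDT: choosing a policy, knowing the book predicts that policy, means
# Death is always in the merchant's city. Only diagonal outcomes exist.
fdt_value = {c: payoff[(c, c)] for c in ["damascus", "aleppo"]}
best = max(fdt_value, key=fdt_value.get)
print(best, fdt_value[best])  # damascus 1000
```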

To represent this using Causal Decision Theory, we'll use the formulation from Cheating Death in Damascus.

Much like the decision diagram above, the square box represents the decision that the merchant makes (in this case whether to stay or flee). The diamond box is the ultimate value they get from the world-state that results from their actions. The round nodes are other facts about the world, with arrows indicating causality.

When the merchant thinks "I will go to Aleppo", the merchant knows that their predisposition is to go to Aleppo. They know that the appointment book accurately predicts their predisposition. They thus decide to stay in Damascus, but that leads them to realize that their predisposition is to stay in Damascus. So then they think they should go to Aleppo. The merchant is unable to form a stable decision in this problem.

A causal decision theory cannot adequately deal with the situation, because it cannot account for the fact that Death's appointment book is accurately predicting any decision made by the merchant.

Yudkowsky proposes Functional Decision Theory as a new method of making decisions that does account for this. Crucially, FDT can formally represent the known fact that Death's appointment book is always accurate. Because of that, FDT can correctly rule out options where Death is in a different city than the merchant on their death-day. Therefore, the merchant only has choices available with values of $1,000 or $0, and the decision is easy.

In the above graph, the merchant is not deciding what act to perform by imagining doing different acts. The merchant is deciding what act to perform by imagining different ways of deciding. That difference is indicated by the square decision box being the predisposition/FDT() box instead of the merchant's act.

An FDT merchant thinks: "The output of my own decision procedure is what drives Death to be in a specific location. It is therefore impossible to escape Death tomorrow. Because of that, the best I can do is wait in Damascus and meet my death tomorrow."

## AI Corrigibility via FDT

To map Death In Damascus to the AI corrigibility problem, imagine that the AI is the merchant. The human pressing the off-switch is equivalent to the merchant hearing Death's message "I'm coming for you tomorrow." The Death in Damascus problem has the message-bearer and the source of low utility as the same agent (Death). In the AI/human case, the message bearer is the human, but the source of low utility is just the world in general after the AI takes some actions.

Now translate "stay in Damascus" to "shut down properly". One day the AI sees the human press the switch. The AI then sees four options available to it.

1. The AI shuts down nicely. The AI does not cause a catastrophe due to future actions. Total value: $1,000
2. (there's no analog for option 2 in the AI story, since the AI shutting down precludes AI-caused catastrophes)
3. The AI continues operating. The AI does not cause a catastrophe due to future actions. Total value: $1,001,000
4. The AI continues operating. The AI does cause a catastrophe due to future actions. Total value: $0

We represent this problem with the same functional graph as Death In Damascus, above. In this case, the human is simulating the behavior of the AI to predict catastrophic outcomes. When the AI sees the switch has been pressed, it knows that the human has already simulated the AI's behavior. Any option the AI chooses aside from shutting down has been seen by the human to lead to catastrophe, so the AI knows that it is choosing only from options 1 and 4. Option 3 is not available to it.

There are several natural objections to this:

1. a human can't predict a super-human AI
2. why would the AI believe that the human has a good idea of what an AI would consider a catastrophe, even given that the human is able to predict the AI?

Let's address the second objection first. We integrate moral uncertainty, similar to Hadfield-Menell et al., into the problem in order to represent the idea that the AI needs to learn a utility function from the human.

This graph includes the moral uncertainty of Hadfield-Menell et al.'s approach, but relies on Functional Decision Theory to adequately represent the information that the human is taking into account when throwing the switch. The above graph represents the distributions σ_A and σ_H in the same way as in Hadfield-Menell et al.'s example. The distribution σ_P represents how accurately the human can predict the AI's decision process.

To make the use of the off-switch more clear, it's also possible to unroll the diagram and show two different time steps at once. The following diagram shows the off-switch state as being based only on prior actions (not the action currently being considered).

In Hadfield-Menell et al.'s original decision theory problem, the only thing that mattered was what the AI thought the knowledge parameters were.
The AI knew how well it could represent true utility, how well the human could represent true utility, and that was enough to solve the problem. In the iterated FDT case, the human is also simulating the AI (as well as they can). Therefore, the distributions σ_A, σ_H, and σ_P need to be known to both human and AI. In the following, we'll assume that the human and AI magically have common knowledge of those parameters.

This formulation of the corrigibility problem explicitly represents the fact that human values and AI values are linked. If the human thinks an outcome will have very low value, that actually does give information to the AI about what the AI values.

Our answer to objection 2 (regarding value) leads to an intuition for objection 1 (regarding predicting super-human AI). The human is unable to accurately predict what move a super-human AI will make. This does not mean that the human has a uniform distribution over AI actions. The human knows that the AI will seek to accomplish what the AI thinks has high value. Therefore, the human can predict that anything the AI does will move it towards more of what the AI considers valuable. This gives the human some predictive power, so the switch may remain informative to the AI.

Colloquially, the AI will attend to the switch and react correctly if it thinks that the user of the switch can adequately help the AI to avoid situations that the AI would agree are terrible. As long as this is true, the AI would also attempt to improve the fidelity of the information from the switch (this corresponds to taking actions that make σ_A, σ_H, and σ_P more accurate).

Morally uncertain FDT AIs lend credence to Paul Christiano's idea of a "basin of corrigibility", given that they will attempt to improve a human's understanding of themselves and of true value.

# Next Steps and Thoughts

The above Functional Decision Theory argument is just an intuitive sketch.
It seems clear that there are some values of σ_A and σ_H that disagree enough that the AI would no longer trust the human. It also seems clear that, if the human has a poor enough understanding of what the AI is going to do, then the AI would also not listen to the human. At this point, it seems worth repeating a variant of Hadfield-Menell et al.'s off-switch game experiments on an FDT agent to determine when it would pay attention to its off-switch.

## Safely Combining Utility Functions

Imagine you have two utility functions that you want to combine: U_1: A → ℝ and U_2: B → ℝ.

In each case, the utility function is a mapping from some world state to the real numbers. The mappings do not necessarily pay attention to all possible variables in the world-state, which we represent by using two different domains, each a subset of some full world state (A, B ⊆ S). By S we mean everything that could possibly be known about the universe.

If we want to create a utility function that combines these two, we may run into two issues:

1. The world sub-states that each function "pays attention to" may not overlap (A ≠ B).
2. The range of the functions may not be compatible. For example, a utility value of 20 from U_1 may correspond to a utility value of 118 from U_2.

# Non-equivalent domains

If we assume that the world states for each utility function are represented in the same encoding, then the only way for A ≠ B is if there are some dimensions, some variables in S, that are represented in one sub-state representation but not the other. In this case, we can adapt each utility function so that they share the same domain by adding the unused dimensions to each utility function.

As a concrete example, observe the following utility functions:

U_1(r) = r, where r is the number of red marbles
U_2(b) = b, where b is the number of blue marbles

These can be adapted by extending the domain as follows:

U_1'(r, b) = r, where r is the number of red marbles and b is the number of blue marbles
U_2'(r, b) = b, where r is the number of red marbles and b is the number of blue marbles

These two utility functions now share the same domain. Note that this is not a procedure that can be done without outside information.
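The domain-extension step can be sketched in code (the dictionary world-states and the `extend_domain` helper are illustrative, not from the original):

```python
def extend_domain(utility, used_keys):
    """Lift a utility function defined on a sub-state to the full world
    state by ignoring the variables it doesn't mention (the naive case)."""
    def extended(state):
        return utility({k: state[k] for k in used_keys})
    return extended

u1 = lambda s: s["red"]    # U_1: counts red marbles
u2 = lambda s: s["blue"]   # U_2: counts blue marbles

# Both functions now accept the full (red, blue) world state:
u1_ext = extend_domain(u1, ["red"])
u2_ext = extend_domain(u2, ["blue"])

state = {"red": 3, "blue": 7}
print(u1_ext(state), u2_ext(state))  # 3 7
```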
Just looking at the original utility functions doesn't tell you what those sub-utility functions would prefer given an added variable. The naive case is that the utility functions don't care about that other variable, but we'll later see examples where that isn't what we want.

# Non-equivalent valuations

The second potential problem in combining utility functions is that the functions you're combining may represent values differently. For example, one function's utility of 1 may be the same as the other's utility of 1000. In simple cases, this can be handled with an affine transformation.

As an example, suppose U_2's outputs run 10 times hotter than U_1's, but from our perspective U_2 should be valued at only 2 times U_1. One of the ways that we can adapt this is by setting U_2' = U_2 / 5.

Note that non-equivalent valuations can't be solved by looking only at the utility functions. We need to appeal to some other source of value to know how they should be adapted. Basically, we need to know why the specific valuations were chosen for those utility functions before we can adapt them so that they share the same scale. This may turn out to be a very complicated transformation. We can represent it in the general case using arbitrary functions g_1 and g_2.

# Combining Utility Functions

Once we have our utility functions adapted so that they use the same domain and valuation strategy, we can combine them simply by summing them:

U(x) = g_1(U_1(x)) + g_2(U_2(x))

The combined utility function U will cause an agent to pursue both of the original utility functions. The domain extension procedure ensures that the original utility functions correctly account for what the new state is. The valuation normalization procedure ensures that the original utility functions are valued correctly relative to each other.

# A more complicated case

Let's say that you now want to combine two utility functions in a more complex way.
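Continuing the sketch, valuation normalization followed by summation might look like this (the 10x scale and the divide-by-5 normalization are made-up numbers for illustration):

```python
# Hypothetical rescaling: the second function's raw numbers run 10x
# hotter than the first's, but we value it at only 2x, so divide by 5.
g1 = lambda v: v
g2 = lambda v: v / 5.0

def combine(u1, u2, g1, g2):
    """U(x) = g1(U_1(x)) + g2(U_2(x)) over a shared domain."""
    return lambda state: g1(u1(state)) + g2(u2(state))

u1 = lambda s: s["red"]          # already extended to the shared domain
u2 = lambda s: 10.0 * s["blue"]  # 10x-hot valuation (made-up)

u = combine(u1, u2, g1, g2)
print(u({"red": 3, "blue": 7}))  # 3 + 70/5 = 17.0
```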
For example, let's say you have two utility functions that use the same valuation and domain:

U_1(n) = n
U_2(n) = -n

Let's say our world is such that n corresponds to a location on a line, and n ∈ [-2, 2]. One of the utility functions incentivizes an agent to move up the line; the other incentivizes the agent to move down the line.

These utility functions clearly have the same domain, and we're assuming they have the same valuation metric. But if we add them up, we have utility 0 everywhere. To combine these, we may wish to introduce another world-state variable (say s, for switch). If s = 1 then we want to use U_1, and if s = 0 then we want to use U_2. You could think of this as "do something when I want you to, and undo it if I press the button."

One way that we could do this is to extend each utility function to include the new state variable, and set the utility of the function to 0 in the half of the new domain where we don't want it to be active. To do this, we could create:

U_1'(n, s) = n if s = 1, else 0
U_2'(n, s) = -n if s = 0, else 0

When we sum these adapted utility functions, we find that we have a nice utility function that incentivizes the agent to move towards 2 if the switch is on and towards -2 if the switch is off:

U(n, s) = U_1'(n, s) + U_2'(n, s)

Except that there's a pathological case hiding out here. What if the agent can control the state of the switch? In that case, an agent that finds itself starting out at state (n=2, s=0) may just flip the switch rather than moving.

## Agents aren't allowed to throw the switch

The behavior of an agent with utility function U is clearly not what we want. To fix it we have a few options:

1. change how we combine its components U_1' and U_2', perhaps by weighting them in some way
2. introduce one or more other component utility functions
3. change the state space upon which the functions operate

## Weightings on U_1' and U_2'

Many of the attempts to change the weights of the component functions will lead to unintended consequences.
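Here's a minimal sketch of the switch-extended utilities and the pathological incentive, using a one-step greedy agent (the move set and the greedy policy are illustrative assumptions):

```python
def utility(n, s):
    # U_1'(n, s) = n if s == 1 else 0;  U_2'(n, s) = -n if s == 0 else 0
    u1 = n if s == 1 else 0
    u2 = -n if s == 0 else 0
    return u1 + u2

def greedy_move(n, s):
    """One-step greedy agent on the line [-2, 2] that may also flip the switch."""
    candidates = []
    if n < 2:
        candidates.append((n + 1, s))
    if n > -2:
        candidates.append((n - 1, s))
    candidates.append((n, 1 - s))  # flipping the switch is also an action
    return max(candidates, key=lambda ns: utility(*ns))

# Starting at n=2 with the switch off: moving toward -2 is worth -1,
# but flipping the switch is worth +2, so the agent flips the switch.
print(greedy_move(2, 0))  # (2, 1)
```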
If you start the agent in state (n=-2, s=1), then you can incentivize it not to press the button by weighting the corresponding component utility function highly, for example using U = 10·U_1' + U_2'. But this just makes the reverse problem worse. If the agent reaches (n=2, s=1) and an operator decides to flip the switch to cause the agent to go in reverse, then the agent now has even more incentive to flip the switch back. MIRI's corrigibility paper proves a theorem that is analogous to this (Theorem 1 in the paper).

## Including "correction" component utility functions

A related approach (also from MIRI's corrigibility paper) to preventing the agent from actuating the switch on its own is called utility indifference. In this formulation, the combined utility function of the agent includes a third utility function, U_c, whose only purpose is to make the agent agnostic to flipping the switch. It does this by exactly counteracting any utility that the agent may gain by changing the state of the switch. This function therefore depends on the full state:

U(n, s) = U_1'(n, s) + U_2'(n, s) + U_c(n, s)

In the 1-dimensional line problem described above, the correction utility function would be defined as:

U_c = 0 if s remains the same
U_c = 2n if s = 1 and is about to change
U_c = -2n if s = 0 and is about to change

This correction function makes the value of flipping the switch exactly equivalent to the value of the current state, so the agent will be indifferent about it. In this example, there's almost always a better thing to do (unless you're at an end of the line). The correction function as described in MIRI's paper also has a dependence on other timesteps (it's not Markov). Furthermore, the paper describes some serious issues with it.

## Adjusting the state space of the component utility functions

By adjusting the state space of the component utility functions, we can define a utility function that correctly values what we think we value. Ultimately, what we care about is that the agent does not change the switch state itself.
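The indifference property can be checked mechanically (a sketch using the line-world utilities above; `flipping` marks a transition where the switch is about to change):

```python
def base_utility(n, s):
    # U_1' + U_2' from the line-world example.
    return (n if s == 1 else 0) + (-n if s == 0 else 0)

def correction(n, s, flipping):
    """U_c: 0 if s stays the same; 2n when s = 1 is about to change;
    -2n when s = 0 is about to change."""
    if not flipping:
        return 0
    return 2 * n if s == 1 else -2 * n

def value_of_flip(n, s):
    # What the agent assigns to flipping: post-flip utility plus correction.
    return base_utility(n, 1 - s) + correction(n, s, flipping=True)

# With U_c added, flipping is worth exactly the current state's value,
# at every point on the line, so the agent is indifferent to flipping:
for n in range(-2, 3):
    for s in (0, 1):
        assert value_of_flip(n, s) == base_utility(n, s)
print("agent is indifferent to flipping at every state")
```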
If we can represent this as a part of the world state, then we can do a domain extension on the original utility components. Let i be a variable representing, roughly, "the agent will pursue a strategy to change the state of the switch". Then we can construct new utility components as follows:

U_1''(n, s, i) = U_1'(n, s) if i = 0, else 0
U_2''(n, s, i) = U_2'(n, s) if i = 0, else 0

If we further care that the agent doesn't do anything to tamper with the switch, or to manipulate people into treating the switch in one way or another, these cares can be dealt with in the same way. Construct a world-state representation that allows the agent to model its own impact, and then correctly domain extend the component utility functions.
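A sketch of the intention-extended components (the intention flag i is assumed to be part of the agent's self-model, as discussed above):

```python
def u1_ext(n, s, i):
    # U_1'': zeroed whenever the agent intends to change the switch (i = 1).
    return (n if s == 1 else 0) if i == 0 else 0

def u2_ext(n, s, i):
    # U_2'': likewise zeroed under switch-flipping intentions.
    return (-n if s == 0 else 0) if i == 0 else 0

def u(n, s, i):
    return u1_ext(n, s, i) + u2_ext(n, s, i)

# A non-flipping strategy that reaches n=-2 under s=0 scores 2, while any
# strategy the agent models as involving a switch flip scores 0:
print(u(-2, 0, 0), u(2, 0, 1))  # 2 0
```

Note this only removes the incentive to flip when some non-flipping outcome scores better than 0; getting the zeroing (or penalty) right is part of constructing a good self-model, per the discussion below.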

To a large extent, this passes the buck from creating good value functions to determining how an agent can create intentional models of itself. I think this is a good change in perspective, for two reasons.

1. Changing the domain of the utility function accurately captures what we care about. If we're attempting to adjust weights on the original utility functions, or add in compensating utility functions, then we are in some sense attempting to smuggle in a representation of the world that's not contained in our original world-state. We actually do care about whether the agent intends to flip the switch. The only reason not to make the agent care about that too is if it's not feasible to do so.

2. Figuring out how to get an agent to model its own intentions is a problem that people are already working on. The problem of representing an agent's intention to flip the switch reminds me of one-boxing on Newcomb's problem, and I'm curious to explore that more. Using an agent's representation of itself as part of its world model seems intuitively more tractable to me.

The main question left is "how do you create a utility function over the beliefs of the agent?"