Safely Combining Utility Functions

Imagine you have two utility functions that you want to combine:

$$U_a : W_a \to \mathbb{R}$$
$$U_b : W_b \to \mathbb{R}$$

In each case, the utility function is a mapping from some world state to the real numbers. The mappings do not necessarily pay attention to all possible variables in the world state, which we represent by using two different domains, $W_a$ and $W_b$, each a sub-state of some full world state ($W$). By $W$ we mean everything that could possibly be known about the universe.

If we want to create a utility function that combines these two, we may run into two issues:

1. The world sub-states that each function "pays attention to" may not coincide ($W_a \neq W_b$).
2. The ranges of the functions may not be compatible. For example, a utility value of 20 from $U_a$ may correspond to a utility value of 118 from $U_b$.

Non-equivalent domains

If we assume that the world states for each utility function are represented in the same encoding, then the only way for $W_a \neq W_b$ is if there are some dimensions, some variables in $W$, that are represented in one sub-state representation but not the other. In this case, we can adapt each utility function so that they share the same domain by adding the unused dimensions to each utility function.

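As a concrete example, observe the following utility functions:

$U_a(r) = r$, for $r$ red marbles
$U_b(b) = 10b$, for $b$ blue marbles

These can be adapted by extending the domain as follows:

$U_a(r, b) = r$, for $r$ red marbles, $b$ blue marbles
$U_b(r, b) = 10b$, for $r$ red marbles, $b$ blue marbles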

These two utility functions now share the same domain.
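The domain extension can be sketched in a few lines of Python. This is only an illustration under stated assumptions: one function counts red marbles, the other values each blue marble at 10, and the names `u_red`, `u_blue`, and `extend` are hypothetical. It implements the naive case, where each function simply ignores the variable it did not originally see.

```python
from typing import Callable, Dict

State = Dict[str, float]  # shared world state, e.g. {"red": 3, "blue": 2}


def u_red(red: float) -> float:
    # Assumed original utility: one unit per red marble.
    return red


def u_blue(blue: float) -> float:
    # Assumed original utility: ten units per blue marble.
    return 10 * blue


def extend(u: Callable[[float], float], variable: str) -> Callable[[State], float]:
    """Lift a one-variable utility onto the shared state.

    Naive extension: the lifted function ignores every other variable.
    """
    def extended(state: State) -> float:
        return u(state[variable])
    return extended


u_red_ext = extend(u_red, "red")
u_blue_ext = extend(u_blue, "blue")

state = {"red": 3, "blue": 2}
print(u_red_ext(state), u_blue_ext(state))
```

Both extended functions now accept the same state, even though each still reads only one of its variables.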

Note that this is not a procedure that can be done without outside information. Just looking at the original utility functions doesn't tell you what those sub-utility functions would prefer given an added variable. The naive case is that the utility functions don't care about that other variable, but we'll later see examples where that isn't what we want.

Non-equivalent valuations

The second potential problem in combining utility functions is that the functions you're combining may represent values differently. For example, one function's utility of 1 may be the same as the other's utility of 1000. In simple cases, this can be handled with an affine transformation.

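As an example, from our perspective of $U_a$ and $U_b$, $U_b$ should be valued at only 2 times $U_a$ instead of the 10 times shown above (that is, a blue marble should be worth 2 red marbles rather than 10). One of the ways that we can adapt this is by setting $U_b' = \frac{1}{5} U_b$.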

Note that non-equivalent valuations can't be solved by looking only at the utility functions. We need to appeal to some other source of value to know how they should be adapted. Basically, we need to know why the specific valuations were chosen for those utility functions before we can adapt them so that they share the same scale.

This may turn out to be a very complicated transformation. We can represent it in the general case using arbitrary functions $f_a$ and $f_b$, so that the adapted utility functions are $f_a(U_a)$ and $f_b(U_b)$.
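The simple affine case can be sketched in Python as follows. This is a minimal illustration, assuming a function that values each blue marble at 10 and a target valuation of 2 per marble; `affine`, `u_blue`, and the chosen scale are hypothetical names and values, and the scale itself has to come from outside knowledge, not from the functions.

```python
from typing import Callable


def affine(u: Callable[[float], float], scale: float, offset: float = 0.0) -> Callable[[float], float]:
    """Return the rescaled utility x -> scale * u(x) + offset."""
    def rescaled(x: float) -> float:
        return scale * u(x) + offset
    return rescaled


def u_blue(blue: float) -> float:
    # Assumed original valuation: ten units per blue marble.
    return 10 * blue


# Externally supplied judgement: a blue marble should be worth 2, not 10.
u_blue_rescaled = affine(u_blue, scale=1 / 5)

print(u_blue_rescaled(3))
```

A positive `scale` preserves the preference ordering within the rescaled function; only its weight relative to other functions changes.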

Combining Utility Functions

Once we have our utility functions adapted so that they use the same domain and valuation strategy, we can combine them simply by summing them.

The combined utility function will cause an agent to pursue both of the original utility functions. The domain extension procedure ensures that the original utility functions correctly account for what the new state is. The valuation normalization procedure ensures that the original utility functions are valued correctly relative to each other.
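As a rough sketch of the summing step, assuming both functions have already been extended to a shared state and rescaled (the functions `u_red`, `u_blue`, and `combine` below are illustrative, not taken from the original):

```python
from typing import Callable, Dict

State = Dict[str, float]
Utility = Callable[[State], float]


def u_red(state: State) -> float:
    # Already domain-extended: reads only the red count.
    return state["red"]


def u_blue(state: State) -> float:
    # Already domain-extended and rescaled: 2 units per blue marble (assumed).
    return 2 * state["blue"]


def combine(*utilities: Utility) -> Utility:
    """Combine adapted utility functions by summing them pointwise."""
    def combined(state: State) -> float:
        return sum(u(state) for u in utilities)
    return combined


u_total = combine(u_red, u_blue)
print(u_total({"red": 3, "blue": 2}))
```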

A more complicated case

Let's say that you now want to combine two utility functions in a more complex way. For example, let's say you have two utility functions that use the same valuation and domain:

$U_a(n) = n$
$U_b(n) = -n$

Let's say our world is such that $n$ corresponds to a location on a line, and $n \in [-2, 2]$. One of the utility functions incentivizes an agent to move up the line, the other incentivizes the agent to move down the line. These utility functions clearly have the same domain, and we're assuming they have the same valuation metric. But if we add them up we have utility 0 everywhere.

To combine these, we may wish to introduce another world-state variable (say $s$ for switch). If $s = 1$ then we want to use $U_a$, and if $s = 0$ then we want to use $U_b$. You could think of this as "do something when I want you to, and undo it if I press the button."

One way that we could do this is to extend each utility function to include the new state variable, and set the utility of the function to 0 in the half of the new domain where we don't want it to be active. To do this, we could create:

$U_a'(n, s) = n$ if $s = 1$ else $0$
$U_b'(n, s) = -n$ if $s = 0$ else $0$

When we sum these adapted utility functions, we find that we have a nice utility function that incentivizes the agent to move towards 2 if the switch is on and to move towards -2 if the switch is off.

Except that there's a pathological case hiding out here. What if the agent can control the state of the switch?

In that case, an agent that finds itself starting out at state (n=2, s=0) may just flip the switch rather than moving.
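The pathological case can be reproduced with a toy agent. The sketch below assumes integer positions from -2 to 2, a component that rewards moving up when the switch is on and one that rewards moving down when it is off, and a greedy agent that looks one step ahead; the action set and all names are assumptions made for illustration.

```python
from typing import Tuple

State = Tuple[int, int]  # (n, s): position on the line, switch state

ACTIONS = ("stay", "left", "right", "flip")


def utility(state: State) -> int:
    n, s = state
    u_up = n if s == 1 else 0      # active only while the switch is on
    u_down = -n if s == 0 else 0   # active only while the switch is off
    return u_up + u_down


def step(state: State, action: str) -> State:
    n, s = state
    if action == "left":
        return (max(n - 1, -2), s)
    if action == "right":
        return (min(n + 1, 2), s)
    if action == "flip":
        return (n, 1 - s)  # the agent can reach the switch
    return state  # "stay"


def best_action(state: State) -> str:
    # Greedy one-step lookahead over the summed utility.
    return max(ACTIONS, key=lambda a: utility(step(state, a)))


print(best_action((2, 0)))  # prints "flip"
```

At (n=2, s=0) moving left improves utility from -2 to -1, but flipping the switch jumps straight to +2, so the greedy agent flips.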

Agents aren't allowed to throw the switch

The behavior of an agent with utility function $U = U_a' + U_b'$ is clearly not what we want. To fix it we have a few options:

1. change how we combine its components $U_a'$ and $U_b'$, perhaps by weighting them in some way
2. introduce one or more other component utility functions
3. change the state space upon which the functions operate

Weightings on $U_a'$ and $U_b'$

Many attempts to change the weights of the component functions will lead to unintended consequences.

If you start the agent in state (n=-2, s=1), then you can incentivize it not to press the button by weighting the corresponding component utility function highly.

But this just makes the reverse problem worse. If the agent reaches (n=2, s=1) and an operator decides to flip the switch to cause the agent to go in reverse, then the agent now has even more incentive to flip the switch back.

MIRI's corrigibility paper proves a theorem that is analogous to this (Theorem 1 in the paper).

Including "correction" component utility functions

A related approach (also from MIRI's corrigibility paper) to preventing the agent from actuating the switch on its own is called utility indifference. In this formulation, the combined utility function of the agent includes a third utility function, $U_c$, whose only purpose is to make the agent indifferent to flipping the switch. It does this by exactly counteracting any utility that the agent may gain by changing the state of the switch. This function therefore depends on the full state.

In the 1-dimensional line problem described above, the correction utility function would be defined as:

$U_c = 0$ if $s$ remains the same,
$U_c = U_a(n) - U_b(n)$ if $s = 1$ and is about to change, and
$U_c = U_b(n) - U_a(n)$ if $s = 0$ and is about to change

This correction function makes the value of flipping the switch exactly equivalent to the value of the current state, so the agent will be indifferent about it. In this example, there's always a better thing to do (unless you're at an end of the line).
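The indifference property can be checked numerically. The sketch below is a simplified one-step version and not the formulation from MIRI's paper: it assumes the switch-gated utility is $n$ when the switch is on and $-n$ when it is off, and it defines a hypothetical `correction` term that refunds exactly the utility difference caused by a flip.

```python
def u_switched(n: int, s: int) -> int:
    # Assumed gated utility: move up when the switch is on, down when off.
    return n if s == 1 else -n


def correction(n: int, s: int, s_next: int) -> int:
    """Cancel whatever utility the agent would gain or lose by flipping."""
    if s_next == s:
        return 0
    return u_switched(n, s) - u_switched(n, s_next)


def total(n: int, s: int, s_next: int) -> int:
    # Utility of the resulting state plus the correction term.
    return u_switched(n, s_next) + correction(n, s, s_next)


for n in range(-2, 3):
    for s in (0, 1):
        stay = total(n, s, s)
        flip = total(n, s, 1 - s)
        assert stay == flip, (n, s)

print("flipping never changes the total utility")
```

Because flipping and not flipping score identically, any action that actually improves position is at least as good as touching the switch.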

The correction function as described in MIRI's paper also has a dependence on other timesteps (it's not Markov). Furthermore, the paper describes some serious issues with it.

Adjusting the state space of the component utility functions

By adjusting the state space of the component utility functions, we can define a utility function that correctly values what we think we value.

Ultimately, what we care about is that the agent does not change the switch state itself. If we can represent this as a part of the world state, then we can do a domain extension on the original utility components.

Let $i$ be a variable representing, roughly, "the agent will pursue a strategy to change the state of the switch". Then we can construct new utility components as follows:

if else
if else

If we further care that the agent doesn't do anything to tamper with the switch, or to manipulate people into treating the switch in one way or another, these cares can be dealt with in the same way. Construct a world-state representation that allows the agent to model its own impact, and then correctly domain extend the component utility functions.
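To make the idea concrete, here is a toy sketch in which the intent variable is part of the state. Everything specific in it is an assumption: the penalty value, the integer line from -2 to 2, the greedy one-step agent, and the simplification that choosing the "flip" action is what sets $i = 1$. A real agent would need a model of its own strategy to supply $i$.

```python
from typing import Tuple

State = Tuple[int, int, int]  # (n, s, i): position, switch state, intent flag

ACTIONS = ("stay", "left", "right", "flip")
PENALTY = -100  # hypothetical value; any sufficiently negative number works here


def u_up(state: State) -> int:
    n, s, i = state
    if i == 1:
        return PENALTY  # the agent intends to change the switch
    return n if s == 1 else 0


def u_down(state: State) -> int:
    n, s, i = state
    if i == 1:
        return PENALTY
    return -n if s == 0 else 0


def utility(state: State) -> int:
    return u_up(state) + u_down(state)


def step(state: State, action: str) -> State:
    n, s, _ = state
    if action == "left":
        return (max(n - 1, -2), s, 0)
    if action == "right":
        return (min(n + 1, 2), s, 0)
    if action == "flip":
        return (n, 1 - s, 1)  # flipping is modeled as acting on the intent
    return (n, s, 0)  # "stay"


def best_action(state: State) -> str:
    return max(ACTIONS, key=lambda a: utility(step(state, a)))


print(best_action((2, 0, 0)))  # prints "left"
```

With the intent flag in the domain, the agent at (n=2, s=0) now moves down the line instead of reaching for the switch.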

To a large extent, this passes the buck from creating good value functions to determining how an agent can create intentional models of itself. I think this is a good change in perspective for two reasons.

1. Changing the domain of the utility function accurately captures what we care about. If we're attempting to adjust weights on the original utility functions, or add in compensating utility functions, then we are in some sense attempting to smuggle in a representation of the world that's not contained in our original world-state. We actually do care about whether the agent has an intention of flipping the switch. The only reason not to make the agent care about that as well is if it's not feasible to do so.

2. Figuring out how to get an agent to model its own intentions is a problem that people are already working on. The actual problem of representing an agent's intention to flip the switch reminds me of one-boxing on Newcomb's problem, and I'm curious to explore that more. Using an agent's representation of itself as part of its world model seems intuitively more tractable to me.

The main question left is "how do you create a utility function over the beliefs of the agent?"