Wireheading Defense

I once talked to somebody about doing heroin. I've never done it, and I was curious what it felt like. This person told me that heroin gave you the feeling of being love; that it was the best feeling he'd ever felt.

Hearing that did not make me want to do heroin more, even though I believed that it would cause me to feel such a great feeling. Instead, I became much more concerned about not letting myself give into the (admittedly slight) possibility that I might try it.

When I thought about trying it, I had a visceral reaction against it. The image that popped into my mind was myself, all alone in feeling love, ignoring the people that I actually loved. It was an image of being disconnected from the world.

Utility Functions

Utility functions form a large part of agent modeling. The idea is that if you give a rational agent a certain utility function, the agent will then act as though it wants what the utility function says is high value.

A large worry people have about utility functions is that some agent will figure out how to reach inside its own decision processes, and just tweak the number for utility to maximum. Then it can just sit back and do nothing, enjoying the sensation of accomplishing all its goals forever.

The term for this is wireheading. It hearkens to the image of a human with a wire in their brain, electrically stimulating the pleasure center. If you did this to someone, you would in some sense be destroying what we generally think of as the best parts of a person.

People do sometimes wirehead (in the best way they can manage now), but it's intuitive to most people that it's not good. So what is it about how humans think about wireheading that makes them relatively immune to it, and allows them to actively defend themselves from the threat of it?

If I think about taking heroin, I have a clear sense that I would be making decisions differently than I do now. I predict that I would want to do heroin more after taking than before taking it, and that I would prioritize it over things that I value now. None of that seems good to me right now.

The thing that keeps me from doing heroin is being able to predict what a heroin-addicted me would want, while also being able to say that is not what I want right now.

Formalizing Wirehead Defense

Consider a rational decision maker who uses expectation maximization to decide what to do. They have some function for deciding on an action that looks like this:

function decide(state-s):
  max_a = 0
  for a in available actions:
    utility(a) = 0
    for outcome o in possible outcomes:
      utility(a) += p(state-s, a -> o)*Utility(o)
    end for
    if (max_a == 0 or (utility(a) > utility(max_a)))
      max_a = a
    end if
  end for
  return action max_a

The decider looks at all the actions available to them given the situation they're currently in, and chooses the action that leads to the best outcome with high probability.

If the decider is making a series of decisions over time, they'll want to calculate their possible utility recursively, by imagining what they would do next. In this case, the utility function would be something like:

function transition(old_state, action_a):
  return new_state obtained by taking action_a in old_state;

function Utility(test_state):
  current_value = value(test_state)
  future_value = value(transition(test_state, decide(test_state)))
  return (current_value + future_value)

The transition function simulates taking an action in a given situation, and then returns the resulting new situation.

In the Utility function, the overall utility is calculated by determining the value of the current situation plus the value of the next situation as predicted by the decide() function.

To determine the value of a situation, the value() call just returns the observed value of the current world state. It may beĀ  a table of (situation, value) pairs or something more complicated.

In this way, we figure out what utility we get by seeing what the value is on exact next step, and adding to it the expected value for subsequent steps. This process could recursively call itself forever, so in practice there would be either a recursion depth limit or some stopping criterion in the states being tested.

This recursion can be thought of as the robot simulating its own future actions.

The wireheading threat appears if we find a state or set of states in the future that provide high utility as calculated by this function but don't correspond to a high utility in the outside world (perhaps as determined by the designers of this robot).

In the traditional story, the robot finds a way to reach into its own code and tweak the value() function so that it returns only high numbers. Basically, it performs brain surgery on itself.

To consider a robot defensive against wireheading, we would want it to assign low utility to performing brain surgery on itself, even while it understands that it would later achieve very high self-reported utility.

Approaching a wirehead decision

Let's say that the above algorithm is computing a policy for future actions, and it comes to consider an action that would result in what outside observers would call wireheading. Maybe it is considering changing a line of its own code, or taking heroin, or submitting to brain surgery. What is the above algorithm actually doing in that case?

To get to this point, the robot must have called the function "decide(s)" on a state where it is not currently wireheaded. In the course of figuring out its next action, the robot will consider an action that changes the robot itself in some way.

The line "utility(a) += p(s, a->o; x)*Utility(o)" calculates the probability that the action would lead to the outcome, then multiplies it by the utility of the outcome. In this case the action is brain surgery and the outcome is having a new "value()" function.

Whether or not this is a good plan depends on the "Utility(o)", which will just recursively call the "decide(o)" function again to find future value.

The crucial point here is that when "decide(o)" is called, the state "o" is such that a different type of decision making is now happening. Now, instead of simulating its own future actions, the robot should be simulating the actions of itself with a different program running.

Not much has been said up to now about what this "state" thing is. In some sense, it represents everything the robot knows about the world. Where objects are, what they are, how does physics work, etc.

What if the robot doesn't consider it's own state?

If the robot does not consider its own code (and other features) as a part of the state of the world, then the wireheading action would not clearly modify the world that the robot knows about. The decision algorithm would keep on predicting normal behavior after the wireheading had occurred: "sure you had brain surgery, but you still think the same way right?"

In this case, the robot may choose to wirehead because its decision algorithm calculated that it would be useful in some normal way. Once the wireheading had been done, the robot would then be making decisions using a different algorithm. The wireheaded robot would stop pursuing the plan that the original robot had been pursuing up to the point of being wireheaded, and begin to pursue whatever plan the wireheaded version of itself espoused.

This is equivalent to how humans get addicted to drugs. Few (no?) humans decide that being addicted to heroin would be great. Instead, heroin seems like a way to achieve a goal the human already has.

People may start taking heroin because they want to escape their current situation, or because they want to impress their friends, or because they want to explore the varieties of human consciousness.

People keep taking heroin because they are addicted.

What if the robot does consider its own state?

If the robot considers its own state, then when it recurses on the "decide(o)" it will be able to represent the fact that its values would have changed.

In the naive case, it runs the code exactly as listed above with an understanding that the "value()" function is different. In this case, the new "value()" function is reporting very high numbers for outcomes that the original robot wouldn't. If the wireheading were such that utility was now calculated as some constant maximum value, then every action would be reported to have the same (really high) utility. This makes the original robot more likely to choose to wirehead.

So simply changing the "value()" function makes the problem worse and not better.

This would be equivalent to thinking about heroin, realizing that you'll get addicted and really want heroin, and deciding that if future you wants heroin that you should want it to.

So considering changes to its own software/hardware isn't sufficient. We need to make a few alterations to the decision process to make it defensive against wireheading.

The difference between "what you would do" and "what future-you would do"

The problem with not taking into account a preference change after wireheading is that the robot would incorrectly predict its post-wirehead actions.

The problem with just packaging robot preferences in with the world-state of the prior algorithm is that, even though the robot is then able to correctly predict future actions, the valuations aren't consistent. A wireheaded robot takes the actions it thinks are highest utility, it just happens to be choosing actions the original would think were terrible.

In order to defend against wireheading, you need to:

1. accurately predict what a future (wireheaded) version of yourself would do
2. determine a value of future states that depends only on your current utility function

To get item 2 without sacrificing item 1, we're going to adapt our decision algorithm slightly.

function decide2(state-s):
  max_a = 0
  max_plan = 0
  for a in available actions:
    utility(a) = 0
    for outcome o in possible outcomes:
      (o_value, o_plan) = Utility2(o)
      utility(a) += p(state-s, a->o)*o_value
    end for
    if (max_a == 0 or (utility(a) > utility(max_a)))
      max_a = a
      max_plan = o_plan
    end if
  end for
  return (max_a, [s, max_plan])
  
function Utility2(test_state):
  current_value = test_state.value(test_state)
  (next_action, state_plan) = decide(test_state)
  
  future_value = 0
  for state in state_plan:
    future_value = test_state.value(state)
  
  test_state_utility = (current_value + future_value)

  return (test_state_utility, state_plan)

In this case, the decide2 function returns a tuple. The first element of the tuple is the next action to take. That's the same as the only return value in the original decide function. The second element of the decide2 function is a complete future-history: a list of the expected states given what the robot thinks it will decide.

The Utility2 function also returns a tuple. The new element is the same state-plan as the decide2 function. In this case, the Utility2 function re-values each of the proposed future states using the value function of the current state. If a predicted version of a robots code makes decisions using a different value system, all the robot cares about is if its decisions lead to valuable states in its current value system.

Wirehead defense isn't wirehead immunity

The adapted decision algorithm described above will avoid wireheading when wireheading obviously results in lower utilities. It will not avoid doing all behaviors that a human might think of as wireheading. It may choose to do the equivalent of heroin if the risk of addiction is low, or if the potential gain (as measured using the current utility function) is high.

The above algorithm also won't stop wireheading if the robot gets tricked into it. As long as the algorithm can "see it coming" in some sense, it will attempt to avoid it. To see it coming, the algorithm needs to have access to its own code. It also needs to be able to modify a representation of its own code and simulate the modifications. There are some circumstances in which we may not want the robot to simulate arbitrary changes to its value function.

In the worst possible case, an attacker could arrange a situation in which the robot has the opportunity to change its value function in some complicated way. The attacker may be able to propose a clever value function that, if simulated, executes arbitrary code on the robot. The risk for this seems higher for more complicated value functions. There are ways to mitigate this risk, but it's not something to take lightly.