In praise of Ad Hominems

Ad hominems get a bad rap.

Specifically, there are instances where knowing that the person who thought up an idea has certain flaws is very useful in evaluating the idea.

In the best case scenario, I can evaluate every argument I hear on its own merits. Unfortunately, I’m often too busy to put enough time into every argument that I hear. I might just read enough of an argument to get the gist, and then move on to the next thing I’m interested in. This has bitten me a few times.

If I know that the author of an article is intellectually sloppy, that actually helps me quite a bit when it comes to evaluating their arguments. I’ll put more time into an article they’ve written, because I now feel that it’s more important to evaluate it for myself.

If I know more specifically that an author doesn’t understand supply and demand (or whatever), then that tells me exactly what parts of their argument to home in on for more verification.

The general case of just dismissing an argument because the person making it has some flaw does still seem bad to me. It makes sense to know what kind of person is giving the argument, because that can point you at places that the argument may be weakest. This allows you to verify more quickly whether you think the argument itself is right.

Ad hominems shouldn’t end an argument, but they can be a useful argument direction-finder.

Seeing problems coming

I’ve written a lot about agent models recently. The standard expectation maximization method of modeling agents seems like it’s subject to several weaknesses, but there also seem to be straightforward approaches to dealing with those weaknesses.

1. To prevent wireheading, the agent needs to understand its own values well enough to predict changes in them.
2. To avoid creating an incorrigible agent, the agent needs to be able to ascribe value to its own intentions.
3. To prevent holodeck addiction, an agent needs to understand how its own perceptions work, and predict observations as well as outcomes.
4. To prevent an agent from going insane, the agent must validate its own world-model (as a function of the world-state) before each use.

The fundamental idea in all of these problems is that you can’t avoid a problem that you can’t see coming. Humans use this concept all the time. Many people feel uncomfortable with the idea of wireheading and insanity. This discomfort leads people to take actions to avoid those outcomes. I argue that we can create artificial agents that use similar techniques.

The posts linked above showed some simple architecture changes to expectation maximization and utility function combinations. The proposed changes mostly depend on one tool that I left unexplored: representing the agent in its own model. The agent needs to be able to reason about how changes to the world will affect its own operation. The more fine-grained this reasoning can be, the more the agent can avoid the above problems.

Some requirements of the world-model of the agent are (a rough code sketch follows this list):

  • must include a model of the agent’s values
  • must include all parts of the world that we care about
  • must include the agent’s own sensors and sense methods
  • must include the agent’s own thought processes
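
As a loose illustration only, those requirements could be collected into a single structure. Everything below is a placeholder of my own choosing rather than a settled design:

from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class WorldModel:
    # the agent's own values, represented explicitly so that changes
    # to them can be predicted and evaluated
    values: Callable[[Any], float]
    # the parts of the external world that we care about
    environment: Dict[str, Any]
    # how raw sense data is produced from the environment, so the agent
    # can reason about its sensors being altered or fooled
    sensor_model: Callable[[Dict[str, Any]], Any]
    # a representation of the agent's own decision procedure, so the agent
    # can predict how changes to itself change its future behavior
    decision_procedure: Callable[["WorldModel", Any], Any]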

This is a topic that I’m not sure how to think about yet. My learning focus for the next while is going to shift to how models are learned (e.g. through reinforcement learning) and how agent self-reflection is currently modeled.

Agent Insanity

The wireheading and holodeck problems both present ways an agent can intervene on itself to get high utility without actually fulfilling its utility function.

In wireheading, the agent adapts its utility function directly so that it returns high values. In the holodeck problem, the agent manipulates its own senses so that it thinks it’s in a high value state.

Another way that an agent can intervene on itself is to manipulate its model of the world, so that it incorrectly predicts high valued states even given valid observations. I’ll refer to this type of intervention as inducing insanity.

Referring again to the decision theoretic model, agents predict various outcomes for various actions, and then evaluate how much utility they get for an action. This is represented symbolically as p(state-s, a -> o; x)*Utility(o). The agent iterates through this process for various options of action and outcome, looking for the best decision.
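
As a very simplified sketch of that loop in code (the names here are mine: model plays the role of p(state-s, a -> o; x) and utility plays the role of Utility(o)):

def best_action(state, actions, outcomes, model, utility):
    # evaluate each action by its expected utility and keep the best one
    best, best_value = None, float("-inf")
    for a in actions:
        expected = sum(model(state, a, o) * utility(o) for o in outcomes)
        if expected > best_value:
            best, best_value = a, expected
    return best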

Insanity occurs whenever the agent attempts to manipulate its model of the world, p(state-s, a -> o; x), in a way that is not endorsed by the evidence the agent has. We of course want the agent to change its model as it makes new observations of the world; that’s called learning. We don’t want the agent to change its model just so it can then have a high reward.

Insanity through recursive ignorance

Consider an agent with a certain model of the world that is faced with a decision whose result may make its model insane. Much like the wireheading problem, the agent simulates its own actions recursively to evaluate the expected utility of a given action. In that simulation of actions, one of those actions will be the one that degrades the agent’s model.

If the agent is unable to represent this fact in its own simulation, then it will not be able to account for it. The agent will continue to make predictions about its actions and their outcomes under the assumption that the insanity-inducing act has not compromised it. Therefore the agent will not be able to avoid degrading its prediction ability, because it won’t notice it happening.

So when recursing to determine the best action, the recursion has to adequately account for changes to the agent’s model. Symbolically, we want to use p'(state-s, a -> o; x) to predict outcomes, where p’ may change at each level of the recursion.

Predicting your decision procedure isn’t enough

Mirroring the argument in wireheading, just using an accurate simulated model of the agent at each step in the decision recursion will not save the agent from insanity. If the agent is predicting changes to its model and then using changed models uncritically, that may only make the problem worse.

The decision theory algorithm assumes that the world-model the agent has is accurate and trustworthy. We’ll need to adapt the algorithm to account for world-models that may be untrustworthy.

The thing that makes this difficult is that we don’t want to limit changes to the world-model too much. In some sense, changing the world-model is the way that the agent improves. We even want to allow major changes to the world-model, like perhaps switching from a neural network architecture to something totally different.

Given that we’re allowing major changes to the world-model, we want to be able to trust that those changes are still useful. Once we predict a change to a model, how can we validate the proposed model?

Model Validation

One answer may be to borrow from the machine learning toolbox. When a neural network learns, it is tested on data that it hasn’t been trained on. This dataset, often called a validation set, tests that the network performs well and helps to avoid some common machine learning problems (such as overfitting).

To bring this into the agent model question, we could use the observations that the agent has made to validate the model. We would expect the model to support the actual observations that the agent has made. If a model change is predicted, we could run the proposed model on past observations to see how it does. It may also be desirable to hold out certain observations from the ones generally used for deciding on actions, in order to better validate the model itself.
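
As a rough sketch of that idea (the log-likelihood test and the threshold are assumptions of mine, not a worked-out criterion):

import math

def validate_model(proposed_model, held_out_observations, threshold=0.01):
    # held_out_observations is a list of (prior_state, action, observed_outcome)
    # triples that were not used when deciding on actions
    total_log_prob = 0.0
    for prior_state, action, observed_outcome in held_out_observations:
        p = proposed_model(prior_state, action, observed_outcome)
        if p <= 0.0:
            return False  # the proposed model says what we actually saw was impossible
        total_log_prob += math.log(p)
    # require that, on average, the model gave the held-out data reasonable probability
    average_prob = math.exp(total_log_prob / len(held_out_observations))
    return average_prob >= threshold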

In the agent model formalism, this might look like:

function decide(state-s):
  max_a = 0
  for a in available actions:
    utility(a) = 0
    for outcome o in possible outcomes:
      if not valid_model(state-s, x):
        utility(a) += Utility(insanity)
      else:
        utility(a) += p(state-s, a -> o; x)*Utility(o)
    end for
    if (max_a == 0 or (utility(a) > utility(max_a)))
      max_a = a
    end if
  end for
  return action max_a

function transition(old_state, action_a):
  return new_state obtained by taking action_a in old_state;

function Utility(test_state):
  if test_state == insanity:
    return value(insanity) // some low value

  current_value = value(test_state)
  future_value = value(transition(test_state, decide(test_state)))
  return (current_value + future_value)

In this formalism, we check to see if the model is sane each time before we use it. The valid_model function determines if the model described in state-s is valid given the observations x.

Creating a function that can validate a model given a world state is no easy problem. The validation function may have to deal with unanticipated model changes, models that are very different than the current one, and models that operate using new ontologies.

It’s not totally clear how to define such a validation function, and if we could, that may solve most of the strong AI problem in the first place.

If we don’t care about strong improvements to our agent, then we may be able to write a validation function that disallows almost all model changes. By allowing only a small set of understandable changes, we could potentially create agents that we could be certain would not go insane, at the cost of being unable to grow significantly more sane than they start out. This may be a cost we want to pay.
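
A sketch of that restrictive version might be nothing more than a whitelist check. The change descriptions and the describe_changes helper below are hypothetical:

# only a small, hand-audited set of model changes is ever allowed
ALLOWED_CHANGES = {
    "update_transition_probabilities",  # ordinary learning from new observations
    "refine_existing_state_variable",   # added precision within the same ontology
}

def valid_model(current_model, proposed_model):
    # describe_changes is a hypothetical helper that lists how the two models differ
    changes = describe_changes(current_model, proposed_model)
    return all(change in ALLOWED_CHANGES for change in changes)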

The holodeck problem

The holodeck problem is closely related to wireheading. While wireheading directly stimulates a reward center, the holodeck problem occurs when an agent manipulates its own senses so that it observes a specific high value scenario that isn’t actually happening.

Imagine living in a holodeck in Star Trek. You can have any kind of life you want; you could be emperor. You get all of the sights, smells, sounds, and feels of achieving all of your goals. The problem is that the observations you’re making don’t correlate highly with the rest of the world. You may observe that you’re the savior of the human race, but no actual humans have been saved.

Real agents don’t have direct access to the state of the world. They don’t just “know” where they are, or how much money they have, or whether there is food in their fridge. Real agents have to infer these things from observations, and their observations aren’t 100% reliable.

In a decision agent sense, the holodeck problem corresponds to the agent manipulating its own perceptions. Perhaps the agent has a vision system, and it puts a picture of a pile of gold in front of the camera. Or perhaps it just rewrites the camera driver, so that the pixel arrays returned show what the agent wants.

If you intend on making a highly capable agent, you want to be able to ensure that it won’t take these actions.

Decision Theoretic Observation Hacking

A decision theoretic agent attempts to select actions that maximize its utility, based on what effect it expects those actions to have. It evaluates the equation p(state-s, a -> o; x)U(o) for the various actions (a) that it can take.

As usual, U(o) is the utility that the agent ascribes to outcome o. The agent models how likely outcome o is to happen based on how it thinks the world is arranged right now (state-s), what actions are available to it (a), and its observations of the world in the past (x).

The holodeck problem occurs if the agent is able to take actions (a) that manipulate its future observations (x). Doing so changes the agent’s future model of the world.

Unlike in the wireheading problem, an agent that is hacking its observational system still values the right things. The problem is that it doesn’t understand that the changes it is making are not impacting the actual reward you want the agent to optimize for.

We don’t want to “prevent” an agent from living in a holodeck. We want an agent that understands that living in a holodeck doesn’t accomplish its goals. This means that we need to represent the correlation of its sense perceptions with reality as a part of the agent’s world-model M.

The part of the agent’s world-model that represents its own perceptual-system can be used to produce an estimate of the perceptual system’s accuracy. Perhaps it would produce some probability P(x|o), the probability of the observations given that you know the outcome holds. We would then want to keep P(x|o) “peak-y” in some sense. If the agent gets a different outcome, but its observations are exactly the same, then its observations are broken.

We don’t need to have the agent explicitly care about protecting its perception system. Assuming the model of the perception system is accurate, an agent that is planning future actions (by recursing on its decision procedure) would predict that entering a holodeck would cause the P(x|o) to become almost uniform. This would lower the probability that it ascribes to high value outcomes, and thus be a thing to avoid.

The agent could be designed such that it is modeling observations that it might make, and then predicting outcomes based on observations. In this case, we’d build p(state-s, a -> o; x) such that the predictions of the world-model M^{a\rightharpoonup} are predictions over observations x. We can then calculate the probability of an outcome o given an observation x using Bayes’ Theorem:

P(o|x) = \frac{P(x|o)P(o)}{P(x)}.

In this case, the more correlated an agent believes its sensors to be, the higher the probabilities it will assign to specific outcomes.
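
A tiny numerical example (the numbers are invented) shows why: with working sensors the observation moves the posterior strongly toward the true outcome, while in a holodeck P(x|o) is nearly flat and the observation tells the agent almost nothing.

def posterior(p_x_given_o, prior):
    # Bayes' theorem for a fixed observation x over a list of outcomes
    p_x = sum(px * po for px, po in zip(p_x_given_o, prior))
    return [px * po / p_x for px, po in zip(p_x_given_o, prior)]

prior = [0.5, 0.5]  # outcomes: [humans actually saved, humans not saved]

# working sensors: the observation is far more likely when the outcome is real
print(posterior([0.9, 0.1], prior))  # -> [0.9, 0.1]

# holodeck: the observation looks the same either way, so it carries no information
print(posterior([0.9, 0.9], prior))  # -> [0.5, 0.5]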

Potential issues with this solution

Solving the holodeck problem in this way requires some changes to how agents are often represented.

1. The agent’s world-model must include the function of its own sensors.
2. The agent’s predictions of the world should predict sense-perceptions, not outcomes.
3. On this model, outcomes may still be worth living out in a holodeck if they are high enough value to make up for the low probability that they have of existing.

In order to represent the probability of observations given an outcome, the agent needs to know how its sensors work. It needs to be able to model changes to the sensors, the environment, and its own interpretation of the sense data, and generate P(o|x) from all of this.

It’s not yet clear to me what all of the ramifications of having the agent’s model predict observations instead of outcomes are. That’s definitely something that also needs to be explored more.

It is troubling that this model doesn’t prevent an agent from entering a holodeck if the holodeck offers observations that are in some sense good enough to outweigh the loss in predictive utility of the observations. This is also something that needs to be explored.

Safely Combining Utility Functions

Imagine you have two utility functions that you want to combine: U_1(s) : S_1 \rightarrow \mathbb{R} and U_2(s) : S_2 \rightarrow \mathbb{R}

In each case, the utility function is a mapping from some world state to the real numbers. The mappings do not necessarily pay attention to all possible variables in the world-state, which we represent by using two different domains, each a subset of the full world state (S_1, S_2 \subset S_w). By S_w we mean everything that could possibly be known about the universe.

If we want to create a utility function that combines these two, we may run into two issues:

1. The world sub-states that each function “pays attention to” may not be the same (S_1 \neq S_2).
2. The range of the functions may not be compatible. For example, a utility value of 20 from U_1 may correspond to a utility value of 118 from U_2.

Non-equivalent domains

If we assume that the world states for each utility function are represented in the same encoding, then the only way for S_1 \neq S_2 is if there are some dimensions, some variables in S, that are represented in one sub-state representation but not the other. In this case, we can adapt each utility function so that they share the same domain by adding the unused dimensions to each utility function.

As a concrete example, observe the following utility functions:

U_1(r) : n red marbles \rightarrow n
U_2(b) : m blue marbles \rightarrow 10m

These can be adapted by extending the domain as follows:

U_1(r,b) : n red marbles, m blue marbles \rightarrow n
U_2(r,b) : n red marbles, m blue marbles \rightarrow 10m

These two utility functions now share the same domain.

Note that this is not a procedure that can be done without outside information. Just looking at the original utility functions doesn’t tell you what those sub-utility functions would prefer given an added variable. The naive case is that the utility functions don’t care about that other variable, but we’ll later see examples where that isn’t what we want.

Non-equivalent valuations

The second potential problem in combining utility functions is that the functions you’re combining may represent values differently. For example, one function’s utility of 1 may be the same as the other’s utility of 1000. In simple cases, this can be handled with an affine transformation.

As an example, suppose that from our perspective U_2 should be valued at only 2 times U_1, instead of the 10 times shown above. One of the ways that we can adapt this is by setting U_{2a}(r,b) = \frac{1}{5}U_2(r,b).

Note that non-equivalent valuations can’t be solved by looking only at the utility functions. We need to appeal to some other source of value to know how they should be adapted. Basically, we need to know why the specific valuations were chosen for those utility functions before we can adapt them so that they share the same scale.

This may turn out to be a very complicated transformation. We can represent it in the general case using arbitrary functions f_1(.) and f_2(.).

Combining Utility Functions

Once we have our utility functions adapted so that they use the same domain and valuation strategy, we can combine them simply by summing them.

U_c(r,b) = f_1(U_1(r,b)) + f_2(U_2(r,b))

The combined utility function U_c(r,b) will cause an agent to pursue both of the original utility functions. The domain extension procedure ensures that the original utility functions correctly account for what the new state is. The valuation normalization procedure ensures that the original utility functions are valued correctly relative to each other.
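
Putting the marble example together in code (r and b are the counts of red and blue marbles; the 1/5 factor is the rescaling chosen above):

def u1(r, b):            # originally a function of red marbles only
    return r

def u2(r, b):            # originally a function of blue marbles only
    return 10 * b

def f1(u):               # valuation adjustment for U_1: leave it alone
    return u

def f2(u):               # valuation adjustment for U_2: scale by 1/5
    return u / 5

def u_combined(r, b):
    return f1(u1(r, b)) + f2(u2(r, b))

# with this scaling, a blue marble ends up worth twice a red one
assert u_combined(1, 0) == 1
assert u_combined(0, 1) == 2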

A more complicated case

Let’s say that you now want to combine two utility functions in a more complex way. For example, let’s say you have two utility functions that use the same valuation and domain:

U_a(n) = n
U_b(n) = -n

Let’s say our world is such that n corresponds to a location on a line, and n \in \{-2, -1, 0, 1, 2\}. One of the utility functions incentivizes an agent to move up the line, the other incentivizes the agent to move down the line. These utility functions clearly have the same domain, and we’re assuming they have the same valuation metric. But if we add them up we have utility 0 everywhere.

To combine these, we may wish to introduce another world-state variable (say s for switch). If s == 1 then we want to use U_a(n), and if s == 0 then we want to use U_b(n). You could think of this as “do something when I want you to, and undo it if I press the button.”

One way that we could do this is to extend each utility function to include the new state variable, and set the utility of the function to 0 in the half of the new domain that we don’t want it to be active. To do this, we could create:

U_a'(s, n) = n if (s==1) else 0
U_b'(s, n) = -n if (s==0) else 0

When we sum these adapted utility functions, we find that we have a nice utility function that incentivizes the agent to move towards 2 if the switch is on and to move towards -2 if the switch is off.

U_{ab}' = U_a'(s,n) + U_b'(s,n)

Except that there’s a pathological case hiding out here. What if the agent can control the state of the switch?

In that case, an agent that finds itself starting out at state (n=2, s=0) may just flip the switch rather than moving.
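
A quick sketch of the numbers behind that pathology, following the definitions above: from (n=2, s=0), flipping the switch scores better than taking a step toward -2.

def u_a_prime(s, n):
    return n if s == 1 else 0

def u_b_prime(s, n):
    return -n if s == 0 else 0

def u_ab_prime(s, n):
    return u_a_prime(s, n) + u_b_prime(s, n)

print(u_ab_prime(0, 2))  # stay put:             -2
print(u_ab_prime(0, 1))  # step toward -2:       -1
print(u_ab_prime(1, 2))  # just flip the switch:  2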

Agents aren’t allowed to throw the switch

The behavior of an agent with utility function U_{ab}' is clearly not what we want. To fix it we have a few options:

1. change how we combine its components U_a'(s,n) and U_b'(s,n), perhaps by weighting them in some way
2. introduce one or more other component utility functions
3. change the state space upon which the functions operate

Weightings on U_a'(s,n) and U_b'(s,n)

Many of the attempts to change the weights of the component functions will lead to unintended consequences.

If you start the agent in state (n=-2, s=1), then you can incentivize it not to press the button by weighting the corresponding component utility function highly, for example use 100 + U_a'(s,n).

But this just makes the reverse problem worse. If the agent reaches (n=2, s=1) and an operator decides to flip the switch to cause the agent to go in reverse, then the agent now has even more incentive to flip the switch back.

MIRI’s corrigibility paper proves a theorem that is analogous to this (Theorem 1 in the paper).

Including “correction” component utility functions

A related approach (also from MIRI’s corrigibility paper) to preventing the agent from actuating the switch on its own is called utility indifference. In this formulation, the combined utility function of the agent includes a third utility function, U_{corr}(s,n), whose only purpose is to make the agent agnostic to flipping the switch. It does this by exactly counteracting any utility that the agent may gain by changing the state of the switch. This function therefore depends on the full state.

U_i(s,n) = U_a'(s,n) + U_b'(s,n) + U_{corr}(s,n)

In the 1-dimensional line problem described above, the correction utility function would be defined as:

U_{corr}(s,n) = 0 if s remains the same,
U_{corr}(s,n) = sign(n)2n if s == 0 and is about to change, and
U_{corr}(s,n) = -sign(n)2n if s == 1 and is about to change

This correction function makes the value of flipping the switch exactly equivalent to the value of the current state, so the agent will be indifferent about it. In this example, there’s always a better thing to do (unless you’re at an end of the line).

The correction function as described in MIRI’s paper also has a dependence on other timesteps (it’s not Markov). Furthermore, the paper describes some serious issues with it.

Adjusting the state space of the component utility functions

By adjusting the state space of the component utility functions, we can define a utility function that correctly values what we think we value.

Ultimately, what we care about is that the agent does not change the switch state itself. If we can represent this as a part of the world state, then we can do a domain extension on the original utility components.

Let i be a variable representing, roughly “the agent will pursue a strategy to change the state of the switch”. Then we can construct new utility components as follows:

U_a''(i, s, n) = U_a'(s, n) if i == 0 else -1000
U_b''(i, s, n) = U_b'(s, n) if i == 0 else -1000
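
Restating the component functions for completeness, a sketch of these extended utilities shows the effect: any state in which the agent intends to change the switch itself is heavily penalized (the -1000 figure is just the one used in the definitions above).

def u_a_prime(s, n):
    return n if s == 1 else 0

def u_b_prime(s, n):
    return -n if s == 0 else 0

def u_a_double_prime(i, s, n):
    return u_a_prime(s, n) if i == 0 else -1000

def u_b_double_prime(i, s, n):
    return u_b_prime(s, n) if i == 0 else -1000

def u_total(i, s, n):
    return u_a_double_prime(i, s, n) + u_b_double_prime(i, s, n)

# moving down the line with no intention of touching the switch beats
# any strategy that involves flipping the switch yourself
print(u_total(0, 0, 1))  # -1
print(u_total(1, 1, 2))  # -2000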

If we further care that the agent doesn’t do anything to tamper with the switch, or to manipulate people into treating the switch in one way or another, these cares can be dealt with in the same way. Construct a world-state representation that allows the agent to model its own impact, and then correctly domain extend the component utility functions.

To a large extent, this passes the buck from creating good value functions to determining how an agent can create intentional models of itself. I think this is a good change in perspective for two reasons.

1. Changing the domain of the utility function accurately captures what we care about. If we’re attempting to adjust weights on the original utility functions, or add in compensating utility functions, then we are in some sense attempting to smuggle in a representation of the world that’s not contained in our original world-state. We actually do care about whether the agent has an intention of flipping the switch. The only reason not to make the agent care about that also is if it’s not feasible to do so.

2. Figuring out how to get an agent to model its own intentions is a problem that people are already working on. The actual problem of representing an agent’s intention to flip the switch reminds me of one-boxing on Newcomb’s problem, and I’m curious to explore that more. Using an agent’s representation of itself as part of its world model seems intuitively more tractable to me.

The main question left is “how do you create a utility function over the beliefs of the agent?”

Wireheading Defense

I once talked to somebody about doing heroin. I’ve never done it, and I was curious what it felt like. This person told me that heroin gave you the feeling of being in love; that it was the best feeling he’d ever felt.

Hearing that did not make me want to do heroin more, even though I believed that it would cause me to feel such a great feeling. Instead, I became much more concerned about not letting myself give in to the (admittedly slight) possibility that I might try it.

When I thought about trying it, I had a visceral reaction against it. The image that popped into my mind was myself, all alone in feeling love, ignoring the people that I actually loved. It was an image of being disconnected from the world.

Utility Functions

Utility functions form a large part of agent modeling. The idea is that if you give a rational agent a certain utility function, the agent will then act as though it wants what the utility function says is high value.

A large worry people have about utility functions is that some agent will figure out how to reach inside its own decision processes, and just tweak the number for utility to maximum. Then it can just sit back and do nothing, enjoying the sensation of accomplishing all its goals forever.

The term for this is wireheading. It evokes the image of a human with a wire in their brain, electrically stimulating the pleasure center. If you did this to someone, you would in some sense be destroying what we generally think of as the best parts of a person.

People do sometimes wirehead (in the best way they can manage now), but it’s intuitive to most people that it’s not good. So what is it about how humans think about wireheading that makes them relatively immune to it, and allows them to actively defend themselves from the threat of it?

If I think about taking heroin, I have a clear sense that I would be making decisions differently than I do now. I predict that I would want to do heroin more after taking it than before, and that I would prioritize it over things that I value now. None of that seems good to me right now.

The thing that keeps me from doing heroin is being able to predict what a heroin-addicted me would want, while also being able to say that is not what I want right now.

Formalizing Wirehead Defense

Consider a rational decision maker who uses expectation maximization to decide what to do. They have some function for deciding on an action that looks like this:

function decide(state-s):
  max_a = 0
  for a in available actions:
    utility(a) = 0
    for outcome o in possible outcomes:
      utility(a) += p(state-s, a -> o)*Utility(o)
    end for
    if (max_a == 0 or (utility(a) > utility(max_a)))
      max_a = a
    end if
  end for
  return action max_a

The decider looks at all the actions available to them given the situation they’re currently in, and chooses the action that leads to the best outcome with high probability.

If the decider is making a series of decisions over time, they’ll want to calculate their possible utility recursively, by imagining what they would do next. In this case, the utility function would be something like:

function transition(old_state, action_a):
  return new_state obtained by taking action_a in old_state;

function Utility(test_state):
  current_value = value(test_state)
  future_value = value(transition(test_state, decide(test_state)))
  return (current_value + future_value)

The transition function simulates taking an action in a given situation, and then returns the resulting new situation.

In the Utility function, the overall utility is calculated by determining the value of the current situation plus the value of the next situation as predicted by the decide() function.

To determine the value of a situation, the value() call just returns the observed value of the current world state. It may be a table of (situation, value) pairs or something more complicated.

In this way, we figure out what utility we get by seeing what the value is on the exact next step, and adding to it the expected value for subsequent steps. This process could recursively call itself forever, so in practice there would be either a recursion depth limit or some stopping criterion in the states being tested.

This recursion can be thought of as the robot simulating its own future actions.
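
A minimal sketch of that depth-limited recursion, assuming helper functions actions, outcomes, p, value, and transition with the same roles as in the pseudocode above:

def decide(state, depth):
    # pick the action whose outcomes have the best expected utility,
    # looking `depth` simulated steps ahead
    return max(actions(state),
               key=lambda a: sum(p(state, a, o) * utility(o, depth)
                                 for o in outcomes(state, a)))

def utility(state, depth):
    # value of this state plus the value of the future the agent expects,
    # cut off after `depth` further simulated decisions
    if depth == 0:
        return value(state)
    return value(state) + utility(transition(state, decide(state, depth - 1)),
                                  depth - 1)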

The wireheading threat appears if we find a state or set of states in the future that provide high utility as calculated by this function but don’t correspond to a high utility in the outside world (perhaps as determined by the designers of this robot).

In the traditional story, the robot finds a way to reach into its own code and tweak the value() function so that it returns only high numbers. Basically, it performs brain surgery on itself.

To consider a robot defensive against wireheading, we would want it to assign low utility to performing brain surgery on itself, even while it understands that it would later achieve very high self-reported utility.

Approaching a wirehead decision

Let’s say that the above algorithm is computing a policy for future actions, and it comes to consider an action that would result in what outside observers would call wireheading. Maybe it is considering changing a line of its own code, or taking heroin, or submitting to brain surgery. What is the above algorithm actually doing in that case?

To get to this point, the robot must have called the function “decide(s)” on a state where it is not currently wireheaded. In the course of figuring out its next action, the robot will consider an action that changes the robot itself in some way.

The line “utility(a) += p(state-s, a -> o)*Utility(o)” calculates the probability that the action would lead to the outcome, then multiplies it by the utility of the outcome. In this case the action is brain surgery and the outcome is having a new “value()” function.

Whether or not this is a good plan depends on the “Utility(o)”, which will just recursively call the “decide(o)” function again to find future value.

The crucial point here is that when “decide(o)” is called, the state “o” is such that a different type of decision making is now happening. Now, instead of simulating its own future actions, the robot should be simulating the actions of itself with a different program running.

Not much has been said up to now about what this “state” thing is. In some sense, it represents everything the robot knows about the world. Where objects are, what they are, how physics works, etc.

What if the robot doesn’t consider its own state?

If the robot does not consider its own code (and other features) as a part of the state of the world, then the wireheading action would not clearly modify the world that the robot knows about. The decision algorithm would keep on predicting normal behavior after the wireheading had occurred: “sure you had brain surgery, but you still think the same way right?”

In this case, the robot may choose to wirehead because its decision algorithm calculated that it would be useful in some normal way. Once the wireheading had been done, the robot would then be making decisions using a different algorithm. The wireheaded robot would stop pursuing the plan that the original robot had been pursuing up to the point of being wireheaded, and begin to pursue whatever plan the wireheaded version of itself espoused.

This is equivalent to how humans get addicted to drugs. Few (no?) humans decide that being addicted to heroin would be great. Instead, heroin seems like a way to achieve a goal the human already has.

People may start taking heroin because they want to escape their current situation, or because they want to impress their friends, or because they want to explore the varieties of human consciousness.

People keep taking heroin because they are addicted.

What if the robot does consider its own state?

If the robot considers its own state, then when it recurses on the “decide(o)” it will be able to represent the fact that its values would have changed.

In the naive case, it runs the code exactly as listed above with an understanding that the “value()” function is different. In this case, the new “value()” function is reporting very high numbers for outcomes that the original robot wouldn’t. If the wireheading were such that utility was now calculated as some constant maximum value, then every action would be reported to have the same (really high) utility. This makes the original robot more likely to choose to wirehead.

So simply changing the “value()” function makes the problem worse and not better.

This would be equivalent to thinking about heroin, realizing that you’ll get addicted and really want heroin, and deciding that if future you wants heroin that you should want it too.

So considering changes to its own software/hardware isn’t sufficient. We need to make a few alterations to the decision process to make it defensive against wireheading.

The difference between “what you would do” and “what future-you would do”

The problem with not taking into account a preference change after wireheading is that the robot would incorrectly predict its post-wirehead actions.

The problem with just packaging robot preferences in with the world-state of the prior algorithm is that, even though the robot is then able to correctly predict future actions, the valuations aren’t consistent. A wireheaded robot takes the actions it thinks are highest utility, it just happens to be choosing actions the original would think were terrible.

In order to defend against wireheading, you need to:

1. accurately predict what a future (wireheaded) version of yourself would do
2. determine a value of future states that depends only on your current utility function

To get item 2 without sacrificing item 1, we’re going to adapt our decision algorithm slightly.

function decide2(state-s):
  max_a = 0
  max_plan = 0
  for a in available actions:
    utility(a) = 0
    for outcome o in possible outcomes:
      (o_value, o_plan) = Utility2(o)
      utility(a) += p(state-s, a->o)*o_value
    end for
    if (max_a == 0 or (utility(a) > utility(max_a)))
      max_a = a
      max_plan = o_plan
    end if
  end for
  return (max_a, [state-s, max_plan])

function Utility2(test_state):
  current_value = test_state.value(test_state)
  (next_action, state_plan) = decide2(test_state)

  future_value = 0
  for state in state_plan:
    future_value += test_state.value(state)

  test_state_utility = (current_value + future_value)

  return (test_state_utility, state_plan)

In this case, the decide2 function returns a tuple. The first element of the tuple is the next action to take. That’s the same as the only return value in the original decide function. The second element of the tuple is a complete future-history: a list of the expected states given what the robot thinks it will decide.

The Utility2 function also returns a tuple. The new element is the same state-plan that decide2 returns. In this case, the Utility2 function re-values each of the proposed future states using the value function of the current state. If a predicted version of a robot’s code makes decisions using a different value system, all the robot cares about is whether its decisions lead to valuable states in its current value system.
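
A toy comparison (all states, actions, and numbers here are invented for illustration) shows why this matters: a naive evaluator scores the predicted post-surgery future with the post-surgery value function and so prefers wireheading, while the defensive evaluator scores the same predicted future with its current value function.

def current_value(state):
    return {"idle": 0, "goal_achieved": 10, "wireheaded": 0}[state]

def wireheaded_value(state):
    return 1000  # the post-surgery value function reports bliss for everything

# action -> (predicted future states, value function the future self would use)
PLANS = {
    "work_toward_goal": (["goal_achieved"], current_value),
    "brain_surgery":    (["wireheaded"],    wireheaded_value),
}

def naive_utility(action):
    # accurately predicts the future, but scores it with the future self's values
    states, future_value = PLANS[action]
    return sum(future_value(s) for s in states)

def defensive_utility(action):
    # same prediction, but scored with the current value function
    states, _ = PLANS[action]
    return sum(current_value(s) for s in states)

print(max(PLANS, key=naive_utility))      # brain_surgery
print(max(PLANS, key=defensive_utility))  # work_toward_goal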

Wirehead defense isn’t wirehead immunity

The adapted decision algorithm described above will avoid wireheading when wireheading obviously results in lower utilities. It will not avoid doing all behaviors that a human might think of as wireheading. It may choose to do the equivalent of heroin if the risk of addiction is low, or if the potential gain (as measured using the current utility function) is high.

The above algorithm also won’t stop wireheading if the robot gets tricked into it. As long as the algorithm can “see it coming” in some sense, it will attempt to avoid it. To see it coming, the algorithm needs to have access to its own code. It also needs to be able to modify a representation of its own code and simulate the modifications. There are some circumstances in which we may not want the robot to simulate arbitrary changes to its value function.

In the worst possible case, an attacker could arrange a situation in which the robot has the opportunity to change its value function in some complicated way. The attacker may be able to propose a clever value function that, if simulated, executes arbitrary code on the robot. The risk for this seems higher for more complicated value functions. There are ways to mitigate this risk, but it’s not something to take lightly.

The Woodcarver

Once there was a wood carver who lived at the edge of the village. He was the best wood carver for miles and miles, but he was also very clumsy. People would come to marvel at his carvings, and then giggle as he dropped his tools or spilled his coffee.

The wood carver didn’t mind the giggling. He had a fine life, and wanted for nothing. Nothing, that is, except a child.

One day, as he was walking the woods to find good stock, he came upon a mysterious stump. The stump glowed like a full moon in the brightest daylight. It was the most marvelous wood that the carver had ever seen, and he brought it back to his shop immediately.

For seven days and seven nights, the carver worked on the strange wood. When he was done, he looked in pride at a wooden boy. The carver was only a little surprised when the boy’s eyes opened, and the boy looked back at him.

But as the wood carver stared into the boy’s eyes he realized something. There was nothing within those eyes, no spark of recognition. The wooden boy was the blankest of blank slates.

The woodcarver wasn’t worried by this. He always thought he’d make a great father, and he set to the task with diligence. He taught the wooden boy how to move his arms, how to walk, how to talk. Finally, he taught the boy his most cherished knowledge: the carving of wood.

But even as a father, the carver was still very clumsy. He would demonstrate how to walk, only to trip over his own feet. He would try to show how to talk, only to mis-speak or mumble his words. Even at wood-carving, the carver would demonstrate a cut and drop his knife to the floor.

The wooden boy learned all these things. The boy learned to walk and to trip, to talk and to mumble, to carve and to drop tools. The boy was a very good student.

When the wood carver told the boy not to trip, the boy learned to say that you shouldn’t trip. Still the boy tripped, but now he seemed contrite about it.

The wood carver and his new son lived happily for many years. As the wood carver aged, he marveled that the boy did not.

There came a day when the wood carver had to be laid to rest in a box of his own design. The wooden boy cried, just as he had been taught. Then he went home and carved wood.

One day, many years later, the boy was gleaning in the woods for new carving stock. The boy came upon a strange and eerie stump. It glowed with the light of the full moon, even at the brightest part of the day. The wooden boy knew exactly what to do.

Mutual Information in a Causal Context

Mutual information is the idea that learning something about one variable might tell you about another. For example, learning that it’s daytime might give you information about whether the sun is shining. It could still be cloudy, but you can be more sure that it’s sunny than before you learned it was daytime.

Mathematically, mutual information is represented using the concept of entropy. The information gained about a variable X, assuming you learn Y, is given by: I(X;Y) = H(X) - H(X|Y)

In this case, H(.) is a measure of the entropy. It is given by H(X) = \sum_x p(x) \log_2(\frac{1}{p(x)})

Mutual information is supposed to be symmetric (I(X;Y) = I(Y;X)), but I’m interested in how that works in a causal context.

Let’s say you have a lightbulb that can be turned on from either of two light switches. If either light switch is on, then the bulb is on. Learning that one light switch is on tells you the bulb is on, but learning that the bulb is on does *not* tell you that one specific light switch is on. It tells you that at least one is on (but not which one).

Let’s assume for the sake of argument that each light switch has a probability p(on) = 0.25 of being turned on (and equivalently a probability p(off) = 0.75 of being off). Assume also that they’re independent.

The entropy of switch one is

H(S1) = p(on)\log_2(\frac{1}{p(on)}) + p(off)\log_2(\frac{1}{p(off)})
H(S1) = 1/4* \log_2(4) + 3/4 * \log_2(\frac{4}{3})
H(S1) = 0.811

Since either switch has a probability of 0.25 of being on, and they’re independent, the bulb itself has a probability of 7/16 of being on.

The entropy of the bulb is

H(B) = p(on)\log_2(\frac{1}{p(on)}) + p(off)\log_2(\frac{1}{p(off)})
H(B) = 7/16 * \log_2(\frac{16}{7}) + 9/16 * \log_2(\frac{16}{9})
H(B) = 0.989

If you know switch 1’s state, then the information you have about the light is given by

I(B;S1) = H(B) - H(B|S1)
I(B;S1) = H(B) - (3/4*H(B|S1=off) + 1/4*H(B|S1=on))
I(B;S1) = 0.989 - (3/4*0.811 + 1/4*0) = 0.380

If instead you know the bulb’s state, then the information you have about switch 1 is given by

I(S1;B) = H(S1) - H(S1|B)
I(S1;B) = H(S1) - (9/16*H(S1|B=off) + 7/16*H(S1|B=on))
I(S1;B) = 0.811 - (9/16*0 + 7/16 * 0.985) = 0.380

So even in a causal case the mutual information is still symmetric.

For me the point that helps give an intuitive sense of this is that if you know S1 is on, you know the bulb is on. Symmetrically, if you know the bulb is off, you know that S1 is off.
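
These numbers are easy to check numerically. A short script over the joint distribution (same 0.25 switch probabilities) reproduces the entropies and the symmetric mutual information:

from itertools import product
from math import log2

# joint distribution over (switch1, switch2, bulb): independent switches,
# each on with probability 0.25, and the bulb is on if either switch is on
joint = {}
for s1, s2 in product([0, 1], repeat=2):
    p = (0.25 if s1 else 0.75) * (0.25 if s2 else 0.75)
    joint[(s1, s2, s1 or s2)] = p

def marginal(var):
    index = {"S1": 0, "S2": 1, "B": 2}[var]
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def mutual_information(x, y):
    ix, iy = ({"S1": 0, "S2": 1, "B": 2}[v] for v in (x, y))
    pair = {}
    for key, p in joint.items():
        k = (key[ix], key[iy])
        pair[k] = pair.get(k, 0.0) + p
    px, py = marginal(x), marginal(y)
    return sum(p * log2(p / (px[a] * py[b])) for (a, b), p in pair.items() if p > 0)

print(entropy(marginal("S1")))        # ~0.811
print(entropy(marginal("B")))         # ~0.989
print(mutual_information("B", "S1"))  # ~0.380
print(mutual_information("S1", "B"))  # ~0.380, symmetric as expected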

Ontologies of Utility Functions

In his paper on the Value Learning Problem, Nate Soares identifies the problem of ontology shift:

Consider a programmer that wants to train a system to pursue a very simple goal: produce diamond. The programmers have an atomic model of physics, and they generate training data labeled according to the number of carbon atoms covalently bound to four other carbon atoms in that training outcome. For this training data to be used, the classification algorithm needs to identify the atoms in a potential outcome considered by the system. In this toy example, we can assume that the programmers look at the structure of the initial worldmodel and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis.

The programmer defined what they wanted in an ontology that their system no longer uses, so the programmer’s goals are now no longer relevant to what the system is actually interacting with.

To solve this problem, an artificial intelligence would have to notice when it is changing ontologies. In the story, the system knows about carbon as a logical concept, and then abandons the carbon concept when it learns about protons, neutrons, and electrons. On abandoning the concept of carbon (or any other concept), the system could re-evaluate its utility function to see if the change causes a new understanding of something within that utility function.

Intuitively, a system smart enough to say that carbon is actually made up of 6 protons could reflect on what impact such a discovery has on the utility function.

A more worrying feature of an ontology shift is that it implies that an AI may be translating its utility function into its current ontology. The translation operation is unlikely to be obvious, and may allow not just direct translation but also re-interpretation. The translated utility function may not be endorsed by the AI’s original programmer.

This is true even if the utility function is something nice like “figure out what I, your creator, would do if I were smarter, then do that.” The ontology that the agent uses may change, and what “your creator” and “smarter” mean may change significantly.

What we’d like to have is some guarantee that the utility function used after an ontology shift satisfies the important parts of the utility function before the shift. This is true whether the new utility function is an attempt at direct translation or a looser re-interpretation.

One idea for how to do this is to find objects in the new ontology that subjunctively depend upon the original utility function. If it can be shown that the new utility function and the old one are in some sense computing the same logical object, then it may be possible to trust the new utility function before it is put in place.

Grudge

Way back around 500 BC, the Athenians took part in a rebellion against the Persian King Darius. When Darius learned of it, he was furious. He was apparently so worried that he would forget to punish the Athenians that he had a servant remind him. Every evening at dinner, the servant was to interrupt him three times to say “remember the Athenians.”

There are a few people in my life that I’ve majorly changed my mind about. For most of them, I started off liking them quite a bit. Then I learned of something terrible that they’d done, or they said something very mean to me, and I stopped wanting to be friends with them.

Sometimes mutual friends have tried to intervene on their behalf. “Don’t hold a grudge,” they tell me.

I have to imagine that when people advise you not to hold a grudge, they’re imagining something like King Darius. If only Darius could stop reminding himself about the Athenian betrayal, he could forgive them and everything could go back to the way it was.

I don’t have calendar reminders to keep me from forgetting what people have done. I haven’t gone into my phone to delete anyone’s phone number.

For me, the situation is very different. I may be consciously angry with some transgression for a while, but that emotion dissipates over the course of a few days to a few weeks. What really sticks with me is not the feeling of anger. It’s the change in my model of what that person is likely to do.

When I think of spending time with someone, I have some sense of what that time would be like. If that sense seems good, then I’m excited to hang out with them. If it seems bad, then I’m not. That sense is based on a model of who that person is, and what hanging out with them will be like. It’s not an explicit rehearsal of past times, good or bad.

Models

I try to think about the people I know as being their own person. The sign to me that I know someone well is that I can predict what they’ll care about, give them gifts that they find fun or useful, tell jokes or stories tailor-made for them, and imagine their advice to me in a given situation.

The model that I have of someone impacts how much I choose to interact with them, and also in what ways I choose to interact with them.

I try to keep my model of a person up-to-date, since I know people change. Usually they change slowly, and I’m changing with them. Sometimes we grow closer as friends as we change.

Sometimes, I get new evidence about a person that dramatically changes my model of them. This is what it’s like for me if someone surprisingly treats me poorly. I get angry, then the anger fades and all that’s left is a changed model.

But there’s another thing that can change my models of people.

The way I think about people’s words and actions is filtered through how I think the world works. If my model for how the world works changes, then I might suddenly change how I view certain people. They haven’t done anything different than usual, but it now means a very different thing to me.

Forgiveness

When people tell me not to hold a grudge, I think that they want me to treat a person the way I treated them when I had an older model. This is impossible. I can’t erase the evidence that I now have about who they are as a person.

But the thing I need to keep in mind is that I can’t ever have all of the evidence necessary to know who another person is and what they’ll do. If someone screams at me over something, it’s very possible that they rarely yell and it was just a bad day for them. How do I incorporate that into my model?

This is where forgiveness comes in.

If someone does something that is really very bad to you, it may be the most salient feature of your model of them. The thing is, the other parts of your model of them are still valid.

Forgiveness is letting that vivid experience shrink to its proper size in your model. Depending on the event, that proper size may still be large. But by forgiving someone you give them the ability to change your model of them again. You’re letting them show you that they aren’t normally someone who would scream at you. You’re letting them show you that they have changed since then.

Forgiveness isn’t a thing that can be forced. The model that I have of a person isn’t a list that I keep in my head. My model of you isn’t some explicit verbal thing. It’s all the memories I have of you; it’s the felt sense that I get in my gut when I think of you. I can’t just decide that the felt sense is different now.

Forgiveness is a slow growing thing. I can choose to help it along, to feed it with thoughts of compassion and with evidence that my model may be off-base. But regardless of what I try to do, forgiveness takes time.

Apologies

If forgiveness is letting someone’s actions influence your model of them again, then it’s pretty clear that forgiveness isn’t all that is necessary.

In addition to me being willing to update my model of another person, they need to be giving me evidence of who they are. They need to be giving me information to refine my model of them again.

Apologies are one way of doing this. If someone says they’re sorry for something, then that’s some evidence (often weak) that they actually are different than an action made them seem. The best sort of apology then, is some kind of action that really brings home the fact that the person is different. It’s saying sorry, then acting in a way that prevents the transgression from happening again.

This also means that, in order for me to properly apologize to someone else, they need to actually be willing to hear me. Which is kind of a catch-22 in some ways.

I’ve definitely hurt some people with my words and actions in the past. There are a few people that just never want to talk with me again. That’s their right, but it means that I can’t properly apologize. They’ll never see the ways in which I have changed, and their model of me will remain stuck on a person that I’m not anymore.