Agent Insanity

The wireheading and holodeck problems both present ways an agent can intervene on itself to get high utility without actually fulfilling its utility function.

In wireheading, the agent modifies its utility function directly so that it returns high values. In the holodeck problem, the agent manipulates its own senses so that it believes it is in a high-value state.

Another way an agent can intervene on itself is to manipulate its model of the world so that it incorrectly predicts high-value states even given valid observations. I’ll refer to this type of intervention as inducing insanity.

Referring again to the decision-theoretic model, an agent predicts the outcomes of its available actions and then evaluates how much utility each action yields. Symbolically, the contribution of an outcome o to an action’s expected utility is p(state-s, a -> o; x)*Utility(o). The agent iterates through this process for each candidate action and outcome, looking for the best decision.
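
As a concrete, simplified sketch of that loop, here it is in Python. The function names and signatures are illustrative assumptions, not part of the formalism: p and utility stand in for whatever model and utility function the agent actually has.

# Minimal sketch of the expected-utility decision loop described above.
# p(state, action, outcome, x) and utility(outcome) are placeholders for
# the agent's actual model and utility function.
def decide(state, actions, outcomes, p, utility, x):
    best_action, best_value = None, float("-inf")
    for a in actions:
        # Expected utility of action a: sum over outcomes of
        # P(outcome | state, a; x) * Utility(outcome).
        expected = sum(p(state, a, o, x) * utility(o) for o in outcomes)
        if expected > best_value:
            best_action, best_value = a, expected
    return best_action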

Insanity occurs whenever the agent attempts to manipulate its model of the world, p(state-s, a -> o; x), in a way that is not endorsed by the evidence the agent has. We of course want the agent to change its model as it makes new observations of the world; that’s called learning. We don’t want the agent to change its model just so it can then have a high reward.
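
To make that distinction concrete, here is a toy Python contrast. The helpers update_on_evidence and expected_utility are hypothetical stand-ins; the point is only where the change to the model comes from.

# Toy contrast between learning and a motivated model change.
# update_on_evidence and expected_utility are hypothetical callables
# supplied by the agent; they are placeholders, not a real API.
def learn(model, new_observation, update_on_evidence):
    # Endorsed by evidence: the model changes only as a function of
    # what was actually observed.
    return update_on_evidence(model, new_observation)

def go_insane(candidate_models, state, x, expected_utility):
    # Not endorsed by evidence: a model is adopted because it makes the
    # agent's predicted utility look high, regardless of the observations x.
    return max(candidate_models, key=lambda m: expected_utility(m, state, x))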

Insanity through recursive ignorance

Consider an agent with a certain model of the world, faced with a decision whose result may render that model insane. Much as in the wireheading problem, the agent simulates its own actions recursively to evaluate the expected utility of a given action. Within that simulation, one of the actions considered is the one that degrades the agent’s model.

If the agent is unable to represent this fact in its own simulation, then it will not be able to account for it. The agent will continue to make predictions about its actions and their outcomes under the assumption that the insanity-inducing act has not compromised it. Therefore the agent will not be able to avoid degrading its prediction ability, because it won’t notice it happening.

So when recursing to determine the best action, the recursion has to adequately account for changes to the agent’s model. Symbolically, we want to use p’(state-s, a -> o; x) to predict outcomes, where p’ may change at each level of the recursion.
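
One way this could look as a sketch: a depth-limited recursion that threads the (possibly changed) model through each level. Here simulate_model_change is a hypothetical helper that predicts the model the agent would hold after taking an action, and transition(state, action, outcome) is a hypothetical function returning the resulting state; both are assumptions for illustration.

# Sketch of a decision recursion in which the agent's predictive model
# may change at each level.
def make_evaluator(actions, outcomes, utility, x, transition,
                   simulate_model_change):
    def evaluate(state, p, depth):
        if depth == 0:
            return 0.0
        best = float("-inf")
        for a in actions:
            # Predict how acting would alter the agent's own model: p -> p'.
            p_next = simulate_model_change(p, state, a)
            value = 0.0
            for o in outcomes:
                value += p(state, a, o, x) * (
                    utility(o)
                    # Deeper levels are evaluated with the changed model
                    # p_next, so a model-degrading action is visible here.
                    + evaluate(transition(state, a, o), p_next, depth - 1))
            best = max(best, value)
        return best
    return evaluate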

Predicting your decision procedure isn’t enough

Mirroring the argument in wireheading, just using an accurate simulated model of the agent at each step in the decision recursion will not save the agent from insanity. If the agent is predicting changes to its model and then using changed models uncritically, that may only make the problem worse.

The decision theory algorithm assumes that the world-model the agent has is accurate and trustworthy. We’ll need to adapt the algorithm to account for world-models that may be untrustworthy.

The thing that makes this difficult is that we don’t want to limit changes to the world-model too much. In some sense, changing the world-model is the way that the agent improves. We even want to allow major changes to the world-model, like perhaps switching from a neural network architecture to something totally different.

Given that we’re allowing major changes to the world-model, we want to be able to trust that those changes are still useful. Once we predict a change to a model, how can we validate the proposed model?

Model Validation

One answer may be to borrow from the machine learning toolbox. When a neural network is trained, it is also tested on data that it hasn’t been trained on. This dataset, often called a validation set, checks that the network generalizes well and helps to avoid some common machine learning problems (such as overfitting).
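
As a reminder of what that looks like in practice, here is a standard holdout check sketched with scikit-learn; the particular model and split are arbitrary choices for illustration.

# Standard holdout validation: train on one split, then check
# generalization on data the network never saw.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
model.fit(X_train, y_train)

# A large gap between these two scores is the usual sign of overfitting:
# the network fits its training data but not the world.
print(model.score(X_train, y_train), model.score(X_val, y_val))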

To bring this into the agent model question, we could use the observations that the agent has made to validate the model. We would expect a good model to account for the observations the agent has actually made. If a model change is predicted, we could run the proposed model on past observations to see how it does. It may also be desirable to hold out certain observations from the ones generally used for deciding on actions, in order to better validate the model itself.

In the agent model formalism, this might look like:

function decide(state-s):
  max_a = none
  for each action a in available actions:
    utility[a] = 0
    for each outcome o in possible outcomes:
      if not valid_model(state-s, x):
        // the model in state-s fails validation, so score this branch as insanity
        utility[a] += Utility(insanity)
      else:
        utility[a] += p(state-s, a -> o; x) * Utility(o)
    end for
    if max_a == none or utility[a] > utility[max_a]:
      max_a = a
    end if
  end for
  return action max_a

function transition(old_state, action_a):
  return the new state obtained by taking action_a in old_state

function Utility(test_state):
  if test_state == insanity:
    return value(insanity) // some fixed low value

  current_value = value(test_state)
  future_value = value(transition(test_state, decide(test_state)))
  return current_value + future_value

In this formalism, we check whether the model is sane each time before we use it. The valid_model function determines whether the model described in state-s is valid given the observations x.
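
To make the idea more concrete, here is one naive sketch of how a valid_model check could lean on held-out observations, as suggested above. The names (held_out_observations, log_likelihood, current_model) and the tolerance are illustrative assumptions, not a worked-out proposal.

# Naive validity check: compare the proposed model's fit to held-out
# observations against the current model's fit.
import math

def valid_model(proposed_model, current_model, held_out_observations,
                log_likelihood, tolerance=math.log(2)):
    proposed = sum(log_likelihood(proposed_model, obs)
                   for obs in held_out_observations)
    baseline = sum(log_likelihood(current_model, obs)
                   for obs in held_out_observations)
    # Reject changes that make the held-out observations substantially
    # less likely; such changes are not endorsed by the evidence.
    return proposed >= baseline - tolerance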

Creating a function that can validate a model given a world state is no easy problem. The validation function may have to deal with unanticipated model changes, models that are very different from the current one, and models that operate using new ontologies.

It’s not totally clear how to define such a validation function, and if we could define one, that might solve most of the strong AI problem in the first place.

If we don’t care about strong improvements to our agent, then we may be able to write a validation function that disallows almost all model changes. By allowing only a small set of understandable changes, we could potentially create agents that we could be certain would not go insane, at the cost of being unable to grow significantly more sane than they start out. This may be a cost we want to pay.
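
A sketch of that conservative option, assuming purely for illustration that a proposed model change can be summarized as a structured description with a kind and a magnitude:

# Deliberately conservative validator: only a small whitelist of
# well-understood change types is permitted; everything else
# (architecture swaps, new ontologies) is rejected. The ModelChange
# structure and the thresholds are illustrative assumptions.
from dataclasses import dataclass

ALLOWED_KINDS = {"parameter_update", "add_training_data"}
MAX_PARAMETER_SHIFT = 0.01  # arbitrary small bound, for illustration only

@dataclass
class ModelChange:
    kind: str          # e.g. "parameter_update", "swap_architecture"
    magnitude: float   # e.g. the largest parameter shift the change causes

def conservative_valid_model(change: ModelChange) -> bool:
    if change.kind not in ALLOWED_KINDS:
        return False
    if change.kind == "parameter_update" and change.magnitude > MAX_PARAMETER_SHIFT:
        return False
    return True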