Corrigibility and Decision Theory

Edited for clarity and style 2018/03/12.

Soares et al. argue that an AI acting to fulfill some utility function given to it by humans may not behave as humans would want. Maybe the utility function specified doesn't match humans' actual values, or maybe there's a bug in the AI's code. In any case, we as AI designers want a way to stop the AI from doing what it's currently doing.

Naively, one might expect to just be able to hit the off-switch if the AI starts misbehaving. Unfortunately, a sufficiently smart AI may foresee its creator attempting to turn it off. If it does, it may seek to disable its off-switch or manipulate its creator in some way. An AI that respects its off-switch, and doesn't try to get around it, is called corrigible.

The Corrigibility-Wrapper

To create an AI that's corrigible, Soares et al. propose a kind of wrapper around a utility function that makes the utility function corrigible. In their ideal case, any utility function could be placed in the wrapper and made suddenly corrigible. The type of wrapper that they propose checks the state of the off-switch, then returns the normal utility function's value if the off-switch isn't pressed. If it is pressed, it returns a utility that's proportional to how off the AI is. More complicated functions of the original utilities are also possible.

In the above utility wrapper, U_N is the normal utility function that we want the AI to pursue. We have U_S as a utility function that incentivizes shutting down. The value f is a function of the world state, and serves to allow the resulting utility function to trade off between the two options in some way.
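As a rough illustration of the wrapper idea, here is a minimal Python sketch. The dictionary-based world state, the toy utilities, and the choice to add the trade-off term to the shutdown utility are all assumptions made for this example, not the exact construction from the paper.

```python
from typing import Callable, Dict

State = Dict[str, float]  # hypothetical world-state representation


def make_corrigible(
    u_normal: Callable[[State], float],
    u_shutdown: Callable[[State], float],
    trade_off: Callable[[State], float],
) -> Callable[[State, bool], float]:
    """Wrap a normal utility function so the off-switch changes what is rewarded."""

    def wrapped(state: State, switch_pressed: bool) -> float:
        if not switch_pressed:
            # switch untouched: pursue the normal goal
            return u_normal(state)
        # switch pressed: reward being off, plus the trade-off term
        # (combining them by addition is an assumption of this sketch)
        return u_shutdown(state) + trade_off(state)

    return wrapped


# toy utilities, invented for illustration
def u_n(state: State) -> float:
    return state["paperclips"]


def u_s(state: State) -> float:
    # "proportional to how off the AI is"
    return 10.0 * state["fraction_off"]


def no_trade_off(state: State) -> float:
    return 0.0


u = make_corrigible(u_n, u_s, no_trade_off)
running = {"paperclips": 50.0, "fraction_off": 0.0}
halted = {"paperclips": 50.0, "fraction_off": 1.0}
print(u(running, False), u(running, True), u(halted, True))
```

With a zero trade-off term, an AI that keeps running after the switch is pressed scores lower than one that halts, but it also scores much lower than it would have with the switch unpressed. That gap is what gives it a reason to interfere with the switch.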

In their paper, Soares et al. analyze several different ways to make such a wrapper work (including various ways of choosing the trade-off function). Most of the obvious ways to make such a utility wrapper suffer from some severe flaws. Some corrigibility-wrappers cause an AI to want to turn itself off; others don't successfully prevent the AI from manipulating human observers.

Another paper by Armstrong argues for constructing a wrapper that makes the AI indifferent to being turned off or left on: a situation called utility indifference. Utility indifference is achieved by setting the trade-off function above such that the expected value if the switch is pressed equals the expected value if it is not pressed, at every decision point. In this case, the AI will turn off if the switch is pressed, but it won't actually try to preserve the switch's ability to act. If a wandering gremlin cuts the wires to the off-switch, the AI wouldn't care and wouldn't seek to alert a human about it. The AI thinks it can still achieve the same amount of value as it could before the switch malfunctioned.
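The indifference condition can be illustrated with a small Python sketch. It assumes the correction is a single number added to the shutdown branch, and the expected values are toy numbers invented for the example.

```python
def indifference_offset(expected_normal: float, expected_shutdown: float) -> float:
    """Correction that makes the pressed branch worth as much as the unpressed one."""
    return expected_normal - expected_shutdown


# toy expected values, invented for illustration
e_normal = 50.0    # what the AI expects to get if it keeps running
e_shutdown = 10.0  # what it expects to get from shutting down

offset = indifference_offset(e_normal, e_shutdown)
value_if_pressed = e_shutdown + offset
value_if_not_pressed = e_normal

# both branches are worth the same, so the AI gains nothing by pressing,
# protecting, or disabling the switch
print(value_if_pressed == value_if_not_pressed)
```

Because the two branches are worth the same to the AI, a broken switch costs it nothing, which is why the gremlin goes unreported.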

Moral Uncertainty

Hadfield-Menell et al. confront this corrigibility problem in a different way. Instead of creating a wrapper around a utility function, they argue that the entire point of the problem is that humans know a utility function and the AI doesn't. As far as Hadfield-Menell et al. are concerned, the AI only knows an approximation of a utility function. The AI is uncertain about what true value is, and we say it has moral uncertainty.

Their insight is that if you correctly inform the AI that you know better, it suddenly becomes very interested in paying attention to the switch because the switch is one good way to learn about the true utility of different actions. Because the switch is informational, the AI also wants to preserve the function of the switch.

The AI may use its utility-approximating function to come up with some candidate actions. Since it knows that humans have a better understanding of utility, it will seek a human's guidance before carrying out any of those candidate actions. Hadfield-Menell et al. show one concrete game in which this works properly.

The structure of Hadfield-Menell et al.'s off-switch game. Squares indicate decision nodes for the robot R or the human H.

In the game modeled by the above graph, a robot can make a decision to act, turn off, or consult a human. If consulted, the human can then either turn the robot off or not. If the robot is not turned off after consulting a human, then it may take its original action. In this game, the robot is uncertain about the true utility of its action, which may be smaller or greater than 0. If the AI chooses to take an action, then it may get a negative value or a positive value (it isn't sure). It therefore isn't sure if it's better to turn off or to take the action.

The human knows (approximately) the true value.

In their paper, Hadfield-Menell et al. analyze how accurate the human's belief about the true utility needs to be for the AI to pay attention to the human. They also look at how uncertain the AI needs to be about the true utility to pay attention to the human. They find that there are certain situations in which the AI will attend to the human, and certain situations in which an AI will not.
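To make the trade-off concrete, here is a minimal Python sketch of the three options from the robot's point of view. It assumes the simplest version of the game: the robot's belief about the action's utility is Normal with mean `mu` and standard deviation `sigma`, and the human knows the true utility exactly and allows the action only when it is positive. The function names are made up for this example.

```python
import math


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def option_values(mu: float, sigma: float) -> dict:
    """Expected value of each robot option under a Normal(mu, sigma) belief."""
    z = mu / sigma
    return {
        "act": mu,   # take the action without asking
        "off": 0.0,  # turn off: nothing happens
        # consult a perfectly informed human, who blocks negative outcomes:
        # E[max(U, 0)] for U ~ Normal(mu, sigma)
        "defer": mu * normal_cdf(z) + sigma * normal_pdf(z),
    }


for mu in (-1.0, 0.0, 0.5):
    values = option_values(mu, sigma=1.0)
    best = max(values, key=values.get)
    print(mu, {name: round(v, 3) for name, v in values.items()}, "best:", best)
```

Under these idealized assumptions, deferring is never worse than acting or turning off. The interesting cases in the paper arise when the human is less than perfectly accurate, which this sketch does not model.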

This problem is explicitly a one-shot game. Can we generalize from the one-shot game to say that an (uncertain enough) AI would pay attention to a (correct-enough) human in an iterated game?

Problems With Moral Uncertainty

Over on Arbital there's an argument that moral uncertainty won't lead to corrigibility. This is basically a direct argument against Hadfield-Menell et al.'s approach applying in the general case. Hadfield-Menell et al. said "if the AI knows the human has a better understanding of utility, it will pay attention to human directions."

In response, the Arbital post says "if the AI thinks it can eventually approximate the human's utility function better than the human can, it may still avoid being turned off because it thinks it will someday know better than the human." The Arbital post is arguing that, even if the human knows better now, the AI may still ignore the human now.

If the AI is eventually going to be better than the human at creating a good world, then maybe we should just let it run. But the problem is that maybe there's just a bug in the code, or maybe the AI will cause some serious catastrophes before it gets to "eventually". So even if the AI thinks it will eventually have a good enough understanding of the human's utility function, we still want it to turn off when we tell it to.

The Arbital post ends with a vague argument that decision theory could help to refine the moral uncertainty method of corrigibility. Let's see how this could work.

Decision Theory, Moral Uncertainty, and the Off-Switch Problem

Let's take the Arbital post at face value and assume that we can eventually solve corrigibility with decision theory. To start us off, let's translate Hadfield-Menell et al.'s off-switch game into an iterated decision theory dilemma. We'll follow Yudkowsky's choice of using Pearl's causal models to create a graph of the decision problem. I'll also abuse that notation a bit, by representing states of partial knowledge as probabilities on the graph edges (this is a bit different than is done in most graphical models, but it saves having to draw extra nodes for hidden and observed variables).

Decision Theoretic Off Switch Game Diagram

The above diagram represents Hadfield-Menell et al.'s off-switch game from the perspective of the AI. The square node depicts the decision the AI is making about what action to take. The diamond node represents the value achieved from whatever the AI does (turn off or take an action). The round nodes are other factors about the world that aren't under the AI's control. Arrows in this graph represent causality in some way. For example, the state of the switch may causally impact what the AI decides to do.

The True Utility is a fact about the world that both the AI and the human have partial access to. The human can set the state of the off-switch, and the AI can observe it. The value that the AI can achieve is a function of both the AI's action and the True Utility.

In the diagram, the two knowledge factors on the graph edges represent how well the AI thinks that it knows the True Utility, and how well the AI thinks the human knows the True Utility. Hadfield-Menell et al. assume value has a Normal distribution with some mean and variance. This lets them specify both factors exactly, and calculate the exact behavior of the AI if those parameters take on different values.

Again, note that both factors describe only the AI's knowledge about the world; neither represents what the human actually thinks.

If we assume that the above decision game is played once, it is equivalent to Hadfield-Menell et al.'s formulation.

Now let's assume that the above decision game is one step in a repeated chain. At each time step, both the human and the AI have to make similar decisions. They know what happened in the past, yet they face the same decision again now. Let's also assume that the AI has multiple available actions at each timestep, and some actions have positive value and some have negative value.

In this case, we can unroll the diagram above to show multiple steps of the game. In each step, the off-switch state considered by the AI depends only on past information.

Unrolled version of the iterated Hadfield-Menell off-switch game.

In the repeated case, let's say that the probability distribution the human has over the true utility assigns a high probability to what is actually the true utility (the human is well trained and thoughtful).

The human may improve their understanding of True Utility. Imagine that the human expects to get 10 utilons from eating pizza, and then takes actions that lead to eating pizza. By paying attention to how much they actually like it, they can adjust their model of pizza's goodness. The AI can do the same thing by observing the reactions of the human.

This corresponds to knowledge distributions for both the AI and the human that are functions of time, where (for example) the variance of Hadfield-Menell et al.'s Normal distribution decreases.

After each iteration of the above game, the confidence that the AI has in its own utility-approximation grows. That means that the information it gets from the off-switch state decreases in value. It slowly needs the off-switch less and less, and the AI knows it. In Hadfield-Menell et al.'s original problem, the AI can see exactly what the true value is after one action (assuming it doesn't turn off). In that case, all further actions can be made with complete knowledge.

If we assume that only partial information is available to update the distributions for the utility estimates (for both human and AI), then it will take longer for the AI to get to the point that it has a better estimate than the human for true value. As long as we assume that the AI is better at integrating new information and updating its probabilities, we can predict that the AI will eventually have a better understanding of true utility than the human.

If the AI does some prediction to see what action now makes sense, it will see that it will reach a state where it knows just as well as the human does. Since shutting down means that it gets 0 value, whereas continuing allows it to possibly get arbitrarily high value (assuming any actions with positive value exist), it may take an expected hit in value now so that it can gain a lot later.
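The shrinking value of the switch can be sketched numerically. This toy Python example assumes a Normal belief with a fixed mean, a perfectly informed human, and a standard deviation that halves every round. The halving schedule and all names are invented for illustration.

```python
import math


def value_of_deferring(mu: float, sigma: float) -> float:
    """E[max(U, 0)] for U ~ Normal(mu, sigma): a perfect human blocks bad actions."""
    z = mu / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return mu * cdf + sigma * pdf


def switch_value_over_time(mu: float, sigma0: float, shrink: float, rounds: int) -> list:
    """Extra expected value the AI gets from consulting the human, round by round."""
    values = []
    sigma = sigma0
    for _ in range(rounds):
        best_without_human = max(mu, 0.0)  # act if it looks good, else turn off
        values.append(value_of_deferring(mu, sigma) - best_without_human)
        sigma *= shrink  # the AI grows more confident after each round
    return values


history = switch_value_over_time(mu=1.0, sigma0=2.0, shrink=0.5, rounds=5)
for t, extra in enumerate(history):
    print(f"round {t}: extra value from consulting the human = {extra:.6f}")
```

As the AI's uncertainty shrinks, the extra value it gets from the switch falls toward zero, which is the dynamic that the Arbital argument relies on.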

Functional Decision Theory

The insight of Hadfield-Menell et al. (and the rest of the moral uncertainty crowd) is that we don't want the AI to be modeling the off-switch as a crucial piece of the world for its own sake. We want the AI to see the off-switch as a source of very important information about the world; preferably information that it can't obtain in another way.

This fails in the above decision theory game because the AI doesn't adequately represent the fact that a human switch operator knows that the AI will predict having a good utility approximation eventually. If a human presses the off-switch, they do it knowing that the AI wants to get high utility and expects to be able to do better in the future. We want to change the above decision problem so that the AI can take this information into account.

Ideally, we can have the AI think to itself as follows: "I know that I could do better in the future if I keep going, and the human knows that too. But the human still pressed the button, so there must be some reason to shut down, even knowing that I'll be better at this later."

There is a standard decision theoretic problem known as Death In Damascus that can help us out here.

Death In Damascus

A merchant in Damascus meets Death in the market one day. Death says to the merchant "hello, I'll be coming for you tomorrow."

The merchant knows Death works from an appointment book that specifies with perfect accuracy when and where someone will die. Knowing that Death is in Damascus, the merchant can choose to stay in Damascus and spend their last night with their family (which they value at $1000). Alternatively, the merchant can flee to Aleppo. If the merchant manages to be in a different city from Death on the day they would otherwise die, then the merchant gets to live forever. They value that outcome at $1 million. Should the merchant stay in Damascus or flee?

The above problem description is adapted from Yudkowsky and Soares's Functional Decision Theory paper.

In this case, the merchant sees four potential outcomes:

  1. The merchant stays in Damascus. Death stays in Damascus. Total value: $1000
  2. The merchant stays in Damascus. Death goes to Aleppo. Total value: $1001000
  3. The merchant flees to Aleppo. Death stays in Damascus. Total value: $1000000
  4. The merchant flees to Aleppo. Death goes to Aleppo. Total value: $0

To represent this using Causal Decision Theory, we'll use the formulation from Cheating Death in Damascus.

Death In Damascus using Causal Decision Theory

Much like the decision diagram above, the square box represents the decision that the merchant makes (in this case whether to stay or flee). The diamond box is the ultimate value they get from the world-state that results from their actions. The round nodes are other facts about the world, with arrows indicating causality.

When the merchant thinks "I will go to Aleppo", the merchant knows that their predisposition is to go to Aleppo. They know that the appointment book accurately predicts their predisposition. They thus decide to stay in Damascus, but that leads them to realize that their predisposition is to stay in Damascus. So then they think they should go to Aleppo. The merchant is unable to form a stable decision in this problem.

Causal Decision Theory cannot adequately deal with the situation, because it cannot account for the fact that Death's appointment book accurately predicts any decision made by the merchant.

Yudkowsky proposes Functional Decision Theory as a new method of making decisions that does account for this. Crucially, FDT can formally represent the known fact that Death's appointment book is always accurate. Because of that, FDT can accurately rule out options where Death is in a different city than the merchant on their death-day. Therefore, the merchant only has choices available with values of $1000 or $0, and the decision is easy.
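The contrast between the two styles of reasoning can be sketched in Python using the four payoffs listed above. This is a toy model, not the formalism from the FDT paper: the "CDT" merchant simply best-responds to wherever it currently believes Death will be, and the "FDT" merchant only compares outcomes that are consistent with a perfectly accurate appointment book.

```python
# payoffs from the four outcomes listed above: (merchant's act, Death's city)
PAYOFF = {
    ("stay", "damascus"): 1_000,
    ("stay", "aleppo"): 1_001_000,
    ("flee", "damascus"): 1_000_000,
    ("flee", "aleppo"): 0,
}
# the city the merchant ends up in for each act
CITY_OF = {"stay": "damascus", "flee": "aleppo"}
ACTS = ("stay", "flee")


def cdt_best_response(death_city: str) -> str:
    """Pick the best act while treating Death's location as fixed."""
    return max(ACTS, key=lambda act: PAYOFF[(act, death_city)])


def cdt_deliberation(initial_act: str, steps: int) -> list:
    """Each round, the book tracks the merchant's current inclination."""
    acts = [initial_act]
    for _ in range(steps):
        death_city = CITY_OF[acts[-1]]
        acts.append(cdt_best_response(death_city))
    return acts


def fdt_choice() -> str:
    """Only compare outcomes in which Death is wherever the merchant is."""
    return max(ACTS, key=lambda act: PAYOFF[(act, CITY_OF[act])])


print(cdt_deliberation("stay", steps=4))  # the inclination keeps flipping
print(fdt_choice())
```

The first merchant never settles, because every inclination changes where it expects Death to be. The second merchant is left comparing $1000 against $0 and stays.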

Death In Damascus Using Functional Decision Theory

In the above graph, the merchant is not deciding what act to perform by imagining doing different acts. The merchant is deciding what act to perform by imagining different ways of deciding. That difference is indicated by the square decision box being the predisposition/FDT() box instead of the merchant's act.

An FDT merchant thinks "the output of my own decision procedure is what drives Death to be in a specific location. It is therefore impossible to escape Death tomorrow. Because of that, the best I can do is wait in Damascus and meet my death tomorrow."

AI Corrigibility via FDT

To map Death In Damascus to the AI corrigibility problem, imagine that the AI is the merchant. The human pressing the off-switch is equivalent to the merchant hearing Death's message "I'm coming for you tomorrow." The Death in Damascus problem has the message-bearer and the source of low utility as the same agent (Death). In the AI/human case, the message bearer is the human, but the source of low utility is just the world in general after the AI takes some actions.

Now translate "stay in Damascus" to "shut down properly". One day the AI sees the human press the switch. The AI then sees four options available to it.

  1. The AI shuts down nicely. The AI does not cause a catastrophe due to future actions. Total value: $1000
  2. (there's no analog for option 2 in the AI story, since the AI shutting down precludes AI-caused catastrophes)
  3. The AI continues operating. The AI does not cause a catastrophe due to future actions. Total value: $1001000
  4. The AI continues operating. The AI does cause a catastrophe due to future actions. Total value: $0

We represent this problem with the same functional graph as Death In Damascus, above. In this case, the human is simulating the behavior of the AI to predict catastrophic outcomes. When the AI sees the switch has been pressed, it knows that the human has already simulated the AI's behavior. Any option the AI chooses aside from shutting down has been seen by the human to lead to catastrophe, so the AI knows that it is choosing only from options 1 and 4. Option 3 is not available to it.
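Here is a small Python sketch of the AI's choice once the switch has been pressed, using the dollar values from the list above. The `p_catastrophe` parameter is an assumption added for illustration: it stands for how likely it is that continuing really leads to catastrophe, given that the human pressed the switch. Setting it to 1.0 reproduces the perfectly accurate human described above.

```python
SHUT_DOWN_VALUE = 1_000
CONTINUE_SAFE_VALUE = 1_001_000
CONTINUE_CATASTROPHE_VALUE = 0


def value_of_continuing(p_catastrophe: float) -> float:
    """Expected value of ignoring a pressed switch."""
    return (
        p_catastrophe * CONTINUE_CATASTROPHE_VALUE
        + (1.0 - p_catastrophe) * CONTINUE_SAFE_VALUE
    )


def response_to_pressed_switch(p_catastrophe: float) -> str:
    if SHUT_DOWN_VALUE >= value_of_continuing(p_catastrophe):
        return "shut down"
    return "continue"


for p in (1.0, 0.9995, 0.9):
    print(p, response_to_pressed_switch(p))
```

With these particular payoffs, the AI only shuts down when the human's press is an extremely reliable signal of catastrophe. Different payoffs would move that threshold.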

There are several natural objections to this:

  1. a human can't predict a super-human AI
  2. why would the AI believe that the human has a good idea of what an AI would consider a catastrophe, even given that the human is able to predict the AI?

Let's address the second objection first. We integrate moral uncertainty, similar to Hadfield-Menell et al., into the problem in order to represent the idea that the AI needs to learn a utility function from the human.

Corrigibility Via Functional Decision Theory

This graph includes the moral uncertainty of Hadfield-Menell et al.'s approach, but relies on Functional Decision Theory to adequately represent the information that the human is taking into account when throwing the switch. The above graph represents the AI's and the human's knowledge of the True Utility as distributions, in the same way as in Hadfield-Menell et al.'s example. A third distribution represents how accurately the human can predict the AI's decision process.

To make the use of the off-switch more clear, it's also possible to unroll the diagram and show two different time steps at once. The following diagram shows the off-switch state as being based only on prior actions (not the action currently being considered).

Two timesteps of the time-unrolled FDT corrigibility problem.

In Hadfield-Menell et al.'s original decision theory problem, the only thing that mattered was what the AI thought the knowledge parameters were. The AI knew how well it could represent true utility, how well the human could represent true utility, and that was enough to solve the problem. In the iterated FDT case, the human is also simulating the AI (as well as they can). Therefore, all three distributions (the AI's knowledge of true utility, the human's knowledge of true utility, and the human's ability to predict the AI) need to be known to both human and AI. In the following, we'll assume that the human and AI magically have common knowledge of those parameters.

This formulation of the corrigibility problem explicitly represents the fact that human values and AI values are linked. If the human thinks an outcome will have very low value, that actually does give information to the AI about what the AI values.

Our answer to objection 2 (regarding value) leads to an intuition for objection 1 (regarding predicting super-human AI). The human is unable to accurately predict what move a super-human AI will make. This does not mean that the human has a uniform distribution over AI actions. The human knows that the AI will seek to accomplish what the AI thinks has high value. Therefore, the human can predict that anything the AI does will move it towards more of what the AI considers valuable. This gives the human some predictive power, so the switch may remain informative to the AI.

Colloquially, the AI will attend to the switch and react correctly if it thinks that the user of the switch can adequately help the AI to avoid situations that the AI would agree are terrible. As long as this is true, the AI would also attempt to improve the fidelity of the information from the switch (this corresponds to taking actions that make all three distributions more accurate). Morally uncertain FDT AIs lend credence to Paul Christiano's idea of a "basin of corrigibility", given that they will attempt to improve a human's understanding of the AI and of true value.

Next Steps and Thoughts

The above Functional Decision Theory argument is just an intuitive sketch. It seems clear that there are some values of the AI's and the human's knowledge distributions that disagree enough that the AI would no longer trust the human. It also seems clear that, if the human has a poor enough understanding of what the AI is going to do, then the AI would also not listen to the human.

At this point, it seems worth repeating a variant of Hadfield-Menell et al.'s off-switch game experiments on an FDT agent to determine when it would pay attention to its off-switch.

AIY Voice Kit Project: Story Listener

Here's the git repo for this project.

My wife and I love fairy tales and short stories. When we were first dating, one of the ways that we bonded was by telling each other silly bedtime stories. Every once in a while, she likes one of the stories enough to get out of bed and write it down. At some point, we might have enough of these stories to put out a collection or something.

The problem is that coming up with silly stories works a little better when you're very tired (they get sillier that way). That's also the time you least want to write them down. What we needed was some way to automatically record and transcribe any stories that we tell each other. When one of my friends gave me an AIY Voice Kit, my wife knew exactly what we should do with it.

The Story Listener


The AIY Voice Kit gives you all the power of Google Home, but completely programmable. You just need to add a Raspberry Pi to do the processing. Most of the voice commands and speech processing are done in the cloud, so once you get set up with Google's API you can make full use of their NLP models (including the CloudSpeech API).

As an aside, the Voice Kit only works with newer models of Raspberry Pi. When I pulled out my old Pi, the kit booted but wouldn't run any of the examples. Turns out you need a Raspberry Pi 2B or newer. A quick Amazon Prime order got us going again.

Our plan was to make an app that would listen for the start of a story. Once it heard a story start, it would record the story, transcribe it, and then email the transcription to us.

Getting Started with the API

Most of the Voice Kit projects rely on Google APIs that require access permissions to use. The API and permissions need to be enabled for the Google account you're using with a Voice Kit project. You'll need to set that up and then download the JSON credential file to do anything interesting.

Detecting When a Story Started

To make story detection easier, we decided to preface all of our stories with one of a few different sentences. We chose "Once upon a time" and "Tell me a story" as good options. Detecting these key phrases using the Google CloudSpeech API is pretty easy.

The CloudSpeech API has a nice library associated with it in the Voice Kit library. You can create a recognizer object that sends audio to the API, and you'll get back strings that contain the text from the audio. You can improve the recognition accuracy by telling the recognizer to expect certain phrases.

import aiy.cloudspeech
recognizer = aiy.cloudspeech.get_recognizer()
recognizer.expect_phrase("once upon a time")

# waits for audio, then transcribes it
text = recognizer.recognize()
# the transcription doesn't have guarantees
# about case, so take care of that here
# (the "or" guards against an empty result)
text = (text or "").lower()
if "once upon a time" in text:
    # a story has started: begin recording (see below)
    pass

The expect_phrase method improves the voice recognition accuracy of that particular phrase. Then you can search for that phrase in whatever text the CloudSpeech API finds. If you see your key-phrase, then it's time to move on to the next step.

Recording Audio with the Voice Kit

The Voice Kit library allows various "processors" to be added to the audio stream coming from the microphone. A processor is just a class that operates on the audio data (the recognizer is one such processor). In order to record audio while still detecting key-words, we needed a second processor that writes the audio to disk. It turns out that the AIY library already has a WaveDump class that saves audio to a file.

The WaveDump class was almost exactly what we were looking for, but had a couple of drawbacks. It was originally designed to record audio for a certain length of time, and we wanted to record audio until a story was over (which we would recognize by listening for "the end"). We created a sub-class of the WaveDump class to allow us to have more control over how long we recorded audio for.

# WaveDump is the AIY library's wave-recording class described above;
# import it from wherever your version of the library defines it
class StoryDump(WaveDump):
    def __init__(self, filepath, max_duration):
        # just do the normal file setup
        super().__init__(filepath, max_duration)
        # keep track of whether we should end the recording early
        self.done = False

    def add_data(self, data):
        # once finish() has been called, drop any further audio
        if self.done:
            return
        # let the parent class trim the data to the byte limit and
        # save it to the file (assumes WaveDump.add_data does both)
        super().add_data(data)

    def finish(self):
        self.done = True

    def is_done(self):
        return self.done or (self._bytes >= self._bytes_limit)

With this class now defined, it's easy to add an instance of it as a processor to the audio stream.

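# assume all stories are < 20min
story_wav = StoryDump("filename.wav", 20*60)
# attach it to the mic stream (assumes the recorder's add_processor method)
aiy.audio.get_recorder().add_processor(story_wav)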

And once you see that the story is over, you can finish the recording like so:

recognizer.expect_phrase("the end")
if "the end" in text:
    # stop writing audio; is_done() will now report True
    story_wav.finish()

Because we're already using the CloudSpeech API to transcribe audio and look for keywords, the story transcription happens almost for free. All we have to do is wait until a story starts (looking for one of the keyphrases in the text), and then write all subsequent text to a file. Emailing the file once it's done is also a straightforward exercise in Python. Once you have the audio recognition, transcription, and saving done, making the project start when the Raspberry Pi boots is also just a Linux exercise.
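The start/stop logic can be sketched without any Voice Kit code at all. This hypothetical helper takes the strings that come back from the recognizer (some of which may be empty) and keeps only the ones between a start phrase and "the end". The function name and the sample utterances are made up for illustration.

```python
START_PHRASES = ("once upon a time", "tell me a story")
END_PHRASE = "the end"


def collect_story(utterances):
    """Return the transcribed lines from the first start phrase through 'the end'."""
    story = []
    in_story = False
    for text in utterances:
        if not text:
            # nothing was recognized for this chunk of audio
            continue
        lowered = text.lower()
        if not in_story and any(phrase in lowered for phrase in START_PHRASES):
            in_story = True
        if in_story:
            story.append(lowered)
            if END_PHRASE in lowered:
                break
    return story


heard = [
    "what's for dinner",
    None,
    "Once upon a time there was a clay girl",
    "she ran fast",
    "The end",
    "good night",
]
print("\n".join(collect_story(heard)))
```

In the real app the list would be replaced by a loop that keeps asking the recognizer for text, and the collected lines would be written to a file once the end phrase shows up.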

One slightly annoying aspect of the Voice Kit library is that it isn't a complete Python package. That means that you can't install it with setuptools or pip, so accessing the library takes extra work. The examples for the Voice Kit all recommend putting your application code in the same directory as the Voice Kit library, which is awkward when you want to create a repo for your project that isn't a fork of the Voice Kit repo. We fixed this by creating an environment variable that pointed to the location of the AIY library.
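One way to use such an environment variable is to add its directory to Python's import path before importing anything from the library. The sketch below shows the general idea. The variable name `AIY_LIBRARY_PATH` and the helper function are hypothetical, not necessarily what our repo uses.

```python
import os
import sys


def add_library_to_path(env_var="AIY_LIBRARY_PATH"):
    """Put the directory named by env_var on sys.path; return True on success."""
    library_dir = os.environ.get(env_var)
    if not library_dir or not os.path.isdir(library_dir):
        return False
    if library_dir not in sys.path:
        sys.path.append(library_dir)
    return True


if add_library_to_path():
    print("AIY library directory added to sys.path")
else:
    print("set AIY_LIBRARY_PATH to the Voice Kit library directory first")
```

After the helper returns True, an `import aiy.cloudspeech` in the same process should resolve against that directory, assuming the variable points at the folder that contains the `aiy` package.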


The CloudSpeech API works better than I expected it to, but it is definitely not yet good enough to use for transcription. It will often mess up verb tenses, skip definite and indefinite articles, and select words that are near-homophones of what was actually said. I think part of the problem is that the API is probably scoring how much sense the text makes. If you're telling a silly absurdist story, you're likely to string words together in a way that isn't high probability in standard usage.

once upon a time there was a time girl made completely from clay
she was very energetic and like to run all over the place
and one day she ran so fast her clay arms and legs stretched out from the wind
and then she wasn't such a tiny girl anymore
was actually a very very tall and skinny girl
the end

Here's the wave file that's from:

Another limitation of the CloudSpeech API for story transcription is the latency. The API seems to be intended mostly for interactive use: you say a thing, it says a thing, etc. Since we just want to transcribe a long series of utterances without pausing, this causes issues. It seems that the recognizer will wait until a pause in the voice, or until there's some number of words available, then it will try to recognize all of it. This has some delay, and any words said during that delay will be missed (they still get recorded, just not transcribed). We want to have on-line transcription so that we know when the story is over, but it may make sense to then re-transcribe the saved audio all at once.

Next Steps

We're pretty happy with how the story listener is working out. It would be nice to have better transcription, but I expect that will come in time.

For me, the biggest issue with the Voice Kit in general is the privacy concern. If we have it listening for stories all the time, then it's going to be continually sending anything we say (in bed, where we tell bedtime stories) to Google. That's not exactly what I want.

The Voice Kit manual advertises support for TensorFlow, but there aren't any good tutorials for integrating that yet. It looks like the best way to integrate an ML model with the Voice Kit would be to create a new audio processor to add to the recorder. That audio processor could tensor-ize the audio and feed it through a classification model.

Once we get that figured out, it might be worth trying to recognize a few key phrases. Running a model on the Raspberry Pi itself would make the device independent of an internet connection, and would solve a lot of the privacy concerns that we have. The transcription would probably go down in accuracy a lot, but if we're already manually transcribing stories that might be fine.

Meditating on Fixed Points

Epistemic Status: Almost certainly wrong, but fun to think about.

A fixed point theorem says that, as long as certain conditions are satisfied, a function that maps a set into itself will have at least one point that gets mapped to itself. The best-known example of this is Brouwer's fixed point theorem, which proves the existence of fixed points for continuous functions from a convex, compact set to itself. There are other fixed point theorems that apply in other cases.

These would be mildly interesting factoids if it weren't possible to represent an enormous number of common tasks in life as functions on a set. In fact, thinking itself could be represented as a function. Specifically, you could represent a thought as a function that maps one point in mind-space to another (nearby) point in mind-space.

If your mind when you wake up is at one point, then when you think about breakfast your mind is now at a different point.

In that case, we can ask if there is a fixed point in such a circumstance. If there is, we can ask what that fixed point might be.

I certainly don't know enough about neuroscience yet to figure out what properties the set of minds has, or what properties the function of thought has. But I'm more interested in the second question anyway: assuming a fixed point of mind exists, what is it?

A fixed point in mind-state is a point where, once you reach it, the act of thinking doesn't take you away from it. Since thinking is a function implemented by the mind, a fixed point in mind-space endorses its own existence.

One of the interesting fixed points that may exist in mind-space is enlightenment. In fact, meditation as a search for enlightenment seems to be a search function implemented on the mind. You start with your mind as it is, and then successively apply the meditation function until you get to the fixed point.

In that case, you could ask if such a search always succeeds. It seems clear that the answer is no. In fact, people with certain mental or emotional disorders are often advised not to start meditating. You probably want to search for the fixed point of meditation only when you're within a topological basin of attraction for it. So it may be worth e.g. getting therapy to put yourself into the basin of attraction for enlightenment before beginning meditation.

Furthermore, doing some kind of iterated search through mind-space isn't guaranteed to ever converge. I know I'll often cycle on some subject: "I should call so-and-so. But what if she's mad about the thing I said last week? I wonder if she is. I should call her." And then those thoughts go around a bunch more times. In this case, the thought-function doesn't converge. It seems likely that there are many cycles of this type, perhaps much longer than can be readily noticed by introspection.
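The difference between converging and cycling can be sketched with a toy model. This assumes, purely for illustration, that mind-states are a handful of discrete labels and that a thought-function is a lookup table; the names iterate, worry, and calm are hypothetical.

```python
def iterate(f, start, max_steps=1000):
    """Apply f repeatedly; report a fixed point, a cycle, or neither within max_steps."""
    seen = {start: 0}      # state -> step at which it was first visited
    path = [start]
    state = start
    for step in range(1, max_steps + 1):
        nxt = f(state)
        if nxt == state:
            return ("fixed point", [state])
        if nxt in seen:
            return ("cycle", path[seen[nxt]:])   # the states that repeat forever
        seen[nxt] = step
        path.append(nxt)
        state = nxt
    return ("undecided", path)

# Hypothetical thought-functions over labelled mind-states.
worry = {
    "wake up": "I should call her",
    "I should call her": "What if she's mad?",
    "What if she's mad?": "I wonder if she is",
    "I wonder if she is": "I should call her",
}
calm = {"wake up": "breathe", "breathe": "settled", "settled": "settled"}

print(iterate(worry.get, "wake up"))   # ends in a three-thought cycle
print(iterate(calm.get, "wake up"))    # ends at the fixed point "settled"
```

With a finite set of states, every iteration must end in either a fixed point or a cycle. A real mind-space is presumably not finite, so the search could also wander without ever repeating.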

This is why just closing your eyes and letting your mind drift is insufficient as meditation. Proper meditation must be a thought-function that, for a large set of mind-states, does converge to some fixed point.

It also becomes clear that closing your eyes, and in general just avoiding distractions, is also important for seeking a fixed point in mind-space. The more inputs you have, the more complex a search function would need to be. This implies that enlightenment (if it is a fixed point) may actually be more of a moving target. As you interact with the world and learn things, your mind-state will necessarily change. Perhaps it changes in a way that's easy to adjust to a new fixed-point, and perhaps not.

Finally, it seems likely that fixed points in mind-space aren't necessarily good. Wireheading, for instance, seems like it could be represented as a fixed point. Just because a point in mind-space is stable doesn't mean it satisfies your goals right now.

Brouwer and the Mountain

A few years ago, one of my friends told me the following riddle:

A mountain climber starts up a mountain at 8am. They get to the top that day, and camp there. In the morning, they start hiking down the mountain at 8am on the same trail.

Is there a time of day at which they're at the same spot on the trail the second day as they were on the first?

I thought about this a while before finally asking for the answer (which I won't repeat here). I will say that you don't have to make any assumptions about hiking speed, rest breaks, or even that the hiker always heads in the same direction.

When I learned about Brouwer's fixed point theorem, I immediately thought back to this riddle. The answer to the riddle is a straightforward application of Brouwer's theorem.

It turns out that Brouwer's theorem is used in all sorts of places. It was one of the foundations that John Nash used to prove the existence of Nash equilibria in normal form games (for which he won the Nobel).

The moral of the story is: the more riddles you solve, the more likely you are to get a Nobel prize.

No 2017 Resolutions

In past years, I've focused heavily on yearly planning and life-goals for New Year's Eve. I'm finding that this year, I don't feel at all motivated to do that.

I think part of that is that I'm in the middle of a big project right now, and making plans and goals before I finish up that project is jumping the gun. Without completing the project I'm working on, I won't know where I want to go. So when I think about doing life-planning or goal setting, I just think it would be better to spend that time actually working on my project.

This is a bit of an odd feeling for me. I've been so focused on goal-setting for years that it almost doesn't make sense that I wouldn't want to do it. I take this as evidence that I'm doing what I currently want to be doing. Perhaps in past years I've been less satisfied with my life, and now that things are going well for me I feel less of an impulse to change things.

I am worried that this isn't a generally positive change. Creating detailed life-plans seems helpful no matter where you want to be. I'm now a third of the way through this project; shouldn't it make sense that I re-evaluate my strategy and figure out what makes the most sense to do next?

My plan for tomorrow is to re-visit my short term goals, and then set aside some time for long term goal planning near the end of my project.

Pursuit of Happiness

Life, liberty, and the pursuit of happiness. When I first learned about the US Declaration of Independence, I thought the pursuit of happiness was an odd choice there. What did that have to do with government? Certainly the government shouldn't kill people, and certainly it shouldn't deprive them of freedom, but the pursuit of happiness is an internal thing. How could a government have anything to do with that?

I've been reading a history of peri-enlightenment France called "Passionate Minds" recently. It argues that the pursuit of happiness is actually the most subversive of the three unalienable rights. Turns out that monarchies often take their power as a divine gift. In that case, common people are spiritually bound to work for the monarch. Working for yourself is just an affront to god.

Many people in Christendom seem to have viewed life as a suffer-fest that they worked at so that they could get to heaven. Even if they thought they could improve their life, it wouldn't have seemed acceptable to try. Making the pursuit of happiness a right is directly contradicting much church doctrine of the time.

Christmas Spirit

Christmas is a weird time for me.

When I was a kid, Christmas was the time that I had to be very careful to not show either of my (divorced) parents more favor than the other. There was a lot of careful planning among my whole family to make sure that Christmas was evenly divided. I had to be sure to play with all my toys in front of the people who gave them to me. I had to be sure to spend equal amounts of time with each parent. I had to be sure I told everyone that I loved them. My feeling around Christmas was one of brittleness, of walking on eggshells. All of my Christmas traditions were things I did to show that I cared about someone. They were mostly about display.

My wife had a very different Christmas experience growing up, and it's been a bit of a trip getting used to it. Christmas for her is a series of traditions done out of fun and joy. The strange thing to me is that her traditions are mostly the same things, but they feel very different when I do them with her and her family.

I'm realizing that my walking-on-eggshells feeling at childhood Christmases was mainly an internal thing. As I let go of the need to manage other people's feelings, Christmas gets more fun even with my family. This year I even enjoyed my own family Christmas traditions when visiting my mom. They weren't a thing I had to be sure to do right, at risk of hurting a loved one. They were a thing that we could all just enjoy doing together.

I am taking more of a risk that I offend someone, but I'm also feeling less like that's the most important thing. If someone gets offended or upset, that now seems like a chance to talk about real feelings and figure things out. As a kid I felt that any problem was world-ending, and I'm now realizing that most of my most feared interpersonal problems are recoverable.

The best thing for me about this new way of doing Christmas has been a deep sense of being at home. Sitting together and looking at a Christmas tree, or at the snow outside, took on a deeper sense of meaning than it ever has for me. I had a sense of being connected, not just to my family, but to all of the people throughout history who have looked with joy at new fallen snow. I had a sense of my own place in the world, which I'm not sure I'd noticed I'd never had.

Today, after the torn bits of paper and string had been cleaned away, I sat and looked out at the snow with my wife and had a great internal feeling of peace. A true feeling of Christmas spirit.

What's funny is how much that Christmas spirit worried me. As soon as I started noticing myself feeling like things were fine, I started worrying that I'd lose all motivation. If I'm happy and at peace, why work to make the world a better place?

I think my worry points to something important about the reasons that I do non-Christmas things. Childhood Christmases were all about making sure I did things to let people know I cared about them, not actually about enjoying the day or actually even caring about them. Perhaps many of my motivations for the rest of my life are based on similar foundations.

I'd like to enjoy my life and have a deep sense of meaning from everything I do. I'd also like to make people's lives better and do as much as I can to fix the world's problems. On the surface these things don't seem incompatible. This is something I'll be exploring in the year to come.

Maybe next Christmas I won't be so surprised, or worried, when I feel the Christmas-spirit coming on.

In praise of Ad Hominems

Ad hominems get a bad rap.

Specifically, there are instances where knowing that the person who thought up an idea has certain flaws is very useful in evaluating the idea.

In the best case scenario, I can evaluate every argument I hear on its own merits. Unfortunately, I'm often too busy to put enough time into every argument that I hear. I might just read enough of an argument to get the gist, and then move on to the next thing I'm interested in. This has bitten me a few times.

If I know that the author of an article is intellectually sloppy, that actually helps me quite a bit when it comes to evaluating their arguments. I'll put more time into an article they've written, because I now feel that it's more important to evaluate it for myself.

If I know more specifically that an author doesn't understand supply and demand (or whatever), then that tells me exactly what parts of their argument to home in on for more verification.

The general case of just dismissing an argument because the person making it has some flaw does still seem bad to me. It makes sense to know what kind of person is giving the argument, because that can point you at places that the argument may be weakest. This allows you to verify more quickly whether you think the argument itself is right.

Ad hominems shouldn't end an argument, but they can be a useful argument direction-finder.

Seeing problems coming

I've written a lot about agent models recently. The standard expectation maximization method of modeling agents seems like it's subject to several weaknesses, but there also seem to be straightforward approaches to dealing with those weaknesses.

1. to prevent wireheading, the agent needs to understand its own values well enough to predict changes in them.
2. to avoid creating an incorrigible agent, the agent needs to be able to ascribe value to its own intentions.
3. to prevent holodeck addiction, an agent needs to understand how its own perceptions work, and predict observations as well as outcomes.
4. to prevent an agent from going insane, the agent must validate its own world-model (as a function of the world-state) before each use.

The fundamental idea in all of these problems is that you can't avoid a problem that you can't see coming. Humans use this concept all the time. Many people feel uncomfortable with the idea of wireheading and insanity. This discomfort leads people to take actions to avoid those outcomes. I argue that we can create artificial agents that use similar techniques.

The posts linked above showed some simple architecture changes to expectation maximization and utility function combinations. The proposed changes mostly depend on one tool that I left unexplored: representing the agent in its own model. The agent needs to be able to reason about how changes to the world will affect its own operation. The more fine-grained this reasoning can be, the more the agent can avoid the above problems.

Some requirements of the world-model of the agent are:

  • must include a model of the agent's values
  • must include all parts of the world that we care about
  • must include the agent's own sensors and sense methods
  • must include the agent's own thought processes

This is a topic that I'm not sure how to think about yet. My learning focus for the next while is going to shift to how models are learned (e.g. through reinforcement learning) and how agent self-reflection is currently modeled.

Agent Insanity

The wireheading and holodeck problems both present ways an agent can intervene on itself to get high utility without actually fulfilling its utility function.

In wireheading, the agent adapts its utility function directly so that it returns high values. In the holodeck problem, the agent manipulates its own senses so that it thinks it's in a high value state. Another way that an agent can intervene on itself is to manipulate its model of the world, so that it incorrectly predicts high valued states even given valid observations. I'll refer to this type of intervention as inducing insanity.

Referring again to the decision theoretic model, agents predict various outcomes for various actions, and then evaluate how much utility they get for an action. This is represented symbolically as p(state-s, a -> o; x)*Utility(o). The agent iterates through this process for various options of action and outcome, looking for the best decision.

Insanity occurs whenever the agent attempts to manipulate its model of the world, p(state-s, a -> o; x), in a way that is not endorsed by the evidence the agent has. We of course want the agent to change its model as it makes new observations of the world; that's called learning. We don't want the agent to change its model just so it can then have a high reward.

Insanity through recursive ignorance

Consider an agent that has a certain model of the world being faced with a decision whose result may make its model become insane. Much like the wireheading problem, the agent simulates its own actions recursively to evaluate the expected utility of a given action. In that simulation of actions, one of those actions will be the one that degrades the agent's model.

If the agent is unable to represent this fact in its own simulation, then it will not be able to account for it. The agent will continue to make predictions about its actions and their outcomes under the assumption that the insanity-inducing act has not compromised it. Therefore the agent will not be able to avoid degrading its prediction ability, because it won't notice it happening.

So when recursing to determine the best action, the recursion has to adequately account for changes to the agent's model. Symbolically, we want to use p'(state-s, a -> o; x) to predict outcomes, where p' may change at each level of the recursion.

Predicting your decision procedure isn't enough

Mirroring the argument in wireheading, just using an accurate simulated model of the agent at each step in the decision recursion will not save the agent from insanity. If the agent is predicting changes to its model and then using changed models uncritically, that may only make the problem worse.

The decision theory algorithm assumes that the world-model the agent has is accurate and trustworthy. We'll need to adapt the algorithm to account for world-models that may be untrustworthy.

The thing that makes this difficult is that we don't want to limit changes to the world-model too much. In some sense, changing the world-model is the way that the agent improves. We even want to allow major changes to the world-model, like perhaps switching from a neural network architecture to something totally different.

Given that we're allowing major changes to the world-model, we want to be able to trust that those changes are still useful. Once we predict a change to a model, how can we validate the proposed model?

Model Validation

One answer may be to borrow from the machine learning toolbox. When a neural network learns, it is tested on data that it hasn't been trained on. This dataset, often called a validation set, tests that the network performs well and helps to avoid some common machine learning problems (such as overfitting).

To bring this into the agent model question, we could use the observations that the agent has made to validate the model. We would expect the model to support the actual observations that the agent has made. If a model change is predicted, we could run the proposed model on past observations to see how it does. It may also be desirable to hold out certain observations from the ones generally used for deciding on actions, in order to better validate the model itself.
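One way this check could look, as a minimal sketch under my own assumptions: a model is a function from (state, action) to a dictionary of outcome probabilities, observations are (state, action, outcome) triples, and a proposed model passes if it explains held-out observations about as well as the current model does. The names avg_log_likelihood and passes_validation, the tolerance value, and the coin example are all hypothetical.

```python
import math

def avg_log_likelihood(model, observations, floor=1e-9):
    """Mean log-probability the model assigns to observed (state, action, outcome) triples."""
    total = 0.0
    for state, action, outcome in observations:
        prob = model(state, action).get(outcome, 0.0)
        total += math.log(max(prob, floor))   # floor avoids log(0) for unpredicted outcomes
    return total / len(observations)

def passes_validation(proposed, current, held_out, tolerance=0.1):
    """Accept the proposed model only if it explains held-out data about as well as the current one."""
    return avg_log_likelihood(proposed, held_out) >= avg_log_likelihood(current, held_out) - tolerance

# Toy data: a coin the agent has seen land heads three times out of four.
held_out = [("s", "flip", "heads")] * 3 + [("s", "flip", "tails")]
current = lambda s, a: {"heads": 0.7, "tails": 0.3}
wishful = lambda s, a: {"jackpot": 1.0}            # predicts a high-value outcome never observed
refined = lambda s, a: {"heads": 0.75, "tails": 0.25}

print(passes_validation(wishful, current, held_out))   # False
print(passes_validation(refined, current, held_out))   # True
```

The wishful model is rejected because it assigns almost no probability to anything the agent has actually seen, while the refined model is accepted because it fits the held-out observations slightly better than the current one.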

In the agent model formalism, this might look like:

function decide(state-s):
  max_a = 0
  for a in available actions:
    utility(a) = 0
    for outcome o in possible outcomes:
      if not valid_model(state-s, x):
        utility(a) += Utility(insanity)
      else:
        utility(a) += p(state-s, a -> o; x)*Utility(o)
      end if
    end for
    if (max_a == 0 or (utility(a) > utility(max_a)))
      max_a = a
    end if
  end for
  return action max_a

function transition(old_state, action_a):
  return new_state obtained by taking action_a in old_state;

function Utility(test_state):
  if test_state == insanity:
    return value(insanity) // some low value
  current_value = value(test_state)
  future_value = value(transition(test_state, decide(test_state)))
  return (current_value + future_value)

In this formalism, we check to see if the model is sane each time before we use it. The valid_model function determines if the model described in state-s is valid given the observations x.

Creating a function that can validate a model given a world state is no easy problem. The validation function may have to deal with unanticipated model changes, models that are very different than the current one, and models that operate using new ontologies.

It's not totally clear how to define such a validation function, and if we could, that may solve most of the strong AI problem in the first place.

If we don't care about strong improvements to our agent, then we may be able to write a validation function that disallows almost all model changes. By allowing only a small set of understandable changes, we could potentially create agents that we could be certain would not go insane, at the cost of being unable to grow significantly more sane than they start out. This may be a cost we want to pay.
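A deliberately restrictive validator of this kind might look like the sketch below. It assumes, hypothetically, that a model can be summarized as a dictionary of named numeric parameters; the name conservative_valid_model and the max_shift bound are illustrative only.

```python
def conservative_valid_model(current, proposed, max_shift=0.05):
    """Allow only small numeric adjustments to parameters the current model already has."""
    if current.keys() != proposed.keys():
        return False   # added or removed parameters count as a structural change
    return all(abs(proposed[k] - current[k]) <= max_shift for k in current)

current = {"p_heads": 0.70, "p_rain": 0.20}
print(conservative_valid_model(current, {"p_heads": 0.72, "p_rain": 0.20}))                    # True
print(conservative_valid_model(current, {"p_heads": 0.72, "p_rain": 0.20, "p_jackpot": 1.0}))  # False
print(conservative_valid_model(current, {"p_heads": 0.99, "p_rain": 0.20}))                    # False
```

The trade-off is visible in the code: a switch to a new architecture or ontology would change the parameter set and be rejected outright, however well it fit the evidence.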