I’ve written a lot about agent models recently. The standard expectation maximization method of modeling agents seems to be subject to several weaknesses, but there also seem to be straightforward approaches to dealing with them (a toy sketch of how these checks might fit together follows the list):
1. To prevent wireheading, the agent needs to understand its own values well enough to predict changes in them.
2. To avoid creating an incorrigible agent, the agent needs to be able to ascribe value to its own intentions.
3. To prevent holodeck addiction, the agent needs to understand how its own perceptions work, and to predict observations as well as outcomes.
4. To prevent the agent from going insane, it must validate its own world-model (as a function of the world-state) before each use.
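To make the shape of these checks concrete, here’s a minimal sketch of an action-selection loop with all four plugged in. It assumes a simple expected-utility planner whose predicted future states include a description of the agent itself; every name in it (`predict`, `current_values`, `model_is_valid`, `intention_value`, and so on) is hypothetical, and each check is only a stub for the corresponding idea above.

```python
def choose_action(agent, world_state, candidate_actions):
    """Toy action selection with the four self-modeling checks sketched in."""
    # 4. Validate the world-model against the current world-state before using it,
    #    rather than planning with a model that has drifted away from reality.
    if not agent.model_is_valid(world_state):
        return agent.fallback_action()  # e.g. do nothing, or ask for help

    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        # The predicted future state includes the agent itself, so changes to the
        # agent's own values, sensors, or reasoning are visible to the planner.
        future = agent.predict(world_state, action)

        # 1. Score the outcome with the agent's *current* values, even if the
        #    predicted future agent's values have been altered (anti-wireheading).
        # 3. The score is a function of the predicted world-state itself, not of
        #    the observations it would produce, so merely fooling the sensors
        #    doesn't raise the score (anti-holodeck-addiction).
        score = agent.current_values(future)

        # 2. Ascribe value to the agent's own intentions, so that being corrected
        #    or modified by its operators is not automatically disvalued.
        score += agent.intention_value(future.agent.intentions)

        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

None of this says how `model_is_valid` or `current_values` should actually be computed; it only shows where each check would sit in the loop.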
The fundamental idea in all of these problems is that you can’t avoid a problem that you can’t see coming. Humans use this concept all the time: many people feel uncomfortable with the idea of wireheading or of going insane, and that discomfort leads them to take actions to avoid those outcomes. I argue that we can create artificial agents that use similar techniques.
The posts linked above showed some simple architectural changes to the usual combination of expectation maximization and a utility function. The proposed changes mostly depend on one tool that I left unexplored: representing the agent in its own model. The agent needs to be able to reason about how changes to the world will affect its own operation, and the more fine-grained this reasoning can be, the more of the above problems the agent can avoid.
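As one toy illustration of what this tool buys, here’s a hypothetical helper that asks whether an action would alter the agent’s own machinery. It assumes the world-state carries an `agent` sub-state (one possible shape for that is sketched after the requirements list below); the names are mine, not a settled interface.

```python
def action_alters_own_operation(model, state, action):
    """Would this action change the agent's own values, sensors, or reasoning?

    Because the agent appears as an object inside its own world-model, the
    predicted next state can be compared to the current one component by
    component; the finer-grained the self-model, the more precisely the agent
    can see (and therefore avoid) the problems listed above.
    """
    future = model.transition(state, action)
    return (future.agent.values != state.agent.values
            or future.agent.sensors != state.agent.sensors
            or future.agent.reasoner != state.agent.reasoner)
```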
Some requirements of the agent’s world-model (one possible shape is sketched in code after this list) are:
- must include a model of the agent’s values
- must include all parts of the world that we care about
- must include the agent’s own sensors and sense methods
- must include the agent’s own thought processes
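Here is one possible shape for such a model, with each field corresponding to one of the requirements above; the names are placeholders rather than a settled design.

```python
from dataclasses import dataclass

@dataclass
class SelfModel:
    values: object       # the agent's own values / utility parameters
    sensors: object      # the agent's sensors and how observations are generated
    reasoner: object     # the agent's own thought processes: inference, planning, current intentions

@dataclass
class WorldModel:
    environment: object  # all parts of the world that we care about
    agent: SelfModel     # the agent as one more object in the modeled world
```

The `state.agent` fields referenced in the earlier sketches are meant to be this `SelfModel`.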
This is a topic that I’m not sure how to think about yet. My learning focus for the next while is going to shift toward how such models are learned (e.g. through reinforcement learning) and how agent self-reflection is currently modeled.