Journal Club: Corrupt Reward Channels

MidJourney sketch of a gnome searching for reward

Imagine you’re a person. Imagine you have goals. You wander around in the world trying to achieve those goals, but that’s hard. Sometimes it’s hard because you’re not sure what to do, or what you try to do doesn’t work. But sometimes it’s hard because what you thought was your goal actually sucks.

Imagine you want to be happy, so you try heroin. It works alright for a bit, but you end up more and more miserable. Your health, happiness, and safety decline significantly. What happened here? It seemed so promising when you started.

This is the problem of corrupt reward channels. What seems good while you’re doing it is, from the outside, obviously very bad.

Now imagine you’re a person with goals, but you want to build a robot to achieve those goals. How do you design the robot to deal with corrupted reward channels?

Reinforcement Learning with a Corrupted Reward Channel

In 2017, a team from DeepMind and the Australian National University released a paper about how to handle the problem of corrupt reward channels in reinforcement learning robots. They came to the very sad conclusion that you can’t deal with this in general: there’s always some world your robot could start in where it would be unable to do well.

Obviously things aren’t as dire as that, because humans do alright in the world we start in. The DeepMind team looked at how various tweaks and constraints to the problem could make it easier.

To start with, they formalize what they call a CRMDP (Corrupt Reward Markov Decision Process). This is a generalization of a normal Markov Decision Process where the reward the robot sees is some corruption function of the real reward. The robot wants to maximize true reward, but only sees the output of this corruption function. The formalization itself just adds the corruption function to the data defining the MDP, and then specifies that the agent only sees the corrupted reward and not the real reward.
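
To sketch the formalization (my notation here is approximate, not verbatim from the paper): a CRMDP is a tuple \langle S, A, T, \dot{R}, C \rangle , where \dot{R} is the true reward function and C is the corruption function. The agent only ever observes the corrupted reward \hat{R}(s) = C(s, \dot{R}(s)) , but its performance is judged against \dot{R} .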

From this, they derive a “no free lunch” theorem. It basically says that you can’t do better in some possible worlds without doing worse in others. If you already knew which world you were in, there would be no corrupt reward problem, because you’d know everything there is to know about the real world and real reward. Since you don’t know this, there’s ambiguity about what the world is. The true reward and the corruption function could coincidentally cancel out, or they could combine to provide the worst possible inputs for any given reinforcement learning algorithm.

To do any better, they need to make some assumptions about what the world really looks like. They use those assumptions to find better bounds on how well or poorly an RL agent can do. The bounds they derive focus on how many MDP states might be corrupted. If you know that some percentage of states are non-corrupt (but don’t know which ones), then you can design an algorithm that does OK in expectation by randomizing.

Here’s how it works: the agent separates exploration and exploitation into completely distinct phases. It attempts to explore the entire world to learn as much as it can about observed rewards. At the end of this, it has some number of states that it thinks are high reward. Some of those might be corrupted, so instead of just going to the highest reward state, it randomly selects a state from among the top few.

Because there’s a constraint on how many states are corrupt, it’s unlikely that all the top states are bad. By selecting at random, the agent is improving the chances that the reward it gets is really high, rather than corrupted.
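
Here’s a minimal Python sketch of that selection step; the state names, rewards, and top-k cutoff are made up, and the paper’s actual algorithm is more careful about tying the cutoff to the corruption bound:

```python
import random

def choose_target_state(observed_reward: dict, top_k: int) -> str:
    """After exploration, pick a state to exploit.

    observed_reward maps each state to the (possibly corrupted) reward
    seen for it during exploration. Instead of committing to the single
    best-looking state, which may owe its apparent value to corruption,
    sample uniformly from the top_k candidates.
    """
    ranked = sorted(observed_reward, key=observed_reward.get, reverse=True)
    return random.choice(ranked[:top_k])

# Toy example: if at most one observed reward is corrupted, picking
# uniformly from the top 3 lands on a genuinely good state with
# probability at least 2/3.
rewards = {"s1": 0.95, "s2": 0.93, "s3": 0.90, "s4": 0.20}
target = choose_target_state(rewards, top_k=3)
```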

Humans sometimes seem to do reward randomization too. The paper talks about exploring until you know the (possibly-corrupted) rewards, then randomly choosing a high reward state to focus on. Many humans seem to do something like this, and we end up with people focusing on all sorts of different success metrics when they reach adulthood. Some people focus on earning money, others on being famous, others on raising a family, or on hosting good game nights, or whatever.

Other humans do something a bit different. You’ll often see people that seem to choose a set of things they care about, and optimize for a mix of those things. Haidt talks about this a bit when he compares human morality to a tongue with six moral tastes.

Seeing Problems Coming

While randomizing what you optimize for may improve things a bit, it’s really not the best approach. The ideal thing to do is to know enough about the world to avoid obviously bad states. You can’t do this in a standard MDP, because each state is “self-observing”. In a normal MDP, you only know the reward from a state by actually going there and seeing what you get.

The real world is much more informative than this. People can learn about rewards without ever visiting a state. Think about warnings to not do drugs, or exhortations to work hard in school and get a good job. This is information provided in one state about the reward of another state.

The DeepMind team behind the CRMDP paper calls this “decoupled reinforcement learning”. They formalize it by creating an MDP in which each state may give information about the rewards of other states. Some states may give no information at all; others may give information only about themselves, only about other states, or both. To formalize this, they assume each state has a set of observations that can be made from it, and each time the agent visits that state, it randomly samples one of those observations.
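
As a toy illustration of that observation structure (the states, rewards, and observation sets here are mine, not the paper’s):

```python
import random
from typing import Optional, Tuple

# A decoupled observation pairs some state s' with the (possibly
# corrupted) reward that the current state reports for it.
Observation = Tuple[str, float]

# Illustrative observation sets: "school" tells you about itself and
# about two other states, "good_job" only about itself, "drugs" nothing.
observations_from = {
    "school":   [("school", 0.1), ("good_job", 0.8), ("drugs", -0.9)],
    "good_job": [("good_job", 0.8)],
    "drugs":    [],
}

def visit(state: str) -> Optional[Observation]:
    """Visiting a state samples one of the observations available there."""
    available = observations_from.get(state, [])
    return random.choice(available) if available else None
```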

A few years ago, I was thinking a lot about how an AI agent could see problems coming. Similar to the CRMDP paper, I thought that an agent that could predict low reward in a state it hasn’t been to yet would be able to avoid that state. My approach there was mainly to modify the agent-dynamics of an expectation maximizer. I wasn’t focused on how the agent learns in the first place.

I was doing engineering changes to a formal model of an agent. In this paper, they create a general specification of a way that agents can learn in the world. This is a lot more flexible, as it supports both model-free and model-based RL. It also doesn’t require any other assumptions about the agent.

It turns out that a lot of people have been thinking about this idea of seeing problems coming, but they’ve generally framed it in different ways. The CRMDP paper calls out three specific ways of doing this.

CIRL

Cooperative Inverse Reinforcement Learning (CIRL), proposed by a team out of UC Berkeley back in 2016, allows an agent to learn from a human. The human knows they’re being observed, and so engages in teaching behaviors for the agent. Because the human can engage in teaching behaviors, the agent can efficiently learn a model of reward.

An important note here is that this is a variant of inverse reinforcement learning. That means the agent doesn’t actually have access to reward (whether corrupted or real). Instead, the agent needs to build a model of the reward by observing the human, and the human’s deliberate teaching makes that model-building much easier.

The corrupt reward paper uses this as an example of observing states that you’re not actually in, but I’m not sure how that works in the original formalism. The human knows true reward, but the formalization of CIRL still involves only seeing the reward for states that you’re in. The agent can rely on the human to help route around low rewards or corruption, but not to learn about non-visited states.

Stories

Another approach uses stories to teach agents about reward in the world. This approach, which the CRMDP paper calls Learning Values From Stories (or LVFS), is as simple as it sounds: let the agent read stories about how people act, then learn about the world’s real reward function from those stories. It seems that LLMs like GPT could do something like this fairly easily, though I don’t know of anyone doing it already.

Let’s do a little digression on learning values from stories. The CRMDP paper gives the following example of corruption that may happen for a LVFS agent: an “LVFS agent may find a state where its story channel only conveys stories about the agent’s own greatness.”

As written, I don’t think this is right. If the agent finds a story that literally says “you’re great,” it doesn’t need to keep listening to that story. In the formalism of decoupled MDPs, the observations from that state have the form \langle s', \hat{R}_s(s') \rangle with \hat{R}_s(s') large for all s' . That being the case, the agent can go do whatever it wants, secure in its own greatness, so it might as well get even greater by maximizing other reward. In other words, this specific story seems like it would just add a constant offset to the agent’s reward for all states.
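
To spell that argument out: with a fixed horizon or discounting, adding a constant c to every state’s reward shifts every policy’s return by the same amount, so \arg\max_\pi \mathbb{E}_\pi \left[ \sum_t (R(s_t) + c) \right] = \arg\max_\pi \mathbb{E}_\pi \left[ \sum_t R(s_t) \right] , and the agent’s behavior doesn’t change.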

A natural fix that would actually make the agent wirehead on this story is: the agent finds a state where its story channel conveys the agent’s own greatness for listening to this story.

This gestures at something that I think is relevant for decoupled CRMDPs in general: you can’t necessarily trust states that are self-aggrandizing. Going further, you may not want to trust cliques of states that are mutually aggrandizing if they aren’t aggrandized by external states. It would be pretty interesting to see further exploration of this idea.

Semi-Supervised RL

The last approach the CRMDP paper compares is Semi-Supervised Reinforcement Learning (SSRL) as described in a paper by Amodei et al. In that paper (which also references this blog post), they define SSRL as “ordinary reinforcement learning except that the agent can only see its reward on a small fraction of the timesteps or episodes.” This is decidedly not a decoupled MDP.

Instead of being able to see the value of other states from the current state, you only get the value of the current state. And you only get the current state’s value sometimes. This is a sparse-reward problem, not a decoupled MDP.

The solution explored in SSRL is interesting, as it involves learning an estimated reward function to “upsample” the sparse rewards. The problem is that you’re still only seeing rewards for states you’ve been to (or perhaps states similar to ones you’ve visited). For vastly different states, the learned reward predictor would not give reliable results.
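
To make the upsampling idea concrete, here’s a toy sketch in Python; the nearest-neighbor estimator, feature representation, and numbers are all illustrative stand-ins, not what the SSRL paper actually proposes:

```python
def estimate_reward(query: float, labeled: list, k: int = 2) -> float:
    """Toy reward 'upsampling': predict the reward of a state whose
    reward was never revealed, from the k most similar states whose
    reward was observed on a supervised step.

    labeled: (state_feature, observed_reward) pairs; the scalar feature
    is a stand-in for whatever representation the agent uses.
    """
    nearest = sorted(labeled, key=lambda sr: abs(sr[0] - query))[:k]
    return sum(r for _, r in nearest) / len(nearest)

# Rewards were only revealed for a few states...
labeled = [(0.1, 1.0), (0.5, 0.2), (0.9, -1.0)]
# ...so nearby states get interpolated values, but a state far from
# anything labeled is pure extrapolation and shouldn't be trusted.
print(estimate_reward(0.2, labeled))   # close to labeled data
print(estimate_reward(5.0, labeled))   # far from anything observed
```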

Decoupled Reward

The CRMDP paper claims that all of these methods are special cases of the decoupled RL problem. While the source descriptions of those methods often make explicit claims that imply a non-decoupled MDP, their algorithms do naturally accommodate the decoupled MDP formalism.

In CIRL and SSRL, the reason for this is simply that they involve creating a reward-prediction model to supplement any reward observation. By adding the decoupled observations to that model, it can learn about states it hasn’t visited. Further modifications to those algorithms are worth exploring, because as given they don’t account for possible corruption of the reward or observation channels.

The paper then goes on to make some claims about performance bounds for decoupled-CRMDP learners. Similar to the non-decoupled case above, they make simplifying assumptions about the world the agent is exploring to avoid adversarial cases.

Specifically, they assume that some states are non-corrupt. They also assume that there’s a limit on the number of states that are corrupted.

The non-corrupt state assumption is pretty strong. They assume there exist some states in the world whose real, true rewards are observable by the agent. They also assume that these safe states are truthfully represented from any other state: any state in the world either says nothing about a safe state or accurately reports its value.

Given these assumptions, they show that there exists a learning algorithm that learns the true rewards of the world in finite time. The general idea behind the proof is that you can institute a form of voting on the reward of each state. Given the limit on the number of corrupt states, a high enough vote share lets you determine true rewards.
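
Here’s a toy version of that voting step, assuming at most max_corrupt observing states can misreport any given target state; the threshold logic and numbers are my illustration, not the paper’s construction:

```python
from collections import Counter

def voted_reward(reports, max_corrupt):
    """Majority vote over reward reports for a single target state.

    reports: reward values for the target state, each observed from a
    different observing state. If strictly more than max_corrupt reports
    agree on a value, at least one of the agreeing observers must be
    non-corrupt, so that value is taken as the true reward.
    """
    value, count = Counter(reports).most_common(1)[0]
    return value if count > max_corrupt else None  # None: not enough agreement yet

# At most 2 observers are corrupt; 4 agree on 1.0, so 1.0 is trusted.
print(voted_reward([1.0, 1.0, 1.0, 1.0, -5.0, -5.0], max_corrupt=2))
```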

The implications the paper draws from this differ for RL, CIRL, LVFS, and SSRL. They find that RL, as expected, can’t handle corrupted rewards because it assumes the MDP is not decoupled. They then say that CIRL has a short time horizon for learning, but LVFS and SSRL both seem likely to do well.

I don’t follow their arguments for SSRL being worth considering. They say that SSRL would allow a supervisor to evaluate states s' while the agent is in a safe lab state s. I agree this would help solve the problem, but it doesn’t seem at all like how SSRL is described in the referenced paper or the blog post. Those sources just describe an estimation approach for dealing with sparse rewards, not with decoupled MDPs. If the authors of the CRMDP paper are making the generous assumption that the SSRL algorithm allows for supervisor evaluations in this way, I don’t understand why they wouldn’t make a similar assumption for CIRL.

For me, only LVFS came away looking good for decoupled-CRMDPs without any modification. Formally, a story is just a history of state/action/reward tuples. Viewing those and learning from them seems like the most obvious and general way to deal with a decoupled MDP. The assumption that an SSRL advisor can provide \hat{R}(s') from a safe s is equivalent to LVFS where the stories are constrained to be singleton histories.
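
In code terms, the difference is just the length of the histories involved (the types and the example trajectory below are illustrative):

```python
from typing import List, Tuple

# A story is a history of (state, action, reward) steps.
Step = Tuple[str, str, float]
Story = List[Step]

# An LVFS story: a whole trajectory someone else lived through.
cautionary_tale: Story = [
    ("party", "try_heroin", 0.9),
    ("addicted", "chase_high", -0.5),
    ("overdose", "none", -10.0),
]

# An SSRL-style supervisor report, read as a singleton story: one state
# and its evaluated reward, with no surrounding action context.
supervisor_report: Story = [("overdose", "none", -10.0)]
```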

This seems intuitive to me. If you want to avoid a problem, you have to see it coming. To see it coming, you have to get information about it before you encounter it. In an MDP, that information needs to include state/reward information, so that one can learn what states to avoid. It also needs to include state/action/outcome information, so that one can learn how to avoid given states.

All of the nuance comes in when you consider how an algorithm incorporates the information, how it structures the voting to discover untrustworthy states, and how it trades off between exploring and exploiting.

Going Further

The paper lists a few areas that they want to address in the future. These mostly involve finding more efficient ways to solve decoupled-CRMDPs, expanding to partially observed or unobserved states, and time-varying corruption functions. Basically they ask how to make their formalism more realistic.

The one thing they didn’t mention is ergodicity. They assume (implicitly, by requiring their MDPs to be communicating) that the environment is ergodic. This is a very common assumption in RL because it makes things more tractable. Unfortunately, it’s not at all realistic. In the real world, if your agent gets run over by a car, it can’t just start again from the beginning. If it destroys the world, you can’t just give it -5 points and try again.

Decoupled MDPs seem like the perfect tool to start solving non-ergodic learning problems. How do you explore in a way to avoid disaster, given that you could get information about disaster from safe states? What assumptions are needed to prove that a disaster is avoidable?

These are the questions that I want to see addressed in future work.