In his paper “The Value Learning Problem”, Nate Soares identifies the problem of ontology shift:
Consider a programmer that wants to train a system to pursue a very simple goal: produce diamond. The programmers have an atomic model of physics, and they generate training data labeled according to the number of carbon atoms covalently bound to four other carbon atoms in that training outcome. For this training data to be used, the classification algorithm needs to identify the atoms in a potential outcome considered by the system. In this toy example, we can assume that the programmers look at the structure of the initial world-model and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis.
The programmer defined what they wanted in an ontology that their system no longer uses, so the programmer’s goals are no longer relevant to what the system is actually interacting with.
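To make the failure mode concrete, here is a minimal Python sketch of the toy example. Everything in it (the Atom, Nucleus, and Electron classes and the count_diamond_carbon classifier) is invented for illustration, not anything from Soares’s paper; it just shows how a classifier hard-coded against an atomic ontology goes silent once the world-model is expressed in terms of protons, neutrons, and electrons.

```python
from dataclasses import dataclass, field

# --- Atomic ontology: the world-model the classifier was hard-coded against ---
@dataclass
class Atom:
    element: str
    bonds: list = field(default_factory=list)  # other Atom objects covalently bound to this one

def count_diamond_carbon(world_model) -> int:
    """Count carbon atoms covalently bound to four other carbon atoms."""
    count = 0
    for obj in world_model:
        if isinstance(obj, Atom) and obj.element == "C":
            carbon_neighbours = [b for b in obj.bonds
                                 if isinstance(b, Atom) and b.element == "C"]
            if len(carbon_neighbours) >= 4:
                count += 1
    return count

# Under the atomic ontology the classifier behaves as intended.
centre = Atom("C")
centre.bonds = [Atom("C") for _ in range(4)]
atomic_world = [centre] + centre.bonds
print(count_diamond_carbon(atomic_world))   # -> 1

# --- Nuclear ontology: the world-model after the ontology shift ---
@dataclass
class Nucleus:
    protons: int
    neutrons: int

@dataclass
class Electron:
    pass

# The same lump of diamond, now described as carbon nuclei (six protons each)
# and electrons, with no Atom objects anywhere in the model.
nuclear_world = ([Nucleus(protons=6, neutrons=6) for _ in range(5)]
                 + [Electron() for _ in range(30)])

# The hard-coded identifier finds no carbon atoms, so every outcome scores 0
# and the system is indifferent between outcomes.
print(count_diamond_carbon(nuclear_world))  # -> 0
```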
To solve this problem, an artificial intelligence would have to notice when it is changing ontologies. In the story, the system knows about carbon as a logical concept, and then abandons that concept when it learns about protons, neutrons, and electrons. On abandoning the concept of carbon (or any other concept), the system could re-evaluate its utility function to see whether the change alters its understanding of anything the utility function refers to.
Intuitively, a system smart enough to work out that carbon is actually a nucleus with six protons should also be able to reflect on the impact of such a discovery on its utility function.
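Here is one hedged sketch of what that re-evaluation step might look like, again with invented names (OLD_ONTOLOGY, TRANSLATIONS, reevaluate_after_shift): the utility function declares which concepts it depends on, and an ontology shift triggers a check that each of those concepts is still grounded in the new ontology, rebinding it if a translation is available. Producing that translation table automatically is, of course, exactly the hard part.

```python
# Hypothetical sketch: detect concepts the utility function needs that the
# new ontology no longer contains, and rebind them where a translation exists.

OLD_ONTOLOGY = {"atom", "carbon", "covalent_bond"}
NEW_ONTOLOGY = {"proton", "neutron", "electron", "nucleus"}

UTILITY_FUNCTION_CONCEPTS = {"carbon", "covalent_bond"}

# Proposed translations of abandoned concepts into the new ontology.
TRANSLATIONS = {
    "carbon": "nucleus with 6 protons (plus bound electrons)",
}

def reevaluate_after_shift(old, new, needed, translations):
    """Return the concepts the utility function needs but cannot rebind."""
    abandoned = (old - new) & needed
    unresolved = []
    for concept in sorted(abandoned):
        if concept in translations:
            print(f"rebind {concept!r} -> {translations[concept]!r}")
        else:
            unresolved.append(concept)
    return unresolved

missing = reevaluate_after_shift(OLD_ONTOLOGY, NEW_ONTOLOGY,
                                 UTILITY_FUNCTION_CONCEPTS, TRANSLATIONS)
print("still unresolved:", missing)   # -> ['covalent_bond']
```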
A more worrying feature of an ontology shift is that it implies an AI may be translating its utility function into its current ontology. The translation operation is unlikely to be obvious, and may allow not just direct translation but also re-interpretation. The translated utility function may not be endorsed by the AI’s original programmer.
This is true even if the utility function is something nice like “figure out what I, your creator, would do if I were smarter, then do that.” The ontology that the agent uses may change, and what “your creator” and “smarter” mean may change significantly.
What we’d like to have is some guarantee that the utility function used after an ontology shift preserves the important parts of the utility function from before the shift. This is true whether the new utility function is an attempt at direct translation or a looser re-interpretation.
One idea for how to do this is to find objects in the new ontology that subjunctively depend upon the original utility function. If it can be shown that the new utility function and the old one are in some sense computing the same logical object, then it may be possible to trust the new utility function before it is put in place.
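As a rough illustration of the shape of such a check, consider the sketch below, where all of the functions (old_utility, new_utility, describe_both_ways, endorse_translation) are hypothetical: before adopting a translated utility function, compare it against the original on outcomes that both ontologies can describe. Passing spot checks like this is far weaker than showing the two functions compute the same logical object, but it points at the kind of agreement we want to establish before the new utility function is put in place.

```python
# Hypothetical consistency check between an original utility function and a
# candidate translation, evaluated on outcomes describable in both ontologies.

def old_utility(outcome_atomic) -> float:
    """Utility defined over the atomic description of an outcome."""
    return float(outcome_atomic["carbon_atoms_in_diamond"])

def new_utility(outcome_nuclear) -> float:
    """Candidate translation, defined over the nuclear description."""
    return float(outcome_nuclear["six_proton_nuclei_in_lattice"])

def describe_both_ways(n_diamond_carbons: int):
    """Render the same outcome in both ontologies (trivial here by construction)."""
    return ({"carbon_atoms_in_diamond": n_diamond_carbons},
            {"six_proton_nuclei_in_lattice": n_diamond_carbons})

def endorse_translation(test_outcomes) -> bool:
    """Accept the translation only if it agrees with the original everywhere we can check."""
    for n in test_outcomes:
        atomic, nuclear = describe_both_ways(n)
        if old_utility(atomic) != new_utility(nuclear):
            return False
    return True

print(endorse_translation([0, 10, 1_000_000]))  # -> True for this toy translation
```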