Observation Interference in Partially Observable Assistance Games

Abstract

We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human's observations? First, we prove that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Though this result seems to contradict the classic theorem from single-agent decision making that the value of perfect information is nonnegative, we resolve this seeming contradiction by developing a notion of interference defined on entire policies. This can be viewed as an extension of the classic result that the value of perfect information is nonnegative into the cooperative multiagent setting. Second, we prove that if the human is simply making decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human's preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.
