Measuring and avoiding side effects using relative reachability

Abstract

How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environment? We argue that current approaches to penalizing side effects can introduce bad incentives in tasks that require irreversible actions, and in environments that contain sources of change other than the agent. For example, some approaches give the agent an incentive to prevent any irreversible changes in the environment, including the actions of other agents. We introduce a general definition of side effects, based on relative reachability of states compared to a default state, that avoids these undesirable incentives. Using a set of gridworld experiments illustrating relevant scenarios, we empirically compare relative reachability to penalties based on existing definitions and show that it is the only penalty among those tested that produces the desired behavior in all the scenarios.
