Abstract
A fundamental challenge in reinforcement learning is to learn policies that generalize beyond the operating domains experienced during training. In this paper, we approach this challenge through the following invariance principle: an agent must find a representation such that there exists an action-predictor built on top of this representation that is simultaneously optimal across all training domains. Intuitively, the resulting invariant policy enhances generalization by finding causes of successful actions. We propose a novel learning algorithm, Invariant Policy Optimization (IPO), that implements this principle and learns an invariant policy during training. We compare our approach with standard policy gradient methods and demonstrate significant improvements in generalization performance on unseen domains for linear quadratic regulator and grid-world problems, and an example where a robot must learn to open doors with varying physical properties.