Abstract
Many of the challenges facing today's reinforcement learning (RL) algorithms, such as robustness, generalization, transfer, and computational efficiency, are closely related to compression. Prior work has convincingly argued why minimizing information is useful in the supervised learning setting, but standard RL algorithms lack an explicit mechanism for compression. The RL setting is unique because (1) its sequential nature allows an agent to use past information to avoid looking at future observations and (2) the agent can optimize its behavior to prefer states where decision making requires few bits. We take advantage of these properties to propose a method (RPC) for learning simple policies. This method brings together ideas from information bottlenecks, model-based RL, and bits-back coding into a simple and theoretically justified algorithm. Our method jointly optimizes a latent-space model and policy to be self-consistent, such that the policy avoids states where the model is inaccurate. We demonstrate that our method achieves much tighter compression than prior methods, obtaining up to 5x higher reward than a standard information bottleneck. We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.