Abstract
We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction. Our goal is to learn representations that both provide for effective downstream control and invariance to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, which we propose using to learn robust latent representations which encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving SOTA performance. We also test a first-person highway driving task where our method learns invariance to clouds, weather, and time of day. Finally, we provide generalization results drawn from properties of bisimulation metrics, and links to causal inference.
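To make the training objective concrete, the following is a minimal sketch of one way to match latent distances to a bisimulation-style target. It is an illustration under stated assumptions, not the exact implementation: it assumes an L1 distance in latent space, a learned dynamics model that predicts a diagonal Gaussian over next latent states (so the 2-Wasserstein distance has a closed form), and a discount factor `gamma`. All function names here are hypothetical.

```python
import math

def l1_distance(u, v):
    """L1 distance between two latent vectors given as sequences of floats."""
    return sum(abs(a - b) for a, b in zip(u, v))

def gaussian_w2(mu_i, sigma_i, mu_j, sigma_j):
    """Closed-form 2-Wasserstein distance between two diagonal Gaussians.

    Assumption: the next-latent-state distributions are diagonal Gaussians
    parameterized by mean and standard-deviation vectors.
    """
    sq = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    sq += sum((a - b) ** 2 for a, b in zip(sigma_i, sigma_j))
    return math.sqrt(sq)

def bisim_target(r_i, r_j, mu_i, sigma_i, mu_j, sigma_j, gamma=0.99):
    """Bisimulation-style target: reward difference plus discounted
    distance between predicted next-state distributions."""
    return abs(r_i - r_j) + gamma * gaussian_w2(mu_i, sigma_i, mu_j, sigma_j)

def bisim_loss(z_i, z_j, target):
    """Squared error between the latent distance and the target.

    In a real training loop the target would be treated as a constant
    (no gradient) and z_i, z_j would come from the encoder.
    """
    return (l1_distance(z_i, z_j) - target) ** 2

# Usage: two states with different rewards but identical predicted dynamics.
z_a, z_b = [0.0, 0.0], [1.0, 1.0]
target = bisim_target(1.0, 0.0, [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
loss = bisim_loss(z_a, z_b, target)
```

Minimizing this loss over pairs of sampled states pushes the encoder to place states close together exactly when they yield similar rewards and similar predicted futures, which is how task-irrelevant details such as backgrounds can be discarded.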