Behavior-Guided Reinforcement Learning

Abstract

We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over trajectories that can be in turn used to lead policy optimization towards (or away from) (un)desired behaviors. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open source demo.
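
To make the role of the dual formulation concrete, here is a minimal sketch of an entropy-smoothed WD between two sets of behavioral embeddings. It is an illustration under stated assumptions, not the method described above: the function name `smoothed_wd_dual`, the squared Euclidean cost, the uniform weights, and the smoothing parameter `gamma` are all hypothetical choices, and the dual potentials are found with Sinkhorn-style alternating updates rather than with stochastic gradient steps on learned score functions. The returned potentials `f` and `g` play the role of per-sample scores.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def smoothed_wd_dual(x, y, gamma=0.1, n_iters=200):
    """Entropy-smoothed WD between uniform empirical measures on x and y.

    x, y: arrays of shape (n, d) and (m, d) holding behavioral embeddings
    (assumed to be given; how they are computed is outside this sketch).
    Returns (dual_value, f, g), where f and g are the dual potentials.
    """
    n, m = len(x), len(y)
    # Assumed cost: squared Euclidean distance between embeddings.
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_a = np.full(n, -np.log(n))  # uniform weights on x (assumption)
    log_b = np.full(m, -np.log(m))  # uniform weights on y (assumption)
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):
        # Exact coordinate-wise maximization of the smoothed dual objective.
        f = -gamma * _logsumexp((g[None, :] - cost) / gamma + log_b[None, :], axis=1)
        g = -gamma * _logsumexp((f[:, None] - cost) / gamma + log_a[:, None], axis=0)
    # After the g-update the exponential penalty term of the dual is constant,
    # so the dual objective reduces to the mean of the potentials.
    dual_value = f.mean() + g.mean()
    return dual_value, f, g
```

In a behavior-guided setting, a quantity like `dual_value` could serve as a regularizer that is added to (or subtracted from) the reward objective, depending on whether a policy should move away from or towards a reference behavior.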
