Abstract
We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.