Wasserstein Unsupervised Reinforcement Learning

Abstract

Unsupervised reinforcement learning aims to train agents to learn a handful of policies or skills in environments without external reward. These pre-trained policies can accelerate learning when endowed with external reward, and can also be used as primitive options in hierarchical reinforcement learning. Conventional approaches to unsupervised skill discovery feed a latent variable to the agent and shed its empowerment on the agent's behavior by mutual information (MI) maximization. However, the policies learned by MI-based methods cannot sufficiently explore the state space, even though they can be successfully distinguished from each other. We therefore propose a new framework, Wasserstein unsupervised reinforcement learning (WURL), in which we directly maximize the distance between the state distributions induced by different policies. Additionally, we overcome difficulties in simultaneously training N (N > 2) policies and in amortizing the overall reward to each step. Experiments show that policies learned by our approach outperform MI-based methods on the metric of Wasserstein distance while keeping high discriminability. Furthermore, the agents trained by WURL can sufficiently explore the state space in mazes and MuJoCo tasks, and the pre-trained policies can be applied to downstream tasks by hierarchical learning.
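
To make the central quantity concrete, the following is a minimal sketch of measuring the Wasserstein distance between state samples collected from several policies. It is an illustration under simplifying assumptions, not the estimator used in the paper: states are one-dimensional, every policy contributes the same number of samples, and the names `wasserstein_1d` and `pairwise_distances` are hypothetical. It relies on the standard fact that, for two equal-size one-dimensional empirical distributions, the 1-Wasserstein distance equals the mean absolute difference of the sorted samples.

```python
import numpy as np


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    Assumption for this sketch: both samples have the same length, so the
    optimal coupling simply matches the sorted values in order.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(xs - ys)))


def pairwise_distances(state_samples):
    """Symmetric matrix of distances between the state samples of N policies."""
    n = len(state_samples)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = wasserstein_1d(state_samples[i], state_samples[j])
            dist[i, j] = dist[j, i] = d
    return dist


# Hypothetical stand-in for states visited by three policies (N > 2):
# each "policy" concentrates on a different region of a 1-D state space.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, scale=1.0, size=500) for m in (-2.0, 0.0, 3.0)]
dist = pairwise_distances(samples)
```

A larger entry in `dist` means the two policies visit more distant regions of the state space, which is the property the framework seeks to maximize. How these pairwise distances are turned into per-policy, per-step rewards is not specified in the abstract and is left out of this sketch.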
