Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning

Abstract

Training generally capable agents that thoroughly explore their environment and learn new and diverse skills is a long-term goal of robot learning. Quality Diversity Reinforcement Learning (QD-RL) is an emerging research area that blends the best aspects of both fields -- Quality Diversity (QD) provides a principled form of exploration and produces collections of behaviorally diverse agents, while Reinforcement Learning (RL) provides a powerful performance improvement operator enabling generalization across tasks and dynamic environments. Existing QD-RL approaches have been constrained to sample-efficient, deterministic off-policy RL algorithms and/or evolution strategies, and struggle with highly stochastic environments. In this work, we, for the first time, adapt on-policy RL, specifically Proximal Policy Optimization (PPO), to the Differentiable Quality Diversity (DQD) framework and propose additional improvements over prior work that enable efficient optimization and discovery of novel skills on challenging locomotion tasks. Our new algorithm, Proximal Policy Gradient Arborescence (PPGA), achieves state-of-the-art results, including a 4x improvement in best reward over baselines on the challenging humanoid domain.
