Population-Guided Parallel Policy Search for Reinforcement Learning

Abstract

In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). In the proposed scheme, multiple identical learners, each with its own value function and policy, share a common experience replay buffer and collaboratively search for a good policy under the guidance of the best policy's information. The key point is that the information of the best policy is fused in a soft manner, by constructing an augmented loss function for the policy update, so as to enlarge the overall region searched by the multiple learners. The guidance by the previous best policy and the enlarged search range enable faster and better policy search. Monotone improvement of the expected cumulative return under the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm. Numerical results show that the constructed algorithm outperforms most of the current state-of-the-art RL algorithms, and that the gain is significant in sparse-reward environments.
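The soft fusion described above can be illustrated with a minimal sketch. The abstract does not give the form of the augmented loss, so everything specific here is an assumption made for illustration: the guidance term is taken to be a squared Euclidean distance between a learner's actions and the previous best policy's actions, and the names `augmented_policy_loss`, `select_best`, and `beta` are hypothetical.

```python
import numpy as np


def augmented_policy_loss(q_values, actions, best_actions, beta, is_best):
    """Hypothetical augmented policy loss for one learner on a minibatch.

    q_values:     shape (B,), critic estimates for the learner's own actions
    actions:      shape (B, A), actions from this learner's policy
    best_actions: shape (B, A), actions from the previous best policy
    beta:         non-negative weight on the guidance term (assumed name)
    is_best:      True if this learner is itself the current best policy
    """
    # Deterministic policy-gradient objective: maximize Q, i.e. minimize -Q.
    base_loss = -np.mean(q_values)
    if is_best:
        # The best learner receives no guidance term.
        return base_loss
    # Assumed soft guidance: mean squared distance to the best policy's actions.
    distance = 0.5 * np.mean(np.sum((actions - best_actions) ** 2, axis=1))
    return base_loss + beta * distance


def select_best(recent_returns):
    """Return the index of the learner with the highest recent return."""
    return int(np.argmax(recent_returns))


if __name__ == "__main__":
    q = np.array([1.0, 3.0])
    a = np.array([[0.0, 0.0], [1.0, 1.0]])
    a_best = np.zeros((2, 2))
    print(augmented_policy_loss(q, a, a_best, beta=0.5, is_best=False))
    print(select_best([1.0, 5.0, 3.0]))
```

In this sketch a small `beta` lets learners explore far from the best policy, while a large `beta` pulls them toward it; how that weight is actually chosen or adapted is not stated in the abstract.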
