Merging Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff for Effective Deep Reinforcement Learning

Abstract

Deep reinforcement learning (DRL) on Markov decision processes (MDPs) with continuous action spaces is often approached by directly updating parametric policies along the direction of estimated policy gradients (PGs). Previous research revealed that the performance of these PG algorithms depends heavily on the bias-variance tradeoff involved in estimating and using PGs. A notable approach towards balancing this tradeoff is to merge both on-policy and off-policy gradient estimations for the purpose of training stochastic policies. However, this method cannot be utilized directly by sample-efficient off-policy PG algorithms such as Deep Deterministic Policy Gradient (DDPG) and twin-delayed DDPG (TD3), which have been designed to train deterministic policies. It is hence important to develop new techniques to merge multiple off-policy estimations of deterministic PG (DPG). Driven by this research question, this paper introduces elite DPG, which is estimated differently from conventional DPG to emphasize variance reduction at the expense of increased learning bias. To mitigate the extra bias, policy consolidation techniques are developed to distill policy behavioral knowledge from elite trajectories and use the distilled generative model to further regularize policy training. Moreover, we study both theoretically and experimentally two different DPG merging methods, i.e., interpolation merging and two-step merging, with the aim of inducing varied bias-variance tradeoffs through the combined use of both conventional DPG and elite DPG. Experiments on six benchmark control tasks confirm that these two merging methods can noticeably improve the learning performance of TD3, significantly outperforming several state-of-the-art DRL algorithms.
