Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

Abstract

Maximum entropy deep reinforcement learning (RL) methods have been demonstrated on a range of challenging continuous tasks. However, existing methods either suffer from severe instability when training on large off-policy data or cannot scale to tasks with very high state and action dimensionality, such as 3D humanoid locomotion. Moreover, the optimality of the desired Boltzmann policy, when it is set with respect to a non-optimal soft value function, is not sufficiently justified. In this paper, we first derive the soft policy gradient from the entropy-regularized expected reward objective for RL with continuous actions. Then, we present an off-policy, model-free, actor-critic maximum entropy deep RL algorithm called deep soft policy gradient (DSPG), which combines the soft policy gradient with the soft Bellman equation. To ensure stable learning while eliminating the need for two separate critics for the soft value functions, we leverage a double sampling approach to make the soft Bellman equation tractable. The experimental results demonstrate that our method outperforms prior off-policy methods.
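
For orientation, the entropy-regularized objective and soft Bellman equation named above are commonly written as follows. This is a sketch in the standard maximum entropy RL notation, not necessarily the notation used in the paper: the temperature \alpha, the discount factor \gamma, and the symbols Q and V for the soft action-value and soft state-value functions are assumptions introduced here for illustration.

```latex
% Entropy-regularized expected reward objective (standard form; notation assumed)
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
         \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\,\mathbb{E}_{a_t \sim \pi}\big[ \log \pi(a_t \mid s_t) \big]

% Soft Bellman equation relating the soft action-value and soft state-value functions
Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[ V(s_{t+1}) \big],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big]
```

Setting \alpha = 0 recovers the usual expected discounted return and the standard Bellman equation, which is one way to see the entropy term as a regularizer on the policy.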
