Towards Simplicity in Deep Reinforcement Learning: Streamlined Off-Policy Learning

Abstract

The field of Deep Reinforcement Learning (DRL) has recently seen a surge in the popularity of maximum entropy reinforcement learning algorithms. Their popularity stems from the intuitive interpretation of the maximum entropy objective and their superior sample efficiency on standard benchmarks. In this paper, we seek to understand the primary contribution of the entropy term to the performance of maximum entropy algorithms. For the Mujoco benchmark, we demonstrate that the entropy term in Soft Actor-Critic (SAC) principally addresses the bounded nature of the action spaces. With this insight, we propose a simple normalization scheme which allows a streamlined algorithm without entropy maximization to match the performance of SAC. Our experimental results demonstrate a need to revisit the benefits of entropy regularization in DRL. We also propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. We further show that the streamlined algorithm with the simple non-uniform sampling scheme outperforms SAC and achieves state-of-the-art performance on challenging continuous control tasks.
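
To make the bounded-action issue concrete, here is a minimal sketch of how raw policy outputs might be normalized before a tanh squashing step. The abstract does not spell out the normalization scheme, so the rule used here (divide by the mean absolute value of the outputs when it exceeds 1) is an assumption for illustration, and the function name `normalize_then_squash` and parameter `max_action` are hypothetical.

```python
import math


def normalize_then_squash(pre_activations, max_action=1.0):
    """Rescale raw policy outputs before tanh so they are less likely to saturate.

    Assumption (not stated in the abstract): when the mean absolute value of
    the raw outputs exceeds 1, every component is divided by that mean.
    """
    if not pre_activations:
        return []
    mean_abs = sum(abs(x) for x in pre_activations) / len(pre_activations)
    # Only shrink outputs; small outputs are passed through unchanged.
    scale = mean_abs if mean_abs > 1.0 else 1.0
    return [max_action * math.tanh(x / scale) for x in pre_activations]


# Large raw outputs push plain tanh squashing toward the action bounds.
raw = [6.0, -2.0, 4.0]
squashed_only = [math.tanh(x) for x in raw]
normalized = normalize_then_squash(raw)

print("tanh only:      ", squashed_only)
print("normalized+tanh:", normalized)
```

With plain tanh squashing, large raw outputs land almost exactly on the action bounds, which limits exploration once noise is added. Under the assumed rule, the same outputs are pulled back toward the interior of the action range.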
