Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
