Abstract
In current reinforcement learning (RL) methods, function approximation errors are known to cause overestimation or underestimation of the state-action values Q, which in turn leads to suboptimal policies. We show that learning a state-action return distribution function can improve the estimation accuracy of the Q-value. We embed the distributional return function in the maximum entropy RL framework to develop the Distributional Soft Actor-Critic algorithm (DSAC), an off-policy method for continuous control settings. Unlike traditional distributional Q algorithms, which typically learn only a discrete return distribution, DSAC can directly learn a continuous return distribution by truncating the difference between the target and current return distributions to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
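To illustrate why truncating the difference between the target and the current return estimate can limit gradient size, the following is a minimal sketch and not the DSAC loss itself. It assumes, purely for illustration, that the return distribution is a Gaussian with a learned mean and standard deviation trained by negative log-likelihood; the function names and the clipping bound `bound` are hypothetical and do not come from the abstract.

```python
import math


def clip_target(target, mean, bound):
    """Clamp the target so that it lies within `bound` of the current mean.

    `bound` is a hypothetical hyperparameter chosen for this sketch.
    """
    diff = max(-bound, min(bound, target - mean))
    return mean + diff


def gaussian_nll(target, mean, std):
    """Negative log-likelihood of `target` under N(mean, std^2) (assumed model)."""
    return 0.5 * math.log(2 * math.pi * std ** 2) + (target - mean) ** 2 / (2 * std ** 2)


def nll_grad_mean(target, mean, std):
    """Gradient of the Gaussian NLL with respect to the mean."""
    return -(target - mean) / std ** 2


# An outlying target far from the current estimate.
mean, std, target, bound = 0.0, 0.5, 50.0, 10.0

raw_grad = nll_grad_mean(target, mean, std)
clipped_grad = nll_grad_mean(clip_target(target, mean, bound), mean, std)

print(raw_grad)      # gradient magnitude grows with |target - mean|
print(clipped_grad)  # gradient magnitude is capped at bound / std^2
```

Under these assumptions the gradient with respect to the mean scales with the difference between target and mean divided by the variance, so a small variance combined with an outlying target produces a very large update. Clamping the difference caps that magnitude at `bound / std ** 2`.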