Addressing Value Estimation Errors in Reinforcement Learning with a State-Action Return Distribution Function

Abstract

In current reinforcement learning (RL) methods, function approximation errors are known to lead to overestimated or underestimated state-action values Q, which in turn lead to suboptimal policies. We show that learning a state-action return distribution function can improve the estimation accuracy of the Q-value. We combine the distributional return function within the maximum entropy RL framework to develop what we call the Distributional Soft Actor-Critic algorithm (DSAC), an off-policy method for continuous control settings. Unlike traditional distributional Q algorithms, which typically learn only a discrete return distribution, DSAC can directly learn a continuous return distribution by truncating the difference between the target and current return distribution to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
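To make the truncation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that the return distribution is modeled as a Gaussian with a learned mean and standard deviation; the abstract does not specify the distribution family. The function names `clip_target_return` and `gaussian_nll_grad_mean` and the parameter `bound` are hypothetical. The sketch shows how limiting the gap between a sampled target return and the current mean keeps the gradient of the negative log-likelihood bounded when the predicted standard deviation is small.

```python
import numpy as np


def clip_target_return(target_return, current_mean, bound):
    """Truncate a target return to lie within +/- bound of the current mean.

    `bound` is a hypothetical clipping hyperparameter (assumption).
    """
    diff = np.clip(target_return - current_mean, -bound, bound)
    return current_mean + diff


def gaussian_nll_grad_mean(target, mean, std):
    """Gradient of the Gaussian negative log-likelihood with respect to the mean.

    For NLL = (target - mean)^2 / (2 * std^2) + log(std) + const,
    d(NLL)/d(mean) = -(target - mean) / std^2.
    The Gaussian form is an assumption made for this illustration.
    """
    return -(target - mean) / std**2


if __name__ == "__main__":
    current_mean, current_std = 1.0, 0.1  # current return distribution (assumed Gaussian)
    raw_target = 50.0                     # an outlying sampled target return
    bound = 3.0                           # hypothetical clipping bound

    clipped_target = clip_target_return(raw_target, current_mean, bound)
    raw_grad = gaussian_nll_grad_mean(raw_target, current_mean, current_std)
    clipped_grad = gaussian_nll_grad_mean(clipped_target, current_mean, current_std)

    print("clipped target:", clipped_target)
    print("gradient without truncation:", raw_grad)
    print("gradient with truncation:", clipped_grad)
```

In this toy setting the gradient magnitude scales with the target-to-mean gap divided by the variance, so a far-off target combined with a small standard deviation produces a very large update unless the gap is truncated.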
