DCE: Offline Reinforcement Learning With Double Conservative Estimates

Abstract

Offline reinforcement learning has attracted much interest as a way to address the application challenges of traditional reinforcement learning. It uses previously collected datasets to train agents without any interaction with the environment. To address the overestimation of out-of-distribution (OOD) actions, conservative estimation methods assign a low value to all inputs. Previous conservative estimation methods usually struggle to avoid the impact of OOD actions on Q-value estimates. In addition, these algorithms usually sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, double conservative estimates (DCE), which uses two conservative estimation methods to constrain the policy. Our algorithm introduces a V-function to avoid errors on in-distribution actions while implicitly achieving conservative estimation. In addition, our algorithm uses a controllable penalty term to change the degree of conservatism during training. We theoretically show how this method influences the estimation of OOD actions and in-distribution actions. Our experiments show separately how the two conservative estimation methods affect the estimates for all state-action pairs. DCE demonstrates state-of-the-art performance on D4RL.
