Conservative State Value Estimation for Offline Reinforcement Learning

Abstract

Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the current learned policy, leading to learning failure in practice. The common approach is to incorporate a penalty term into the reward or value estimation in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function by directly imposing a penalty on OOD states. Compared to prior work, CSVE allows more effective state value estimation with conservative guarantees and, in turn, better policy optimization. Further, we apply CSVE to develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing the states around the dataset, and the actor applies advantage-weighted updates extended with state exploration to improve the policy. We evaluate on the classic continuous control tasks of D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
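The two ingredients named in the abstract, a critic penalty on states around the dataset and advantage-weighted actor updates, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names, the Gaussian perturbation used to obtain nearby states, and the parameters `noise_scale`, `alpha`, `beta`, and `max_weight` are all assumptions made for illustration only.

```python
import numpy as np


def conservative_v_penalty(v_fn, states, noise_scale=0.1, alpha=1.0, rng=None):
    """Hypothetical penalty term: lower V on nearby states, raise it on dataset states.

    Assumption: nearby states are produced by Gaussian perturbation of dataset
    states. The abstract only says states "around" the dataset are sampled.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    nearby = states + noise_scale * rng.standard_normal(states.shape)
    # Positive when V is, on average, higher off-dataset than on-dataset.
    return alpha * (np.mean(v_fn(nearby)) - np.mean(v_fn(states)))


def advantage_weights(q_values, v_values, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights for a weighted policy update.

    Assumption: weights are exp(A / beta), clipped at max_weight for stability.
    """
    advantages = q_values - v_values
    return np.minimum(np.exp(advantages / beta), max_weight)
```

Adding the penalty to an ordinary Bellman error would give a critic loss that discourages high values on states the dataset does not cover, while the weights would scale a log-likelihood actor loss so that actions with higher estimated advantage are imitated more strongly.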
