Contextual Conservative Q-Learning for Offline Reinforcement Learning

Abstract

Offline reinforcement learning learns an effective policy from offline datasets without online interaction, and it attracts persistent research attention due to its potential for practical application. However, extrapolation error generated by distribution shift still leads to overestimation for actions that transition to out-of-distribution (OOD) states, which degrades the reliability and robustness of the offline policy. In this paper, we propose Contextual Conservative Q-Learning (C-CQL) to learn a robustly reliable policy through the contextual information captured via an inverse dynamics model. With the supervision of the inverse dynamics model, it tends to learn a policy that generates stable transitions at perturbed states, since perturbed states are a common kind of OOD state. In this manner, we make the learnt policy more likely to generate transitions that lead to the empirical next-state distribution of the offline dataset, i.e., robustly reliable transitions. Besides, we theoretically reveal that C-CQL is a generalization of Conservative Q-Learning (CQL) and aggressive State Deviation Correction (SDC). Finally, experimental results demonstrate that the proposed C-CQL achieves state-of-the-art performance in most environments of the offline Mujoco suite and a noisy Mujoco setting.
