Conservative Q-Learning for Offline Reinforcement Learning

Abstract

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a principled policy improvement procedure. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
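To make the idea of adding a Q-value regularizer to the Bellman error concrete, the following is a minimal sketch for a discrete action space. The abstract does not give the exact form of the regularizer, so the sketch assumes one common choice: a log-sum-exp over all actions' Q-values minus the Q-value of the action stored in the dataset, which pushes down values of actions the data does not support. The function names, the weight `alpha`, the list-based batch layout, and the max-based target are illustrative assumptions, not the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def conservative_q_loss(q_values, next_q_values, actions, rewards, dones,
                        gamma=0.99, alpha=1.0):
    """Mean Bellman error plus an (assumed) conservative penalty over a batch.

    q_values[i][a]      : Q(s_i, a) for every discrete action a
    next_q_values[i][a] : target-network estimate of Q(s'_i, a)
    actions[i]          : action stored in the dataset for transition i
    alpha               : hypothetical weight on the regularizer
    """
    n = len(actions)
    bellman = 0.0
    penalty = 0.0
    for q, q_next, a, r, done in zip(q_values, next_q_values,
                                     actions, rewards, dones):
        # Standard one-step TD target; no bootstrap on terminal transitions.
        target = r + gamma * (0.0 if done else max(q_next))
        bellman += 0.5 * (q[a] - target) ** 2
        # Penalize large Q-values over all actions, but not on the dataset action.
        penalty += logsumexp(q) - q[a]
    return (bellman + alpha * penalty) / n


# Illustrative batch of two transitions with two actions each.
q = [[1.0, 2.0], [0.5, 0.5]]
q_next = [[0.0, 1.0], [0.0, 0.0]]
loss = conservative_q_loss(q, q_next, actions=[1, 0],
                           rewards=[1.0, 0.0], dones=[False, True])
```

Setting `alpha=0.0` recovers the plain Bellman error, so under these assumptions the penalty can be layered onto an existing Q-learning loss without changing the rest of the training loop. In a deep RL setting the same expression would be computed on network outputs and differentiated by the framework.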
