Mildly Conservative Q-Learning for Offline Reinforcement Learning

  • 2022-06-09 20:44:35
  • Jiafei Lyu, Xiaoteng Ma, Xiu Li, Zongqing Lu


Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines.
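
The abstract does not say how the pseudo Q values are constructed, so the following is only a toy sketch of the general idea and not the authors' implementation. It assumes a tabular Q function and assumes, purely for illustration, that the pseudo value for an OOD action is the largest Q value among actions that appear in the dataset at the same state. The function names, the mixing weight lam, and the example numbers are all hypothetical.

```python
import numpy as np


def pseudo_q_target(q_table, state, in_dataset_actions):
    """Assumed pseudo value for an OOD action: the best Q value among
    actions observed in the dataset at this state (illustrative choice)."""
    return max(q_table[state, a] for a in in_dataset_actions)


def mildly_conservative_loss(q_table, state, action, td_target,
                             ood_action, in_dataset_actions, lam=0.7):
    """Weighted sum of a standard squared TD error on a dataset action and
    a squared error that pulls the OOD action toward its pseudo value.
    lam is a hypothetical mixing weight in [0, 1]."""
    td_error = (q_table[state, action] - td_target) ** 2
    pseudo = pseudo_q_target(q_table, state, in_dataset_actions)
    ood_error = (q_table[state, ood_action] - pseudo) ** 2
    return lam * td_error + (1.0 - lam) * ood_error


# One state, three actions. Actions 0 and 1 appear in the dataset;
# action 2 is OOD and currently overestimated.
q = np.array([[1.0, 2.0, 5.0]])
loss = mildly_conservative_loss(q, state=0, action=1, td_target=2.0,
                                ood_action=2, in_dataset_actions=[0, 1],
                                lam=0.5)
```

In this sketch the OOD action is regressed toward a value supported by the data rather than pushed down by a fixed penalty, which is one way to read "mild" conservatism: the estimate for the unseen action cannot exceed what in-dataset actions justify, but it is not driven arbitrarily low either.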

