ACL-QL: Adaptive Conservative Level in Q-Learning for Offline Reinforcement Learning

Abstract

Offline Reinforcement Learning (RL), which operates solely on static datasets without further interaction with the environment, provides an appealing alternative for learning a safe and promising control policy. The prevailing methods typically learn a conservative policy to mitigate the problem of Q-value overestimation, but they tend to overcorrect, yielding an overly conservative policy. Moreover, they optimize all samples equally with fixed constraints and cannot control the conservative level in a fine-grained manner. This limitation results in a performance decline. To address these two challenges in a unified way, we propose a framework, Adaptive Conservative Level in Q-Learning (ACL-QL), which limits the Q-values to a mild range and enables adaptive control of the conservative level over each state-action pair, i.e., lifting the Q-values more for good transitions and less for bad transitions. We theoretically analyze the conditions under which the conservative level of the learned Q-function can be limited to a mild range and how to optimize each transition adaptively. Motivated by the theoretical analysis, we propose a novel algorithm, ACL-QL, which uses two learnable adaptive weight functions to control the conservative level over each transition. Subsequently, we design a monotonicity loss and surrogate losses to train the adaptive weight functions, Q-function, and policy network alternately. We evaluate ACL-QL on the commonly used D4RL benchmark and conduct extensive ablation studies to demonstrate its effectiveness and state-of-the-art performance relative to existing offline DRL baselines.
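
To make the idea of a per-transition conservative level concrete, the following is a minimal sketch, not the ACL-QL objective itself. It assumes a CQL-style regularizer in which Q-values of policy-sampled actions are pushed down and Q-values of dataset actions are lifted, and it replaces the usual single fixed coefficient with two per-sample weight arrays. The function name `weighted_conservative_penalty`, its arguments, and the exact form of the penalty are hypothetical and chosen only for illustration; in the paper the weights come from learned functions trained with the monotonicity and surrogate losses, which are not reproduced here.

```python
import numpy as np


def weighted_conservative_penalty(q_policy, q_data, w_policy, w_data):
    """Illustrative per-transition conservative regularizer (assumed form).

    q_policy: Q-values of actions sampled from the current policy.
    q_data:   Q-values of actions stored in the offline dataset.
    w_policy: non-negative per-sample weights on the push-down term.
    w_data:   non-negative per-sample weights on the lift term.

    Minimizing the returned scalar lowers Q on policy actions and raises
    Q on dataset actions, each scaled by its own weight.
    """
    q_policy = np.asarray(q_policy, dtype=float)
    q_data = np.asarray(q_data, dtype=float)
    w_policy = np.asarray(w_policy, dtype=float)
    w_data = np.asarray(w_data, dtype=float)

    push_down = np.mean(w_policy * q_policy)  # suppress possibly overestimated values
    lift = np.mean(w_data * q_data)           # lift values of in-dataset transitions
    return push_down - lift


# Hypothetical batch: the second dataset transition is treated as "good",
# so it receives a larger lift weight than the first.
penalty = weighted_conservative_penalty(
    q_policy=[1.0, 2.0],
    q_data=[3.0, 4.0],
    w_policy=[0.5, 0.5],
    w_data=[0.2, 1.0],
)
```

With both weight arrays set to the same constant, this sketch reduces to a fixed-coefficient conservative penalty, which is the uniform treatment of samples that the abstract identifies as a limitation.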
