Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning

Abstract

In offline reinforcement learning, the challenge of out-of-distribution (OOD) actions is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods often suffer from unnecessary conservativeness, which hampers policy improvement. This occurs because they indiscriminately use all actions from the behavior policy that generated the offline dataset as constraints. The problem becomes particularly noticeable when the quality of the dataset is suboptimal. Thus, we propose Adaptive Advantage-guided Policy Regularization (A2PR), which obtains high-advantage actions from an augmented behavior policy combined with a VAE to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset, while still effectively maintaining conservatism against OOD actions. This is achieved by harnessing the VAE's capacity to generate samples matching the distribution of the data points. We theoretically prove that the improvement of the behavior policy is guaranteed. Moreover, A2PR effectively mitigates value overestimation with a bounded performance gap. Empirically, we conduct a series of experiments on the D4RL benchmark, where A2PR demonstrates state-of-the-art performance. Furthermore, experimental results on additional suboptimal mixed datasets reveal that A2PR exhibits superior performance. Code is available at https://github.com/ltlhuuu/A2PR.
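The advantage-guided selection described above can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the threshold parameter, the squared-error penalty, and the toy critic are all hypothetical, and the generated action stands in for a sample drawn from a VAE fitted to the dataset. The sketch only shows the general idea of regularizing toward whichever candidate action has the higher advantage, and skipping the constraint when no candidate looks better than the value baseline.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def select_guiding_action(
    state: Vector,
    dataset_action: Vector,
    generated_action: Vector,
    q_fn: Callable[[Vector, Vector], float],
    v_fn: Callable[[Vector], float],
    threshold: float = 0.0,  # hypothetical hyperparameter
) -> Optional[Vector]:
    """Pick the candidate with the larger advantage A(s, a) = Q(s, a) - V(s).

    Returns None when no candidate's advantage exceeds the threshold, so the
    caller can skip regularization for this state.
    """
    baseline = v_fn(state)
    candidates = [dataset_action, generated_action]
    advantages = [q_fn(state, a) - baseline for a in candidates]
    best = max(range(len(candidates)), key=lambda i: advantages[i])
    if advantages[best] <= threshold:
        return None
    return candidates[best]


def regularization_loss(policy_action: Vector, guiding_action: Optional[Vector]) -> float:
    """Mean squared distance to the guiding action (assumed penalty form)."""
    if guiding_action is None:
        return 0.0
    squared = [(p - g) ** 2 for p, g in zip(policy_action, guiding_action)]
    return sum(squared) / len(squared)


# Toy critic for illustration only: Q peaks at a = 1, V is a constant baseline.
def toy_q(state: Vector, action: Vector) -> float:
    return -sum((a - 1.0) ** 2 for a in action)


def toy_v(state: Vector) -> float:
    return -0.5


if __name__ == "__main__":
    state = [0.0]
    guide = select_guiding_action(state, [0.0], [0.8], toy_q, toy_v)
    print("guiding action:", guide)
    print("penalty:", regularization_loss([0.5], guide))
```

In this toy setting the dataset action has a negative advantage while the generated action has a positive one, so the generated action becomes the regularization target. If both candidates fall below the threshold, the penalty is zero and the policy update is driven by the critic alone.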
