Policy Regularization with Dataset Constraint for Offline Reinforcement Learning

Abstract

We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL). A common category of existing offline RL methods is policy regularization, which typically constrains the learned policy by the distribution or support of the behavior policy. However, distribution and support constraints are overly conservative, since both force the policy to choose actions similar to those of the behavior policy in any given state. This limits the learned policy's performance, especially when the behavior policy is sub-optimal. In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective, and thus propose Policy Regularization with Dataset Constraint (PRDC). When updating the policy in a given state, PRDC searches the entire dataset for the nearest state-action sample and then restricts the policy with the action of this sample. Unlike previous works, PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset along with the given state. It is a softer constraint, but it still retains sufficient conservatism against out-of-distribution actions. Empirical evidence and theoretical analysis show that PRDC can alleviate value overestimation, a fundamental challenge in offline RL, with a bounded performance gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves state-of-the-art performance compared with existing methods. Code is available at https://github.com/LAMDA-RL/PRDC
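
The nearest-sample search described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes a Euclidean distance over the concatenation of a scaled state and the action, where the state weight `beta`, the squared-error penalty, and the function names `nearest_action` and `dataset_constraint_penalty` are all hypothetical choices made for illustration. It uses a brute-force scan in plain Python; a practical version would likely need a vectorized or tree-based nearest-neighbor search.

```python
def nearest_action(dataset, state, action, beta=2.0):
    """Return the action of the dataset pair nearest to (state, action).

    ASSUMPTION: distance is the squared Euclidean norm over
    [beta * state, action]; `beta` is a hypothetical state weight.
    """
    def sq_dist(pair):
        s, a = pair
        ds = sum((beta * (x - y)) ** 2 for x, y in zip(s, state))
        da = sum((x - y) ** 2 for x, y in zip(a, action))
        return ds + da

    # Brute-force scan over every (state, action) pair in the dataset.
    _, nearest_a = min(dataset, key=sq_dist)
    return nearest_a


def dataset_constraint_penalty(dataset, state, policy_action, beta=2.0):
    """Squared distance between the policy's action and the nearest
    sample's action (ASSUMPTION: a squared-error regularizer)."""
    nearest_a = nearest_action(dataset, state, policy_action, beta)
    return sum((x - y) ** 2 for x, y in zip(policy_action, nearest_a))


# Toy dataset of (state, action) pairs.
dataset = [
    ((0.0, 0.0), (1.0,)),
    ((1.0, 0.0), (-1.0,)),
]
query_state = (0.4, 0.0)
policy_action = (-0.8,)

# Small beta: the pair from the *other* state is nearest, so the policy
# is pulled towards an action not recorded near the query state.
soft = nearest_action(dataset, query_state, policy_action, beta=2.0)

# Large beta: state similarity dominates, so the nearest pair is the one
# whose state is closest to the query state.
strict = nearest_action(dataset, query_state, policy_action, beta=10.0)
```

In this toy example, a small `beta` lets the search return an action recorded in a different state, while a large `beta` makes the constraint behave more like matching the behavior policy's action in the closest state.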
