Policy Constraint by Only Support Constraint for Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL) aims to optimize a policy using pre-collected datasets so as to maximize cumulative rewards. However, offline RL is hampered by the distributional shift between the learned and behavior policies, which leads to errors when computing Q-values for out-of-distribution (OOD) actions. To mitigate this issue, policy constraint methods either constrain the learned policy's distribution to stay close to that of the behavior policy or confine action selection to the support of the behavior policy. However, current policy constraint methods tend to be excessively conservative, which hinders the learned policy from surpassing the behavior policy's performance. In this work, we present Only Support Constraint (OSC), which is derived from maximizing the total probability of the learned policy within the support of the behavior policy, to address the conservatism of policy constraints. OSC introduces a regularization term that only restricts policies to the support, without imposing extra constraints on actions within the support. Additionally, to fully harness the performance of the new policy constraint, OSC utilizes a diffusion model to effectively characterize the support of behavior policies. Experimental evaluations across a variety of offline RL benchmarks demonstrate that OSC significantly enhances performance, alleviating the challenges associated with distributional shift and mitigating the conservatism of policy constraints. Code is available at https://github.com/MoreanP/OSC.
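
The difference between a support-only constraint and a distribution-matching constraint can be illustrated with a toy discrete-action sketch. This is not the OSC implementation: the abstract states that OSC characterizes the support with a diffusion model, whereas the example below assumes a known tabular behavior policy and a hypothetical threshold `eps` for deciding which actions count as in-support. The function names `support_mass`, `support_penalty`, and `kl_penalty` are likewise illustrative assumptions.

```python
import math

def support_mass(pi, beta, eps=1e-3):
    """Total probability that the learned policy `pi` places on actions
    whose behavior-policy probability `beta` exceeds `eps`.
    (`eps` is an assumed threshold, not taken from the paper.)"""
    return sum(p for p, b in zip(pi, beta) if b > eps)

def support_penalty(pi, beta, eps=1e-3):
    """Negative log of the in-support mass: zero whenever all of pi's mass
    lies inside the support, regardless of how it is distributed there."""
    return -math.log(max(support_mass(pi, beta, eps), 1e-12))

def kl_penalty(pi, beta, tiny=1e-12):
    """KL(pi || beta), shown for contrast as a distribution-matching penalty."""
    return sum(
        p * (math.log(p + tiny) - math.log(b + tiny))
        for p, b in zip(pi, beta)
        if p > 0
    )

# Behavior policy over four actions; the last action never appears in the data.
beta = [0.5, 0.3, 0.2, 0.0]

# Learned policy that is greedy on a rarely taken but in-support action.
pi_inside = [0.0, 0.0, 1.0, 0.0]

# Learned policy that leaks half its mass onto the out-of-support action.
pi_leaky = [0.0, 0.0, 0.5, 0.5]

print("support penalty, inside:", support_penalty(pi_inside, beta))
print("KL penalty, inside:     ", kl_penalty(pi_inside, beta))
print("support penalty, leaky: ", support_penalty(pi_leaky, beta))
```

In this toy setting the support penalty does not penalize `pi_inside` at all, while the KL penalty does, because `pi_inside` differs from `beta` even though it never selects an out-of-support action. Only `pi_leaky`, which places mass outside the support, incurs a support penalty.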
