Abstract
Policy constraint methods in offline reinforcement learning employ additional regularization techniques to constrain the discrepancy between the learned policy and the offline dataset. However, these methods tend to result in overly conservative policies that resemble the behavior policy, thus limiting their performance. We investigate this limitation and attribute it to the static nature of traditional constraints. In this paper, we propose a novel dynamic policy constraint that restricts the learned policy on the samples generated by the exponential moving average of previously learned policies. By integrating this self-constraint mechanism into off-policy methods, our method facilitates the learning of non-conservative policies while avoiding policy collapse in the offline setting. Theoretical results show that our approach results in a nearly monotonically improved reference policy. Extensive experiments on the D4RL MuJoCo domain demonstrate that our proposed method achieves state-of-the-art performance among the policy constraint methods.
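
As an illustration of the self-constraint idea, the following minimal sketch shows one way a reference policy could be maintained as an exponential moving average of the learned policy's parameters, with a penalty tying the learned policy to the reference policy's actions. The function names, the rate `tau`, the weight `alpha`, and the squared-distance penalty are assumptions made for illustration only; the abstract does not specify the exact form of the constraint.

```python
def ema_update(ref_params, policy_params, tau=0.005):
    """Move each reference parameter a fraction tau toward the current policy.

    tau is a hypothetical smoothing rate; small values make the reference
    policy change slowly relative to the learned policy.
    """
    return [(1.0 - tau) * r + tau * p for r, p in zip(ref_params, policy_params)]


def constrained_actor_loss(q_value, policy_action, ref_action, alpha=1.0):
    """Actor loss: maximize Q while staying close to the reference action.

    The squared-distance penalty and the weight alpha are assumed here;
    other discrepancy measures could be substituted.
    """
    penalty = sum((a - b) ** 2 for a, b in zip(policy_action, ref_action))
    return -q_value + alpha * penalty


# Toy usage with flat parameter lists standing in for network weights.
ref = [0.0, 1.0]
policy = [1.0, 1.0]
ref = ema_update(ref, policy, tau=0.5)
loss = constrained_actor_loss(2.0, [1.0, 0.0], [0.0, 0.0], alpha=0.5)
```

Because the reference tracks the learned policy rather than the fixed behavior policy, the constraint target in this sketch moves as learning progresses, which is the contrast with static constraints drawn above.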