Supported Trust Region Optimization for Offline Reinforcement Learning

Abstract

Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases. We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive support constraint. We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Further with both errors incorporated, STR still guarantees safe policy improvement for each step. Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and much more challenging AntMaze domains.
