Abstract
Online interaction with the environment to collect data samples for training a Reinforcement Learning (RL) agent is not always feasible due to economic and safety concerns. The goal of Offline Reinforcement Learning is to address this problem by learning effective policies from previously collected datasets. Standard off-policy RL algorithms are prone to overestimating the values of out-of-distribution (less explored) actions and are hence unsuitable for Offline RL. Behavior regularization, which constrains the learned policy to the support set of the dataset, has been proposed to tackle the limitations of standard off-policy algorithms. In this paper, we improve behavior-regularized offline reinforcement learning and propose BRAC+. First, we propose a quantification of out-of-distribution actions and compare Kullback-Leibler (KL) divergence with Maximum Mean Discrepancy (MMD) as the regularization protocol. We propose an analytical upper bound on the KL divergence as the behavior regularizer to reduce the variance associated with sample-based estimations. Second, we show mathematically that, under mild assumptions, the learned Q values can diverge even with a behavior-regularized policy update. This leads to large overestimations of the Q values and performance deterioration of the learned policy. To mitigate this issue, we add a gradient penalty term to the policy evaluation objective. By doing so, the Q values are guaranteed to converge. On challenging offline RL benchmarks, BRAC+ outperforms the baseline behavior-regularized approaches by 40% to 87% and the state-of-the-art approach by 6%.