Abstract
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static, previously collected datasets without additional interaction with the environment. The major obstacle to offline RL is the estimation error that arises when evaluating the value of out-of-distribution actions. To tackle this problem, most existing offline RL methods attempt to acquire a policy that is both "close" to the behaviors contained in the dataset and sufficiently improved over them, which requires a trade-off between two possibly conflicting targets. In this paper, we propose a novel approach, which we refer to as adaptive behavior regularization (ABR), to balance this critical trade-off. By simply utilizing a sample-based regularization, ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset. In evaluations on the D4RL datasets, a widely adopted benchmark for offline reinforcement learning, ABR achieves improved or competitive performance compared to existing state-of-the-art algorithms.