Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons

Abstract

We consider reinforcement learning (RL) methods in offline domains without additional online data collection, such as mobile health applications. Most existing policy optimization algorithms in the computer science literature are developed in online settings where data are easy to collect or simulate. Their generalizations to mobile health applications with a pre-collected offline dataset remain unknown. The aim of this paper is to develop a novel advantage learning framework that makes efficient use of pre-collected data for policy optimization. The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithm as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than that of the policy derived from the initial Q-estimator. Extensive numerical experiments are conducted to back up our theoretical findings. A Python implementation of our proposed method is available at https://github.com/leyuanheart/SEAL.
