Provably Good Batch Reinforcement Learning Without Great Exploration

Abstract

Batch reinforcement learning (RL) is important to apply RL algorithms to manyhigh stakes tasks. Doing batch RL in a way that yields a reliable new policy inlarge domains is challenging: a new decision policy may visit states andactions outside the support of the batch data, and function approximation andoptimization with limited samples can further increase the potential oflearning policies with overly optimistic estimates of their future performance.Recent algorithms have shown promise but can still be overly optimistic intheir expected outcomes. Theoretical work that provides strong guarantees onthe performance of the output policy relies on a strong concentrabilityassumption, that makes it unsuitable for cases where the ratio betweenstate-action distributions of behavior policy and some candidate policies islarge. This is because in the traditional analysis, the error bound scales upwith this ratio. We show that a small modification to Bellman optimality andevaluation back-up to take a more conservative update can have much strongerguarantees. In certain settings, they can find the approximately best policywithin the state-action space explored by the batch data, without requiring apriori assumptions of concentrability. We highlight the necessity of ourconservative update and the limitations of previous algorithms and analyses byillustrative MDP examples, and demonstrate an empirical comparison of ouralgorithm and other state-of-the-art batch RL baselines in standard benchmarks.

Quick Read (beta)

loading the full paper ...