Abstract
Offline reinforcement learning algorithms often require carefulhyperparameter tuning. Consequently, before deployment, we need to selectamongst a set of candidate policies. As yet, however, there is littleunderstanding about the fundamental limits of this offline policy selection(OPS) problem. In this work we aim to provide clarity on when sample efficientOPS is possible, primarily by connecting OPS to off-policy policy evaluation(OPE) and Bellman error (BE) estimation. We first show a hardness result, thatin the worst case, OPS is just as hard as OPE, by proving a reduction of OPE toOPS. As a result, no OPS method can be more sample efficient than OPE in theworst case. We then propose a BE method for OPS, called Identifiable BESelection (IBES), that has a straightforward method for selecting its ownhyperparameters. We highlight that using IBES for OPS generally has morerequirements than OPE methods, but if satisfied, can be more sample efficient.We conclude with an empirical study comparing OPE and IBES, and by showing thedifficulty of OPS on an offline Atari benchmark dataset.