STEEL: Singularity-aware Reinforcement Learning

Abstract

Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy that maximizes the expected total rewards in a dynamic environment. Existing methods require an absolute continuity assumption (e.g., there do not exist non-overlapping regions) on the distribution induced by target policies with respect to the data distribution over the state, the action, or both. We propose a new batch RL algorithm that allows for singularity in both the state and action spaces (e.g., the existence of non-overlapping regions between the offline data distribution and the distribution induced by the target policies) in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable model extrapolation. By leveraging the idea of pessimism and under some technical conditions, we derive a first finite-sample regret guarantee for our proposed algorithm under singularity. Compared with existing algorithms, by requiring only a minimal data-coverage assumption, STEEL improves the applicability and robustness of batch RL. In addition, a two-step adaptive STEEL, which is nearly tuning-free, is proposed. Extensive simulation studies and one (semi-)real experiment on personalized pricing demonstrate the superior performance of our methods in dealing with possible singularity in batch RL.
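To give a feel for why maximum mean discrepancy (MMD) is a natural tool under singularity, the sketch below computes a standard empirical squared MMD between two samples. It is not the STEEL algorithm: the Gaussian kernel, the bandwidth, the sample sizes, and the names `gaussian_kernel`, `mmd_squared`, `behavior`, `overlap_target`, and `shifted_target` are all assumptions made for illustration. The point is only that MMD stays finite and informative when the two distributions have disjoint supports, a case in which a density ratio is undefined.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values produced by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy


rng = np.random.default_rng(0)

# Hypothetical "offline data" sample on the unit square.
behavior = rng.uniform(0.0, 1.0, size=(500, 2))
# A target sample with the same support (full overlap).
overlap_target = rng.uniform(0.0, 1.0, size=(500, 2))
# A target sample with disjoint support (the singular case).
shifted_target = rng.uniform(2.0, 3.0, size=(500, 2))

mmd_overlap = mmd_squared(behavior, overlap_target)
mmd_shifted = mmd_squared(behavior, shifted_target)

print(f"squared MMD, overlapping supports: {mmd_overlap:.4f}")
print(f"squared MMD, disjoint supports:    {mmd_shifted:.4f}")
```

With a bounded kernel such as the Gaussian one used here, the squared MMD is bounded regardless of how the supports relate, so the disjoint case yields a larger but still finite discrepancy.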
