STEEL: Singularity-aware Reinforcement Learning

Abstract

Batch reinforcement learning (RL) aims to leverage pre-collected data to find an optimal policy that maximizes the expected total rewards in a dynamic environment. Nearly all existing algorithms rely on the assumption that the distribution induced by target policies is absolutely continuous with respect to the data distribution, so that the batch data can be used to calibrate target policies via a change of measure. However, the absolute continuity assumption could be violated in practice (e.g., no-overlap support), especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm that does not require absolute continuity, in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis of off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, by requiring only a minimal data-coverage assumption, STEEL significantly improves the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method in dealing with possible singularity in batch RL.
