Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning

Abstract

Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. In time-invariant Markov decision processes, however, our bounds show that truly off-policy evaluation is feasible, even with just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions. It remains efficient when both are estimated at slow, nonparametric rates, and it remains consistent when either one is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
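
To make the "either is estimated consistently" property concrete, the following is a minimal sketch of one standard doubly robust form that combines a density ratio with a $q$-function in the discounted setting. It is an illustration under stated assumptions, not necessarily the exact estimator of the paper: the function name `drl_estimate`, its argument names, the discount factor `gamma`, and the normalization of the policy value by $(1-\gamma)$ are all assumptions introduced here. The inputs are taken to be already-estimated nuisance values evaluated on observed transitions.

```python
from statistics import fmean


def drl_estimate(w, r, q_sa, v_next, v_init, gamma):
    """Hypothetical doubly robust value estimate (illustrative sketch only).

    w      : estimated density ratio w(s, a) for each observed transition
    r      : observed reward for each transition
    q_sa   : estimated q(s, a) for each transition
    v_next : estimated v(s') = E_{a' ~ target}[q(s', a')] for each transition
    v_init : estimated v(s0) on draws from the initial state distribution
    gamma  : discount factor in [0, 1); the value is normalized by (1 - gamma)
    """
    # Direct (model-based) term: normalized value at the initial states.
    direct = (1.0 - gamma) * fmean(v_init)
    # Correction term: density-ratio-weighted temporal-difference residuals.
    correction = fmean(
        wi * (ri + gamma * vn - qi)
        for wi, ri, qi, vn in zip(w, r, q_sa, v_next)
    )
    return direct + correction


# Toy usage with made-up numbers: the rewards are chosen so that every
# temporal-difference residual is zero, mimicking a correct q-function.
gamma = 0.9
q_sa = [1.0, 2.0, 3.0]
v_next = [2.0, 1.0, 0.5]
r = [q - gamma * v for q, v in zip(q_sa, v_next)]
w = [0.3, 5.0, 1.2]          # deliberately arbitrary density ratios
v_init = [1.0, 3.0]
estimate = drl_estimate(w, r, q_sa, v_next, v_init, gamma)
```

In this toy call the weights `w` are arbitrary, yet the correction term vanishes because the residuals are zero, so the estimate reduces to the direct term. That is one half of the double robustness idea: a correct $q$-function protects against a wrong density ratio. The other half, where a correct ratio protects against a wrong $q$-function, holds in expectation rather than sample by sample.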
