Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

Abstract

Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings where experimentation is limited, such as education and healthcare. But, in these very same settings, observed actions are often confounded by unobserved variables, making OPE even more difficult. We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders. We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data. Our method involves two stages. In the first, we show how to use proxies to estimate stationary distribution ratios, extending recent work on breaking the curse of horizon to the confounded setting. In the second, we show optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions. We establish theoretical guarantees of consistency, and benchmark our method empirically.
