Off-Policy Evaluation in Partially Observable Environments

Abstract

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to the procedure we provided for general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, importance sampling, and compare with our result on synthetic medical data.
