Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

Abstract

We consider the evaluation and training of a new policy for the evaluation data by using the historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although the standard OPE and OPL assume the same distribution of covariate between the historical and evaluation data, there often exists a problem of a covariate shift, i.e., the distribution of the covariate of the historical data is different from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using an estimator of the density ratio between the distributions of the historical and evaluation data. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.
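
To make the idea of a density-ratio-weighted doubly robust estimator concrete, the following is a minimal sketch of one common form of such an estimator. It is an illustration under stated assumptions, not necessarily the exact estimator proposed in the paper. The function name and all argument names are hypothetical. The density ratio, the behavior-policy probabilities, and the outcome-model predictions are assumed to have been estimated beforehand and are passed in as arrays.

```python
import numpy as np


def dr_value_under_covariate_shift(actions, rewards, density_ratio,
                                   pi_b_hist, pi_e_hist, q_hist,
                                   pi_e_eval, q_eval):
    """Sketch of a doubly robust OPE estimate under a covariate shift.

    All names are hypothetical. Shapes (n_hist logged samples,
    n_eval evaluation covariates, K actions):
      actions       (n_hist,)    logged action indices in {0, ..., K-1}
      rewards       (n_hist,)    logged rewards
      density_ratio (n_hist,)    estimated q(x)/p(x) at historical covariates
      pi_b_hist     (n_hist, K)  behavior-policy probabilities
      pi_e_hist     (n_hist, K)  new-policy probabilities on historical data
      q_hist        (n_hist, K)  outcome-model predictions on historical data
      pi_e_eval     (n_eval, K)  new-policy probabilities on evaluation data
      q_eval        (n_eval, K)  outcome-model predictions on evaluation data
    """
    actions = np.asarray(actions, dtype=int)
    rewards = np.asarray(rewards, dtype=float)
    density_ratio = np.asarray(density_ratio, dtype=float)
    pi_b_hist = np.asarray(pi_b_hist, dtype=float)
    pi_e_hist = np.asarray(pi_e_hist, dtype=float)
    q_hist = np.asarray(q_hist, dtype=float)
    pi_e_eval = np.asarray(pi_e_eval, dtype=float)
    q_eval = np.asarray(q_eval, dtype=float)

    rows = np.arange(len(actions))

    # Weight = covariate density ratio times policy ratio at the logged action.
    weights = density_ratio * pi_e_hist[rows, actions] / pi_b_hist[rows, actions]

    # Correction term: weighted residuals of the outcome model on historical data.
    correction = np.mean(weights * (rewards - q_hist[rows, actions]))

    # Direct-method term: model-based value of the new policy on evaluation data.
    direct = np.mean(np.sum(pi_e_eval * q_eval, axis=1))

    return direct + correction
```

In this sketch, the correction term averages to zero when the outcome model is correct, and the estimate reduces to a density-ratio-weighted importance sampling estimate when the outcome model is identically zero. This is the intuition usually given for double robustness.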
