Stateful Offline Contextual Policy Evaluation and Learning

Abstract

We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals with contexts, which generate unknown individual-level responses to agent actions. This model can be thought of as an offline generalization of contextual bandits with resource constraints. We formalize the relevant causal structure of problems such as dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The key insight is that an individual-level response is often not causally affected by the state variable and can therefore easily be generalized across timesteps and states. When this is true, we study implications for (doubly robust) off-policy evaluation and learning by instead leveraging single time-step evaluation, estimating the expectation over a single arrival via data from a population, for fitted-value iteration in a marginal MDP. We study sample complexity and analyze error amplification that leads to the persistence, rather than attenuation, of confounding error over time. In simulations of dynamic and capacitated pricing, we show improved out-of-sample policy performance in this class of relevant problems.
