Abstract
In offline reinforcement learning (RL), an optimal policy is learnt solely from a priori collected observational data. However, in observational data, actions are often confounded by unobserved variables. Instrumental variables (IVs), in the context of RL, are variables whose influence on the state variables is mediated entirely through the action. When a valid instrument is present, we can recover the confounded transition dynamics from observational data. We study a confounded Markov decision process where the transition dynamics admit an additive nonlinear functional form. Using IVs, we derive a conditional moment restriction (CMR) through which we can identify the transition dynamics based on observational data. We propose a provably efficient IV-aided Value Iteration (IVVI) algorithm based on a primal-dual reformulation of the CMR. To the best of our knowledge, this is the first provably efficient algorithm for instrument-aided offline RL.
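
As a rough illustration of how an additive transition model and an instrument can give rise to a conditional moment restriction, consider the following sketch. The notation is assumed for illustration and is not taken from the source: $s_h$ is the state, $a_h$ the action, $z_h$ the instrument, $e_h$ the unobserved additive noise that confounds the action, and $F^*$ the unknown transition function. The second display shows one standard primal-dual (minimax) form of such a restriction, obtained from the Fenchel dual of the squared norm; it may differ from the exact formulation used by IVVI.

```latex
% Assumed notation (illustrative only): s_h state, a_h action,
% z_h instrument, e_h unobserved additive noise, F^* true dynamics.
\begin{align}
  s_{h+1} &= F^*(s_h, a_h) + e_h,
  \qquad \mathbb{E}[e_h \mid z_h] = 0 \\
  &\Longrightarrow\quad
  \mathbb{E}\bigl[\, s_{h+1} - F^*(s_h, a_h) \,\big|\, z_h \bigr] = 0 .
\end{align}

% One standard minimax reformulation of the restriction above:
% the inner maximum equals (1/2) E || E[s_{h+1} - F(s_h,a_h) | z_h] ||^2,
% which is zero at F = F^*.
\begin{equation}
  \min_{F}\; \max_{u}\;
  \mathbb{E}\Bigl[\, u(z_h)^{\top}\bigl(s_{h+1} - F(s_h, a_h)\bigr)
  - \tfrac{1}{2}\,\lVert u(z_h) \rVert_2^{2} \Bigr].
\end{equation}
```

The point of the dual variable $u$ is that the conditional expectation given $z_h$ never has to be estimated directly: both $F$ and $u$ can be fitted from samples $(s_h, a_h, z_h, s_{h+1})$ in the offline dataset.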