Harnessing Causality in Reinforcement Learning With Bagged Decision Times

Abstract

We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. All actions within a bag jointly impact a single reward, observed at the end of the bag. For example, in mobile health, multiple activity suggestions in a day collectively affect a user's daily commitment to being active. Our goal is to develop an online RL algorithm to maximize the discounted sum of the bag-specific rewards. To handle non-Markovian transitions within a bag, we utilize an expert-provided causal directed acyclic graph (DAG). Based on the DAG, we construct states as a dynamical Bayesian sufficient statistic of the observed history, which results in Markov state transitions within and across bags. We then formulate this problem as a periodic Markov decision process (MDP) that allows non-stationarity within a period. An online RL algorithm based on Bellman equations for stationary MDPs is generalized to handle periodic MDPs. We show that our constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP. Finally, we evaluate the proposed method on testbed variants built from real data in a mobile health clinical trial.
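To make the periodic-MDP formulation concrete, the following is a minimal tabular sketch of Bellman backups with one value function per within-bag decision time. It is an illustration only and not the paper's online algorithm: it assumes the transition tensor `P` and end-of-bag reward table `R` are known, that every decision time shares the same finite state and action sets, and that the discount `gamma` is applied once per bag. The function name and all array layouts are hypothetical.

```python
import numpy as np


def periodic_value_iteration(P, R, gamma, n_sweeps=200):
    """Tabular value iteration for a periodic MDP (illustrative sketch).

    Assumed (hypothetical) layout:
      P[k, s, a, s2]: probability of moving from state s to s2 under
                      action a at within-bag decision time k.
      R[s, a]:        expected bag reward, received only at the last
                      decision time of the bag.
      gamma:          discount factor applied once per bag.
    Returns V[k, s] and a greedy policy pi[k, s].
    """
    K, S, A, _ = P.shape
    V = np.zeros((K, S))
    pi = np.zeros((K, S), dtype=int)
    for _ in range(n_sweeps):
        # Sweep backward through one period (one bag).
        for k in reversed(range(K)):
            # Value at the next decision time; wraps to the next bag.
            next_V = V[(k + 1) % K]
            Q = P[k] @ next_V  # shape (S, A)
            if k == K - 1:
                # The bag ends here: add the reward, discount the next bag.
                Q = R + gamma * Q
            V[k] = Q.max(axis=1)
            pi[k] = Q.argmax(axis=1)
    return V, pi
```

Because the value function is indexed by the within-bag time `k`, the dynamics may differ across decision times inside a bag while remaining stationary from one bag to the next, which is the structure the periodic MDP is meant to capture.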
