Abstract
We study how to learn optimal interventions sequentially given causalinformation represented as a causal graph along with associated conditionaldistributions. Causal modeling is useful in real world problems like onlineadvertisement where complex causal mechanisms underlie the relationship betweeninterventions and outcomes. We propose two algorithms, causal upper confidencebound (C-UCB) and causal Thompson Sampling (C-TS), that enjoy improvedcumulative regret bounds compared with algorithms that do not use causalinformation. We thus resolve an open problem posed by\cite{lattimore2016causal}. Further, we extend C-UCB and C-TS to the linearbandit setting and propose causal linear UCB (CL-UCB) and causal linear TS(CL-TS) algorithms. These algorithms enjoy a cumulative regret bound that onlyscales with the feature dimension. Our experiments show the benefit of usingcausal information. For example, we observe that even with a few hundreds ofiterations, the regret of causal algorithms is less than that of standardalgorithms by a factor of three. We also show that under certain causalstructures, our algorithms scale better than the standard bandit algorithms asthe number of interventions increases.