Rarely-switching linear bandits: optimization of causal effects for the real world

  • 2019-05-30 15:52:51
  • Benjamin Lansdell, Sofia Triantafillou, Konrad Kording
  • 27


Exploring the effect of policies in many real world scenarios is difficult,unethical, or expensive. After all, doctor guidelines, tax codes, and pricelists can only be reprinted so often. We may thus want to only change a policywhen it is probable that the change is beneficial. Fortunately, thresholdsallow us to estimate treatment effects. Such estimates allows us to optimizethe threshold. Here, based on the theory of linear contextual bandits, wepresent a conservative policy updating procedure which updates a deterministicpolicy only when needed. We extend the theory of linear bandits to thisrarely-switching case, proving such procedures share the same regret, up toconstant scaling, as the common LinUCB algorithm. However the algorithm makesfar fewer changes to its policy. We provide simulations and an analysis of aninfant health well-being causal inference dataset, showing the algorithmefficiently learns a good policy with few changes. Our approach allowsefficiently solving problems where changes are to be avoided, with potentialapplications in economics, medicine and beyond.


Quick Read (beta)

Rarely-switching linear bandits: optimization of causal effects for the real world

Benjamin J. Lansdell,
Department of Bioengineering
University of Pennsylvania
Pennsylvania, PA 19104
[email protected]
&Sofia Triantafillou
Department of Biomedical Informatics
University of Pittsburgh School of Medicine
Pittburgh, PA 15206 &Konrad P. Kording
Department of Bioengineering
University of Pennsylvania
Pennsylvania, PA 19104

Exploring the effect of policies in many real world scenarios is difficult, unethical, or expensive. After all, doctor guidelines, tax codes, and price lists can only be reprinted so often. We may thus want to only change a policy when it is probable that the change is beneficial. Fortunately, thresholds allow us to estimate treatment effects. Such estimates allows us to optimize the threshold. Here, based on the theory of linear contextual bandits, we present a conservative policy updating procedure which updates a deterministic policy only when needed. We extend the theory of linear bandits to this rarely-switching case, proving such procedures share the same regret, up to constant scaling, as the common LinUCB algorithm. However the algorithm makes far fewer changes to its policy. We provide simulations and an analysis of an infant health well-being causal inference dataset, showing the algorithm efficiently learns a good policy with few changes. Our approach allows efficiently solving problems where changes are to be avoided, with potential applications in economics, medicine and beyond.


Rarely-switching linear bandits: optimization of causal effects for the real world

  Benjamin J. Lansdell, Department of Bioengineering University of Pennsylvania Pennsylvania, PA 19104 [email protected] Sofia Triantafillou Department of Biomedical Informatics University of Pittsburgh School of Medicine Pittburgh, PA 15206 Konrad P. Kording Department of Bioengineering University of Pennsylvania Pennsylvania, PA 19104


noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Many decisions in healthcare, economics, and beyond are based on thresholds Moscoe2015 ; Bor2014 . Scholarships are awarded when exam scores exceed a threshold Thistlewaite1960 , assistance is given to those below an income thresholds, and many medical treatments and diagnoses are based on thresholding biometrics (e.g. cholesterol Nayor2017 ; Pandya2015 , high blood pressure Dennison-himmelfarb2014 , and diabetes/prediabetes Barry ). Such policies are often applied where it is difficult or unethical to experiment with the population and a policy should benefit the population as much as possible. Exploration is avoided: if a patient arrives with high blood pressure then they receive treatment – we can not explore with the aim of learning more about the treatment effect. Prior studies will typically have established treatment benefits. Thresholds are used when only the population above the threshold is expected to benefit from the treatment. While these policies are generally chosen based on domain knowledge, they could be further optimized.

Figure 1: a) Many policies assign a treatment by thresholding on a single covariate. Similarly, policies may b) assign treatment by thresholding on a linear combination of multiple covariates, or c) assign multiple treatments by thresholding on multiple linear combinations of covariates. d) By construction, the optimal threshold lies somewhere in a region determined by the confidence intervals on the outcome of each treatment.

Optimizing a policy while acting greedily is challenging. Particularly because, in cases where policies are health or economic recommendations, it may be difficult or costly to change a policy excessively. That is, there is an added need to find an optimal threshold in a way that minimizes both the number of erroneous changes and the number of total changes to the policy. In these cases changes to a policy should only be made when there is sufficient evidence it will improve outcomes – they should be rarely switching Abbasi-yadkori2011 . There is a conflict between the greedy aims of such a policy and the need to explore to optimize it.

How can a policy be efficiently learned while maintaining confidence each update will improve utility? The effect of a change in policy is an extrapolation: the effect of treatment for those well below a current threshold can only be estimated from the population above that threshold, since they are the only subjects to receive treatment according to the policy. Yet there may be many reasons a treatment will effect these populations differently. However, the populations just above and below the threshold are similar – the only difference between them is that one group receives the treatment and the other does not. In fact this allows for a measure of the causal effect of the treatment for this sub-population pearl2000causality . And this in turn provides a measure of the change in utility for a small change in policy.

The idea of measuring causal effects by comparing populations at a threshold is known as regression discontinuity design (RDD), and is commonly used in econometrics Imbens2017 . RDD is often used when a policy is determined by thresholding on a single covariate (Figure 1a). It has been used to estimate the effects of education programs Moss2014 , election outcomes Lee2008b , and policies determined by age-eligibility Carpenter2011 . It can also be extended to higher dimensions (Figure 1b; Bertanha2019 ; Cattaneo2019 ; Cheng2016 ; Wong2010 ; Cattaneo2016 ) or setting with multiple treatment options (Figure 1c; Papay2011 ). For instance, students may take tests with thresholds in several different subjects, and school districts may reward teachers for improving student test scores in both mathematics and English Papay2011 ; Wong2013 . When the conditions for their use are valid, RD designs are a reliable way to measure treatment effects. The above reasoning shows that causal effects as measured by RDD can be used to make small updates to a threshold with high confidence it will improve utility Dong2015a ; Marinescu . However, what schedule of updates to a threshold will converge on an optimal policy?

We propose to answer this question by framing the problem as a contextual multi-armed bandit. The balance between exploration and exploitation is well studied in multi-arm bandits. In a bandit problem an agent must choose from a set of actions at each round and only observes reward for the action chosen Szepesvari2018 . A contextual bandit provides the agent with side information about the set of choices at each round, which can be used to make a more informed decision Zhou2015a . Contextual multi-armed bandits are used in personalized medicine, targeted advertising and website design. Much is known under a variety of assumptions about the relation between context and reward – e.g. using linear models Abbasi-yadkori2011 ; Chu2011 ; Li2010a , Gaussian processes Krause2011 ; Srinivas2009 ; Guan2018 , and GLMs Filippi2010 . Further, there is an intimate relation between learning in a bandit setting and the defining problem of causal inference – that only one of multiple potential outcomes are ever observed – and thus there is a natural correspondence between multi-armed bandits and optimization in a setting where causal effects must also be learned. Many of the above motivating examples, once causal effects are identifiable, could be optimized using theory of multi-armed bandits. Here we draw from this theory to understand how to optimize rarely switching policies.

As machine learning and AI are increasingly used to aid decision making, there is a growing need for safety and performance guarantees to be understood and implemented Berkenkamp2015 ; Aboutalebi2019 . Thus a lot of recent work considers conservative or safe bandit algorithms. For instance, methods that do not perform below some baseline level Wu2016 ; Katariya2018 ; Kazerouni2017 ; Kakade ; Sun2016a ; Chow2019 , or that are risk-averse, avoiding large variance in outcomes Vakili2016 ; Sani2013 ; David2016 ; Zimin2014 , have been developed. In particular, recent work has considered exploration-free, or greedy bandit algorithms, in a linear contextual setting Bastani2017 ; Kannan2018 . Surprisingly, under certain distributional assumptions, noise alone provides the necessary exploration to allow learning for free Bastani2017 . In fact, a recent comparison between contextual bandit algorithms shows that greedy algorithms can do quite well in practice Bietti2018 ; Foster2018 . This suggests that sufficient conditions for a greedy algorithm to work well do occur: exploration bonuses can indeed be avoided in some settings. For this reason here we investigate rarely-switching policy optimization by casting the problem as a linear contextual bandit.

Only limited previous work has considered rarely-switching policies. For instance, the case where there is a cost to changing actions has received attention Guha2009 ; Koren2017 ; Koren2017a . Abbasi-Yadkori et al 2011 Abbasi-yadkori2011 consider a rarely-switching class of Linear upper confidence bound (LinUCB) algorithms. However their aim is to save computation, and not to minimize the number of changes made to avoid exploration with policy. Rather than considering cases where it is costly to change actions, we consider the case where it is costly to change policy.

Here we study how to find optimal thresholds while minimizing changes in policy as a linear contextual bandit. We consider a family of policies defined by a fixed threshold on a linear combination of covariates, that rarely switches, and that generalizes to higher dimensions and number of arms. We propose conservative policy updating algorithms that only update their threshold when justified. This extends the theory of the linear upper confidence bound (LinUCB) algorithm, and so the algorithms, up to constants, share its asymptotic regret behavior Abbasi-yadkori2011 . However they result in fewer erroneous changes to the policy, and fewer overall changes to the policy. Empirically, we show that these methods compare favorably to previous a linear bandit algorithm that rarely switch its policy Abbasi-yadkori2011 . Further, we show how such algorithms compare to methods that improve on a baseline policy, such as the conservative linear UCB (CLUCB) algorithm Kazerouni2017 , where with high probability the algorithm always performs better than a baseline, initial policy. We verify the method with simulated data, and show its utility on an infant health, causal inference dataset. This work thus shows how conservative bandit approaches can be extended to a common and important decision making framework.

1.1 Linear contextual bandits

We study the following version of the common stochastic linear bandit Tsitsiklis2010 ; Szepesvari2018 ; Kazerouni2017 ; Abbasi-yadkori2011 . At each round, t, the agent observes a context 𝐬t𝒟d and then selects an action at𝒜. We assume each context is drawn independently from a distribution ρ. The context may represent a feature vector from an underlying state: 𝐬t=ϕ(𝐱t). Here we consider cases where each arm is presented with the same information (e.g. when 𝐬t represents patient information and arms are a fixed set of treatment options – similar to the set up of Bastani2017 ). This can be considered a special case of the framework in which the function ϕ, and hence the context 𝐬t, also depend on action at Abbasi-yadkori2011 ; Szepesvari2018 . The agent observes a reward linearly dependent on the features


for θatd the unknown reward parameters. Let θ=(θa)a𝒜 be the set of all parameters. Let zt=𝐬tTθat denote the expected reward, and let t be the filtration capturing the history of the process up until observing reward at round t: t=σ(a1,,at,η1,,ηt-1).

We make the following standard assumptions:

Assumption 1.

The noise ηt is conditionally σ-subgaussian:


The problem is bounded in the following way:

Assumption 2.

There exist constants B,D0 such that θaB, stD and stTθa[0,1] for all t and aA.

At each round, for the observed context, there is an optimal arm to pull: at*=argmaxa𝐬tTθa, splitting ties arbitrarily. From this we can define the instantaneous regret:


We aim to find a policy π:𝒟𝒜 that minimizes the cumulative regret over T rounds:


Let RT=𝔼(R^T), where the expectation is taken over histories up to horizon T.

1.2 Rarely-switching policies

The optimal policy is a multiclass classifier:

π*(𝐬t)=argmaxa𝐬tTθa. (1)

Here we consider learning policies whose basic form is the following:

π(𝐬t)=argmaxa𝐬tTθ~ta, (2)

and thus that are parameterized by the set θ~t={θ~ta}a𝒜, which can be held constant between rounds. We call this family of policies rarely-switching policies. They can be seen as policies in which decisions are made by thresholding on a linear combination of context variables, which is only updated rarely.

1.3 Confidence bounds

Rarely-switching policies operate based on the provisionally optimal arm given a set of operational parameter values θ~. A policy needs to update these parameters on some schedule to be close to current estimate of the true parameters. We thus maintain the least squares estimate of parameters given the pulls of each particular arm, θ^ta. Let Tta be the set of times for which arm a was pulled up to and including round t. Then let the arm parameter estimates be given by the ridge regression solution

Vta =λI+sTta𝐬s𝐬sT,
θ^ta =(Vta)-1sTtays𝐬s,

for regularization parameter λ>0. Let Vt be the block diagonal matrix constructed from the matrices {Vta}. To construct working rarely-switching bandit algorithms we also must construct confidence bounds for each arm. Care is needed in constructing confidence sets for these estimators because, unlike in standard least squares estimation, the sequence of observations {𝐬s,ys}s=1t are no longer independent and identically distributed – the action chosen depends on the history of the process, which induces correlations among {𝐬𝐭}tTta. By using Martingale theory, we can find confidence intervals that apply for any policy Abbasi-yadkori2011 ; Szepesvari2018 . Let 𝒞t be the following set:


where 𝐱V2=𝐱TV𝐱, for positive definite matrix V, and some bound βt. We can place a lower bound on the probability that the true parameters lie in the set of confidence sets for all rounds. Let this be the event:


Theorem 20.2 of Szepesvari2018 provides an explicit form of βt for which the event t occurs with probability at least 1-δ (provided as Theorem B1 in the supplementary material).

1.4 Feasible contextual linear bandits

Which rarely-switching algorithms learn the optimal parameters? We answer this question by first studying a more general class of algorithms. Consider the general case of the k-armed bandit, 𝒜={1,,k}. For each observed context, 𝐬t, the confidence bounds define plausible intervals of payoffs for each arm. That is, assuming t occurs, the true expected reward for arm a lies in the interval


For each round, assuming the confidence bounds hold, the upper bound payoff of some arms may be below the lower bound payoff of another arm, excluding these arms from plausibly being the best choice. Call any algorithm that never plays these excluded arms a feasible algorithm. With this we can establish the following result:

Theorem 1.

With probability at least 1-δ, the regret of any policy learned by a feasible algorithm is bounded by


All proofs are provided as supplementary material. Corollary 19.3 from Szepesvari2018 can be applied and then provides:

Corollary 1.

Choosing δ=1/T, the expected regret obeys


for constant C>0.

The LinUCB algorithm and the greedy algorithm (θ~t=θ^t, see, for example, Bastani et al Bastani2017 ) are feasible algorithms.

2 Feasible, rarely-switching bandits

We now consider feasible, rarely-switching algorithms that only update their policy when justified. To derive them we first consider the geometry of the linear contextual bandit.

Figure 2: The geometry of linear contextual bandits. (a) The arm parameters define a convex polytope which determines the policy (b). In this example, the decision boundaries are normal vectors to the faces of C. (c) Uncertainty in the parameters creates uncertainty in the polytope, (d) which in turn creates uncertainty in the policy (grey areas). When a decision boundary is definitely outside the plausible interval, moving it to the edge of that interval will improve expected regret of the policy.

For each context, 𝐬t, policies of the form (1) are linear programs murty1983linear . That is, they are solutions to:

max𝐱conv(θ)𝐬tT𝐱, (3)

where Cconv(θ) denotes the convex hull of the set of arm parameters {θa}a=1k, and can equivalently be represented as a set of linear inequalities A𝐱𝐛. C is a closed convex polytope in d. For all 𝐬td{0}, a solution to (3) can found on one of the vertices of conv(θ) (murty1983linear , Theorem 3.3). Thus a solution to (1) is a solution to (3) and at least one solution of (3) is a solution to (1). This correspondence provides a lot of structure. For instance, it is clear that arms which are not extreme points of C are never played. And, the decision boundaries of π(𝐬t) can be seen as contexts 𝐬t which define supporting hyperplanes which contain more than one extreme point (Figure 2a,b).

2.1 Uncertainty in parameters induces uncertainty in policy

The uncertainty in parameter estimates θ^ induces uncertainty in the best policy (Figure 2c). Such problems are studied in robust optimization Ben-Tal2009 ; Bertsimas2011 ; Bertsimas2004 . The difference between a typical problem in robust optimization is that the goal here is not to find a policy that is robust to uncertainty in the constraints (e.g. that chooses the best arm assuming worst-case constraints), but rather the goal is to maintain a policy that is plausibly close to the true policy so that regret is minimized over many rounds.

An algorithm is feasible if it never plays an excluded arm. The confidence sets are constructed such that the decision boundaries lie in decision regions (Figure 2d) with high probability. For two arms, i and j, let θ~ti-θ~tjψtij and let Ψtij be the set difference Ψtij={ψd|ψ=θi-θj,θ𝒞t}. With this, we can prove:

Theorem 2.

Any rarely-switching algorithm that maintains (θ~ti-θ~tj)𝑐𝑜𝑛𝑒(Ψtij) for all i,j[k] and for all t is a feasible algorithm.

2.2 The make-no-mistakes maxim

We seek rarely-switching policies that only change when justified – when their behavior is plausibly sub-optimal. The above considerations suggest that the policy need not be changed when ψtijcone(Ψtij),i,j[k]. When this is not the case, how should the policy be updated? There are a few options based on the exact sense in which the policy should be conservative. We discuss two such ideas. One idea is that updates to the policy should be as small as possible to maintain feasibility, thus minimizing the chance that the update will in fact lead to higher expected regret. Such a policy seeks to minimize mistakes when changing policy. Since the magnitude of ψtij does not matter, we can consider the smallest update to the policy to be the update that keeps θ~t𝒞t and that minimizes the angle between ψtij and ψt+1ij.

2.3 High probability improvement in two dimensions

In the planar case (d=2), the idea of minimizing the change in angle of the decision boundaries can lead, in identifiable circumstances, to a decrease in the expected regret at each round (as always, with probability at least 1-δ). The idea is that when decision boundaries are outside a plausible interval, moving the boundary to the edge of this interval only affects the behavior of the policy in a region where the arm to pull is clear – and thus the policy moves from definitely incurring regret to not necessarily incurring regret for contexts in this region (Figure 2d). The requisite assumptions and a formal statement of this result are provided as Theorem B2 in the supplementary material.

2.4 The conservative rarely-switching bandit

Policy updates that definitely improve expected regret are harder to construct in higher dimensions (d>2). However we can use the same idea in a more general bandit algorithm. The idea of minimizing the angle between each decision boundary pair suggests updating θ~t as follows:


where F is the block matrix that computes the decision boundaries for each pair of arms. That is, Fφ returns ψtij for all pairs i<j. We call this the conservative rarely-switching bandit. This can be understood as a minimization of the cosine distance under inner product xTFTFy (see, for example, Wua ; Choi ). It could be solved with geodesic-convex routines Zhang2016c ; Ferreira2014 . Here we solve the problem using a projected gradient-descent method. The complete algorithm is detailed in the supplementary material. As a comparison, we also considered a simpler, alternative algorithm based on updating parameters simply whenever θ~t𝒞t, instead of basing decisions on when the decision boundaries are not feasible. We compared this algorithm to the conservative rarely-switching bandit in the supplementary material. We found that considering policies in terms of their decision boundaries does lead to fewer changes in policy, with comparable regret.

2.5 The greedy rarely-switching bandit

An issue with the conservative update is that, by aiming to place the decision boundaries on the edge of what is feasible, they may need to be updated again in the next few rounds – contrary to the aim of having a policy that does not need to be updated excessively. An alternative is to update the parameter with greedy updates: whenever Tθ~tcone(TΨt), then set θ~t=θ^t. This modifies the algorithm in an obvious way (see supplementary material). We call this the greedy rarely-switching bandit. It sacrifices confidence that a given update will definitely be an improvement for an algorithm that makes fewer updates to the policy.

3 Results

3.1 Simulations

To test the conservative and greedy rarely-switching bandit algorithms we generate synthetic data from a set of randomly generated arms. Each arm parameter is sampled randomly from d. Each algorithm is run for 10,000 rounds. Here we present results for k=4 arms and d=5 (additional experiments with other choices are provided in the supplementary material). At each round a context is sampled uniformly from 𝒟, and reward is given according to the chosen arm’s mean plus Gaussian noise with standard deviation 0.1. We take δ=0.0001. The bandit algorithms are compared to the LinUCB algorithm Abbasi-yadkori2011 , the rarely-switching variant of the LinUCB algorithm Abbasi-yadkori2011 , the greedy least squares algorithm Bastani2017 , and the conservative LinUCB algorithm (CLUCB) Kazerouni2017 .

Both rarely-switching (RS) algorithms perform well. First, the per-step regret (Rtt) is lowest for the greedy algorithm, consistent with previous bandit comparisons Bietti2018 (Figure 3a). However the RS algorithms perform almost as well, followed by the LinUCB and RS LinUCB methods. The CLUCB algorithm is the slowest to learn: despite the fact that its baseline arm is set to the be the arm with the highest expected return, meaning it starts with an already good policy, ultimately the CLUCB method has the highest per step regret. Second, we see that the mean number of changes for both rarely-switching algorithms is very low (Figure 3b). Indeed, the RS-greedy algorithm achieves low regret with as few as five average changes in policy. Both RS policies changes much less than the RS LinUCB algorithm (250). We also compare each algorithm’s performance relative to a policy that always chooses a fixed, ‘baseline’ arm (set to be the with the highest expected reward). By rarely updating their policy, the rarely-switching algorithms seldom have cumulative reward below the baseline policy (Figure 3c). In contrast, the LinUCB algorithms perform below the baseline rate on average around 6% of the time, while the CLUCB algorithm, by its design, never performs below the baseline. Finally, we examine the proportion of changes which result in lower expected regret for each method (Figure 3d). Consistent with our reasoning, a higher proportion of policy changes made by the RS-conservative algorithm do indeed lead to improved expected regret, compared with changes made by either RS-greedy or RS-LinUCB (both around chance). Taken together, these result show that rarely updating a bandit policy can result in efficient and safe learning, with only a very small number of changes to the policy.

Figure 3: Simulation results. Fifty random generated bandit parameters are run for 10000 rounds. a) The per-step regret (Rt/t). b) The number of policy changes. Both a) and b) traces show mean, plus/minus standard error. c) The percentage of rounds in which each method performs below the cumulative reward obtained when only a baseline arm is played. d) Percentage changes to policy that decrease expected regret. Error bars indicate 95% confidence intervals.

3.2 Real data

Next we apply the algorithms to a medical dataset: a study on the effect of high-quality child care and home visits on future cognitive performance. By presenting subjects sequentially to each algorithm, we test if the algorithms can be used to allocate treatment to subjects such that as many benefit as possible – or least more than would benefit from a randomized control trial (RCT). We use a semi-simulated infant health and development program (IDHP) dataset, in which counterfactuals are simulated from a learned model from the actual trial Hill2011b . This dataset has been used as a benchmark for causal learning (e.g. Shalit2016a ). The dataset consists of 747 subjects, and 100 simulated outcomes and counterfactuals. We observe that all algorithms learn an optimal treatment within the trial. However the rarely-switching and greedy algorithms learn significantly faster than the LinUCB and rarely-switching LinUCB (Figure 4a). Further, both rarely-switching algorithms learns a good policy with approximately 35 changes in policy (Figure 4b). These results suggest that these algorithms can be used to efficiently learn good policies with minimal changes, even over a short trial – making them relevant for adaptive clinical trial designs.

Figure 4: Data results. a) Per-step regret for each method over 100 realizations of IHDP semi-simulated trial. The per-step regret for a totally random policy (an RCT) is shown for comparison. b) Number of cumulative policy changes for each method that rarely switches policy.

4 Conclusion

Here we proposed an approach to optimize policies determined by a linear combination of covariates in a rarely-switching manner. We proved that, in some cases, we can construct updates that are guaranteed to improve the expected reward. Both in simulation and real data, the methods learn as efficiently as other state-of-the-art bandit algorithms. Further, the greedy rarely-switching approach requires very few changes to reach a good policy. Though we do not perform a comprehensive comparison between other bandit algorithms, these findings are consistent with previous bandit studies. For instance, Foster et al 2018 Foster2018 finds greedy algorithms, and algorithms that maintain a set of feasible parameter regions perform well.

There are two large caveats to the rarely-switching approach we developed. First, the algorithms are not guaranteed to converge with almost certainty. For applications where safety is important, this is an important consideration. However this shortcoming is shared by the other methods considered here. Second, in many of the cases considered, a linear model may not be reasonable. This could be addressed by considering GLMs, or by considering local regression models. Indeed, in measuring a causal effect at a threshold, RDD relies only on locally linear models. We may expect our results could be generalized to non-linear settings by considering a local linear regression at the threshold. This is left for future work.

The rarely-switching framework is particularly relevant to adaptive trial designs and cases where the number of arms (treatments) is small. In such cases there is a simultaneous goal of learning the treatment effect and maximizing positive outcomes. As discussed in the introduction, there is a close relationship between learning with bandit feedback and causal inference. Indeed, a number of recent studies have explored intersections of causal inference and bandit problems, considering cases where the actions of an agent are confounded Bareinboim2016 ; Forney2017 ; Zhang2017a ; Bareinboim2015 ; Lattimore2016 ; Lee2018 , or where methods from causal inference such as matching can prove useful Dimakopoulou2017 ; Offer-westort2018 . We complement these methods by showing how adaptive trials can be performed with few changes to policy.

The rarely-switching framework is also relevant to any policy determined by a fixed threshold. A number of authors have noted that thresholds could be used a lot more in medicine to measure causal effects Moscoe2015 ; Marinescu2018 . Our work shows how these thresholds could also be optimized.

More generally, the rarely-switching framework could be extended to policies in Markov Decision processes (MDPs). Safety in these settings is increasingly of interest Fan2019 ; Laroche2018 ; Komorowski2019 ; Carrara2019 ; Fulton2019 . We believe our approach method could be extended to make conservative changes to policies, with applications in particular to personalized medicine. Further, because a policy in deep reinforcement learning can be parameterized as a linear readout of a set of learned features Berg2016 , this approach could also be applied to deep learning and deep reinforcement learning Hechtlinger2019 . This is also left as future work.

In many real world settings, changing policies is expensive or difficult. As the use of machine learning to find optimal policies in medicine and economics increases, considering this inertia is increasingly important. Using the theory of contextual linear bandits, here we provided one of the first analyses of this problem.


  • [1] Yasin Abbasi-yadkori, David Pal, and Csaba Szepesvari. Improved Algorithms for Linear Stochastic Bandits. NIPS 2011, pages 1–19, 2011.
  • [2] Hossein Aboutalebi, Doina Precup, and Tibor Schuster. Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials. ArXiv e-prints, 2019.
  • [3] Yizhuang Alden Cheng. Regression Discontinuity Designs with Multiple Assignment Variables *. Technical report, 2016.
  • [4] Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with Unobserved Confounders: A Causal Approach. Advances in Neural Information Processing Systems, (November):1–9, 2015.
  • [5] Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
  • [6] Eleanor Barry, Samantha Roberts, Jason Oke, Shanti Vijayaraghavan, Rebecca Normansell, and Trisha Greenhalgh. Efficacy and effectiveness of screen and treat policies in prevention of type 2 diabetes : systematic review and meta-analysis of screening tests and interventions. BMJ, 356, 2017.
  • [7] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly Exploration-Free Algorithms for Contextual Bandits. ArXiv e-prints, pages 1–58, 2017.
  • [8] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, Princeton, 2009.
  • [9] E Van Den Berg. Some Insights into the Geometry and Training of Neural Networks. ArXiv e-prints, (2):1–33, 2016.
  • [10] Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian Optimization with Safety Constraints : Safe and Automatic Parameter Tuning in Robotics. ArXiv e-prints, 2016.
  • [11] Marinho Bertanha. Regression Discontinuity Design with Many Thresholds. SSRN, pages 1–62, 2019.
  • [12] Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and Applications of Robust Optimization. SIAM Review, 53(3):464–501, 2011.
  • [13] Dimitris Bertsimas and Melvyn Sim. Robust Discrete Optimization Under Ellipsoidal Uncertainty Sets. Technical Report, pages 1–23, 2004.
  • [14] Alberto Bietti, Alekh Agarwal, and John Langford. A Contextual Bandit Bake-off. ArXiv e-prints, 2018.
  • [15] Jacob Bor, Ellen Moscoe, Portia Mutevidzi, Marie-Louise Newell, and Till Barnighausen. Regression Discontinuity Designs in Epidemiology. Epidemiology, 25(5):729–737, 2014.
  • [16] Christopher Carpenter and Carlos Dobkin. The minimum legal drinking age and public health. J Econ Perspect, 25(2):133–156, 2011.
  • [17] Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-ambrym Maillard, Olivier Pietquin, and Renault Group. Scaling up Budgeted Reinforcement Learning. ArXiv e-prints, 2019.
  • [18] Matias D. Cattaneo, Luke Keele, Rocio Titiunik, and Gonzalo Vazquez-Bare. Extrapolating Treatment Effects in Multi-Cutoff Regression Discontinuity Designs. ArXiv e-prints, 2019.
  • [19] Matis D Cattaneo, Luke Keele, Rocio Titiunik, and Gonzalo Vazquez-Bare. Interpreting Regression Discontinuity Designs with Multiple Cutoffs. Journal of Politics, 78(4), 2016.
  • [20] Jonghyun Choi and Larry S Davis. Toward Sparse Coding on Cosine Distance. International Conference on Pattern Recognition, 22, 2014.
  • [21] Yinlam Chow, Ofir Nachum, Aleksandra Faust, and Edgar Duenez-guzman Mohammad. Lyapunov-based Safe Policy Optimization for Continuous Control. ArXiv e-prints, 2019.
  • [22] W Chu and L Li. Contextual bandits with linear payoff functions. International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • [23] Yahel David, Balázs Szörényi, Mohammad Ghavamzadeh, Shie Mannor, and Nahum Shimkin. PAC Bandits with Risk Constraints. ISAIM, 2018.
  • [24] Cheryl Dennison-himmelfarb, Joel Handler, and Daniel T Lackland. 2014 Evidence-Based Guideline for the Management of High Blood Pressure in Adults. JAMA, 311(5):507–520, 2014.
  • [25] Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Estimation Consideration in Contextual Bandits. ArXiv e-prints, 2017.
  • [26] Yingying Dong and Arthur Lewbel. Identifying the effect of changing the policy threshold in regression discontinuity models. The review of economics and statistics, 97(December):1081–1092, 2015.
  • [27] Jiameng Fan and Wenchao Li. Safety-guided deep reinforcement learning via online Gaussian process estimation. ArXiv e-prints, pages 1–13, 2019.
  • [28] O P Ferreira, A N Iusem, and S Z Nemeth. Concepts and techniques of optimization on the sphere. TOP, 22:1148–1170, 2014.
  • [29] Sarah Filippi, Olivier Capp, Aurelien Garivier, and Csaba Szepesvari. Parametric Bandits : The Generalized Linear Case. NIPS, pages 1–9, 2010.
  • [30] Andrew Forney, Judea Pearl, Elias Bareinboim, and West Lafayette. Counterfactual Data-Fusion for Online Reinforcement Learners. Proceedings of the 34th International Conference on Machine Learning, 70(June):1156–1164, 2017.
  • [31] Dylan J Foster, Alekh Agarwal, Miroslav Dud, and Luo Robert. Practical Contextual Bandits with Regression Oracles. ArXiv e-prints, 2018.
  • [32] Nathan Fulton and Andre Platzer. Verifiably Safe Off-Model Reinforcement Learning. ArXiv e-prints, 2019.
  • [33] Melody Y Guan and Heinrich Jiang. Nonparametric Stochastic Contextual Bandits. AAAI, 2018.
  • [34] Sudipto Guha and Kamesh Munagala. Multi-armed Bandits with Metric Switching Costs. In Proceedings of the 36th Internatilonal Collogquium on Automata, Languages and Programming: Part II, ICALP ’09, pages 496–507, Berlin, Heidelberg, 2009. Springer-Verlag.
  • [35] Yotam Hechtlinger, Barnabás Póczos, and Larry Wasserman. Cautious Deep Learning. ArXiv e-prints, 2019.
  • [36] Jennifer L Hill. Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • [37] Guido Imbens and Stefan Wager. Optimized Regression Discontinuity Designs. ArXiv e-prints, pages 1–22, 2017.
  • [38] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. ICML, 2002.
  • [39] Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem. NIPS, 2018.
  • [40] Sumeet Katariya, Branislav Kveton, Zheng Wen, and Vamsi K. Potluru. Conservative Exploration using Interleaving. PMLR, 89:1–16, 2019.
  • [41] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, and Benjamin Van Roy. Conservative Contextual Linear Bandits. Advances in neural information processing, 2017.
  • [42] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo. Understanding the Artificial Intelligence Clinician and optimal treatment strategies for sepsis in intensive care General comments on the safety of the AI Clinician. ArXiv e-prints, pages 1–13, 2019.
  • [43] Tomer Koren, Roi Livni, and Yishay Mansour. Bandits with Movement Costs and Adaptive Pricing. COLT, 2017.
  • [44] Tomer Koren, Roi Livni, and Yishay Mansour. Multi-Armed Bandits with Metric Movement Costs. Advances in Neural Information Processing Systems, 30, 2017.
  • [45] Andreas Krause and Cheng Soon Ong. Contextual Gaussian Process Bandit Optimization. Advances in Neural Information Processing Systems, pages 1–9, 2011.
  • [46] Romain Laroche and Orange Labs. Reinforcement learning algorithm selection. ICLR, 2018.
  • [47] Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal Bandits: Learning Good Interventions via Causal Inference. Advances in Neural Information Processing Systems, 2016.
  • [48] David S Lee. Randomized Experiments from Non-random Selection in US House Elections. Journal of Econometrics, 142(2), 2008.
  • [49] Sanghack Lee and Elias Bareinboim. Structural Causal Bandits with Non-manipulable Variables. AAAI, (November), 2018.
  • [50] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A Contextual-Bandit Approach to Personalized News Article Recommendation. WWW, 2010.
  • [51] Ioana E Marinescu, Patrick N Lawlor, and Konrad P Kording. Quasi-experimental causality in neuroscience and behavioral research. Nature Human Behaviour, 2(December):891–898, 2018.
  • [52] Ioana Elena Marinescu, Konrad Paul Kording, Sofia Triantafillou, and Konrad Paul Kording. Regression Discontinuity Threshold Optimization. SSRN, (1):1–6, 2019.
  • [53] Ellen Moscoe, Jacob Bor, and Till Bärnighausen. Regression discontinuity designs are underutilized in medicine, epidemiology, and public health: A review of current and best practice. Journal of Clinical Epidemiology, 68(2):122–133, 2015.
  • [54] Brian G Moss, William H Yeaton, and Jane E Lloyd. Evaluating the Effectiveness of Developmental Mathematics by Embedding a Randomized Experiment Within a Regression Discontinuity Design. Educational Evaluation and Policy Analysis, 36(2):170–185, 2014.
  • [55] Katta G Murty. Linear programming. Springer, 1983.
  • [56] Matthew Nayor, Ramachandran S Vasan, and Heart Study. Recent update to the US Cholesterol Treatment Guidelines: A comparison with international guidelines. Circulation, 133(18):1795–1806, 2017.
  • [57] Molly Offer-westort and Alexander Coppock. Adaptive Experimental Design : Prospects and Applications in Political Science. Annual Meeting of the American Political Science Association, 2018.
  • [58] Ankur Pandya, Stephen Sy, Sylvia Cho, Milton C Weinstein, and Thomas A Gaziano. Cost-effectiveness of 10-year risk thresholds for initiation of statin therapy for primary prevention of cardiovascular disease. JAMA, 314(2):142–150, 2015.
  • [59] John P Papay, John B Willett, and Richard J Murnane. Extending the regression-discontinuity approach to multiple assignment variables. Journal of Econometrics, 161(2):203–207, 2011.
  • [60] Judea Pearl. Causality: models, reasoning and inference. Cambridge Univ Press, 2000.
  • [61] Paat Rusmevichientong and John N Tsitsiklis. Linearly Parameterized Bandits. ArXiv e-prints, 2010.
  • [62] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-Aversion in Multi-armed Bandits. Advances in neural information processing, pages 1–20, 2013.
  • [63] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect : generalization bounds and algorithms. ArXiv e-prints, 0:1–22, 2016.
  • [64] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ArXiv e-prints, 2009.
  • [65] Wen Sun, Debadeepta Dey, and Ashish Kapoor. Risk-Aware Algorithms for Adversarial Contextual Bandits. ArXiv e-prints, 4:1–28, 2016.
  • [66] Csaba Szepesvari and Tor Lattimore. Bandit Algorithms. Cambridge University Press, page 542, 2018.
  • [67] Donald L Thistlewaite, Donald T Campbell, Peter Aronow, Nicole Basta, Betz Halloran, Matias Cattaneo, Gonzalo Vazquez-Bare, Guido Imbens, Alessan-Dra Mattei, Fabrizia Mealli, Jasjeet Sekhon, Rocío Titiunik, Vivian Wong, and Coady Wing. Regression-discontinuity Analysis: An alternative to the ex-post facto experiment. Journal of Educational Psychology, 51:309–317, 1960.
  • [68] Sattar Vakili and Qing Zhao. Risk-Averse Multi-Armed Bandit Problems under Mean-Variance Measure. IEEE Journal on Selected Topics in Signal Processing, 10(6):1093–1111, 2016.
  • [69] Vivian Wong, Peter Steiner, and Thomas Cook. Analyzing Regression-Discontinuity Designs With Multiple Assignment Variables : A Comparative Study of Four Estimation Methods. Journal of Educational and Behavioral Statistics, 38(2):107–141, 2013.
  • [70] Vivian C Wong, Peter M Steiner, and Thomas D Cook. Analyzing Regression-Discontinuity Designs with Multiple Assignment Variables : A Comparative Study of Four Estimation Methods. 2010.
  • [71] Xiang Wu, Zhi-guo Shi, and Lei Liu. Quasi Cosine Similarity Metric Learning. In C Jawahar and S Shan, editors, Lecture Notes in Computer Science, vol 9010. 2014.
  • [72] Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative Bandits. ICML, 48, 2016.
  • [73] Hongyi Zhang and Suvrit Sra. First-order Methods for Geodesically Convex Optimization. JMLR, 49(1):1–22, 2016.
  • [74] Junzhe Zhang and Elias Bareinboim. Transfer learning in multi-armed bandits: A Causal approach. IJCAI International Joint Conference on Artificial Intelligence, pages 1340–1346, 2017.
  • [75] Li Zhou. A Survey on Contextual Multi-armed Bandits. ArXiv e-prints, 2015.
  • [76] Alexander Zimin, Rasmus Ibsen-jensen, Krishnendu Chatterjee, and L G May. Generalized Risk-Aversion in Stochastic Multi-Armed Bandits. ArXiv e-prints, 2014.

Supplementary material

Appendix A Algorithms

We implement a projected gradient method. To solve


the method consists of alternating niter times between gradient updates to maximize:

K(φ)=θ~tTFTFφFθ~tFφ, (4)

followed by projections onto the convex set 𝒞t. Let J=FTF. Then the gradient with respect to φ is:


We decide to update the parameters if the smallest angle between θ~t-1 and φ𝒞t is above some threshold, arccos(Δ), and if the parameters θ~t-1 are indeed not in the current confidence set 𝒞t. The algorithm has parameters, niter, step size ϵ, and a tolerance at which we decide two vectors have the same angle, Δ. Here we use niter=100, ϵ=0.1 and Δ=0.01. The algorithm was tested for robustness to the choice in Δ – values of Δ between 10-1 and 10-4 do not affect the regret behavior (Supplementary Figure 5).


The conservative, rarely-switching bandit {algorithmic} \Requireinitial policy θ~0, regularization parameter λ, number of rounds T, initial gradient step size ϵ, tolerance Δ \StateInitialize V0λI\Fort[0,T] \StateObserve context: 𝐬t\StateChoose action: atargmaxa𝐬tTθ~t-1 and obtain reward yt\StateUpdate parameters: Vt, Ut, θ^t, βt and bounds 𝒞t\Stateφ0θ~t-1\Fori[1,niter] \StateSet step size: ηϵ/i\StateGradient step, using Suppl. Eq. (4): φφi-1+ηKφ(φi-1)\StateProject onto 𝒞t: φiargminϕ𝒞tϕ-φVt2\EndFor\Stateboundary_angles θ~t-1TJφniterθ~t-1JφniterJ\Ifboundary_angles <1-Δ and θ~t-1-θ^tVt>βt \Stateθ~tφniter\Else\Stateθ~tθ~t-1\EndIf\EndFor

The greedy algorithm only changes the last section of the algorithm. When an updated is required, it sets θ~t=θ^t.


The greedy, rarely-switching bandit {algorithmic} \Requireinitial policy θ~0, regularization parameter λ, number of rounds T, initial gradient step size ϵ, tolerance Δ \StateInitialize V0λI\Fort[0,T] \StateObserve context: 𝐬t\StateChoose action: atargmaxa𝐬tTθ~t-1 and obtain reward yt\StateUpdate parameters: Vt, Ut, θ^t, βt and bounds 𝒞t\Stateφ0θ~t-1\Fori[1,niter] \StateSet step size: ηϵ/i\StateGradient step, using Equation (4): φφi-1+ηKφ(φi-1)\StateProject onto 𝒞t: φiargminϕ𝒞tϕ-φVt2\EndFor\Stateboundary_angles θ~t-1TJφniterθ~t-1JφniterJ\Ifboundary_angles <1-Δ and θ~t-1-θ^tVt2>βt \Stateθ~tθ^t\Else\Stateθ~tθ~t-1\EndIf\EndFor

Figure 5: Robustness of RS algorithms to value of Δ. On IHDP dataset, value of delta between 10-1 and 10-4 has little impact on per step regret. But does affect the number of policy changes made. Dashed lines represent RS-greedy, solid lines represent RS-conservative for different Δ tolerances.

Appendix B Proofs of Theorems

The following is a restatement of Theorem 20.2 from [66], and provides an explicit form for the size of the uncertainty sets. We omit the proof here.

Theorem B1.

For any δ(0,1), with probability at least 1-δ, it holds that for all tN+,


Also, for θL then P(Et)1-δ with Ct defined using

Theorem 1.

With probability at least 1-δ, the regret of any policy learned by a feasible algorithm is bounded by


For a given context 𝐬t let 𝒥(𝐬t)𝒜 denote the set of valid arms that are not excluded from being played according to their confidence bounds. Assuming the confidence bounds hold, then if 𝒥(𝐬t) contains only one arm then a feasible algorithm incurs no regret in that round. Otherwise, let i* and j* be the pair of valid arms with the largest difference in payout for 𝐬t:


The optimal arm is one of the arms in 𝒥(𝐬t), as is the choice of a feasible algorithm, thus this is a bound for the regret in round t. Without loss of generality, let i* be the arm with the higher estimated payoff for context 𝐬t. Since j* is not excluded, the lower bound of arm i* is below the upper bound of arm j*. This means that:

𝐬tθ^ti*-βt𝐬t(Vti*)-1 𝐬tθ^tj*+βt𝐬t(Vtj*)-1
|𝐬t(θ^ti*-θ^tj*)| βti*𝐬t(Vti*)-1+βtj*𝐬t(Vtj*)-1 (5)


rt |𝐬tT(θi*-θj*)|

using: Equation (5); the Cauchy-Schwarz inequality; the fact that θ𝒞t; and the fact that 𝐬(Vt)-12=i=1k𝐬(Vti)-12, and thus that 𝐬(Vti)-1𝐬(Vt)-1 for any i. Up to a factor of 2, this is exactly the form of the regret bound obtained in the proof of Theorem 19.2 [66]. Thus the rest of the proof proceeds identically from there and establishes the result. ∎

Theorem 2.

Any rarely-switching algorithm that maintains (θ~ti-θ~tj)𝑐𝑜𝑛𝑒(Ψtij) for all i,j[k] and for all t is a feasible algorithm.


A feasible algorithm never plays an excluded arm. This can be judged on a pairwise basis: if the expected payoff for arm j is lower than that of any other arm i, then j is excluded. For a given context 𝐬t, all excluded arms can be determined by considering all pairs of arms. Thus we need only focus on characterizing the decision boundaries between pairs of arms. The confidence sets are constructed such that these decision boundaries lie in decision regions with high probability. For two arms, i and j, let Ψtij be the set difference Ψtij={ψd|ψ=θi-θj,with θ𝒞t}.

Then the decision boundary between arms i and j lies in the set:


This region can be understand as the complement of the union of two sets: the dual cone of Ψtij and the polar cone of Ψtij. The dual cone corresponds to contexts in which, for any set of feasible parameters θ𝒞t, arm i is better: Ct*ij={𝐬d|𝐬Tψij>0,ψijΨtij}. The polar cone, Dt*ij-Ct*ij is the opposite: it represents contexts for which, given any combination of feasible parameters, arm j is better.

If θ~i-θ~jcone(Ψtij) then α>0 such that


for some θi,θj𝒞t. This means there exists feasible parameters that have the same decision boundary as θ~i,θ~j. This means that, for all 𝐬Ct*ij, where arm θi is better, the policy θ~ will not play arm j (the excluded arm). Thus θ~ does not play any excluded arm, and is a feasible algorithm. Note that if 0Ψtij then any decision boundary between i and j is plausible, and this pair of arms does not constrain if θ~ is feasible or not.

We can prove in d=2 there are updates that always improve the expected regret. The intuition is that, in two dimensions, the policy θ~t defines intervals over angles in which each arm is played. The problem thus becomes one dimensional – it can be considered as a cyclic ordering of the arms based on angle, σ=(σ(θ~1)σ(θ~k)), and a set of angles corresponding to the decision boundaries between the pair of arms θ~σi and θ~σi+1. Then we have:

Theorem B2.

For d=2, if an update to the policy θ~t satisfies:

  1. 1.

    the order of the arms σ is the same for θ~t and θ~t+1,

  2. 2.

    each new decision boundary ψt+1σi,σi+1 stays within the next and previous decision boundaries given by θ~t:

  3. 3.

    each change ψtijψt+1ij is the smallest such that ψt+1ij𝑐𝑜𝑛𝑒(Ψt+1ij). I.e. that

  4. 4.

    If ψtσi,σi+1ψt+1σi,σi+1, then in each interval [ψtσi,σi+1,ψt+1σi,σi+1], there are no plausible decision boundaries that involve σi (and σi+1 if instead ψtσi,σi+1>ψt+1σi,σi+1)

then it does not increase the expected instantaneous regret, for a fixed policy, in the sense that:


with probability at least 1-δ.


In the planar case, edges constitute the faces of the polytope, thus specifying the decision boundaries (edges) is the same as the H-representation of the polytope – it uniquely specifies the polytope. Thus we can equally think of dealing with vertices (arm parameters), or decision boundaries. Here we deal entirely with the decision boundaries.

The assumptions on the change in policy implicitly assume that the number of playable arms stays the same between t and t+1 – that is, an arm does not become dominated (stops being an extreme point of the polytope), or stop being dominated. Note also that if 0Ψt+1ij then any boundary ψt+1ij is feasible, and this does not contribute to changing the policy.

We prove the result by considering each interval [ψtσi,σi+1,ψt+1σi,σi+1]. By the assumptions of the update, no interval overlaps with any other interval. This is because, by the conditions of the theorem, the new border is the first boundary encountered. That is, we can speak about the new boundary ψt+1σi,σi+1 as being either before or after the prior boundary ψtσi,σi+1. If after, then both of the following are true:

ψtσi,σi+1 ψt+1σi,σi+1ψtσi+1,σi+2(condition ii)
ψtσi,σi+1 ψt+1σi,σi+1ψt+1σi+1,σi+2(condition i)

and if before, then

ψt+1σi,σi+1 ψtσi,σi+1ψtσi+1,σi+2(ordering)
ψt+1σi,σi+1 ψtσi,σi+1ψt+1σi+1,σi+2(condition ii)

In either case, the next interval [ψtσi+1,σi+2,ψt+1σi+1,σi+2] does not overlap with the current interval [ψtσi,σi+1,ψt+1σi,σi+1]. The same argument applies to the previous interval. Thus all such intervals do not overlap with one another. Since the behavior of θ~t and θ~t+1 is the same outside of these intervals then to establish that the change in policy does not incur more regret we just need to study the behavior of the policy in each of these intervals separately.

So, for contexts in a given interval, between t and t+1, the policy switches from playing σi+1 to σi (or vice versa, depending on the ordering). Assume the policy switches from σi+1 to σi; the same argument applies to the opposite case. Either the interval is empty, and the policy behaves the same between t and t+1, or by assumption the decision boundary is moved just to the edge of the feasible boundary between arms σi and σi+1.

Assume the policy is moved just to the edge of the feasible boundary between the arms. We need to consider two scenarios, either it is plausible a true decision boundary between σi+1 and another arm lies in the interval, or it does not.

By condition iv, there cannot exist a plausible true decision boundary between σi+1 and another arm. Further, since the movement to the boundary of the feasible region means that the interval cannot contain the true decision boundary σi,σi+1. Thus, by the conditions, in the interval there is no way that playing σi+1 was feasible. The old policy played σi+1 in this interval, and so only incurs regret. By changing the policy to play σi in this interval, expected regret cannot increase. ∎

Remark: The assumptions are not likely to be true at the start of the trial, where the ordering of the arms is unclear and the uncertain policy regions are not isolated from one another. However these assumptions may be reasonable later, in which case they provide a guarantee that the bandit never decreases its performance. Each condition is easy to check, thus the result provides a make-no-mistakes algorithm in 2D.

Appendix C Extra simulation results

We perform the same simulations as in the main text, except vary the number of dimensions d and the number of arms k. We find similar results for d5. For d=2, the rarely switching bandits do not perform as well as LinUCB (Supplementary Figure 6).

Figure 6: Simulation results varying dimension d and number of arms k. Fifty random generated bandit parameters are run for 10000 rounds. Plotted are the per-step regret (Rt/t) and the number of policy changes. Traces show mean, plus/minus standard error.

Appendix D Comparison to other possible feasible algorithms

A simpler algorithm worth considering is just to take:


That is, as soon as the parameters θ~t are not feasible then update θ~. This will result in more updates than the rarely-switching algorithms we propose, which decide when to update based on the decision boundaries being not feasible, as opposed to the arm parameters. A greedy version of this would be to take:


To distinguish these algorithms from the rarely-switching versions, call these the feasible conservative, and feasible greedy algorithms, respectively.

We compare these algorithms to the rarely-switching algorithms on the same synthetic set of linear bandits as in the main text and above. These feasible algorithms have good per-step regret on the same simulated tasks (Supplementary Figure 7a). However, the number of changes to the policy is greater than the rarely-switching conservative and greedy versions (Supplementary Figure 7b). Thus considering only updating policy when the decision boundary becomes infeasible, as opposed to the arm parameters, is worthwhile if the goal is to minimize policy changes. Further, the feasible-conservative algorithm does not produce updates which have a good chance of leading to decreased regret, in contrast to the rarely-switching conservative algorithm (Supplementary Figure 7d).

Figure 7: Feasible and rarely-switching comparison results. Fifty random generated bandit parameters are run for 10000 rounds. a) The per-step regret (Rt/t). b) The number of policy changes. Both a) and b) traces show mean, plus/minus standard error. c) The percentage of rounds in which each method performs below the cumulative reward obtained when only a baseline arm is played. d) Percentage changes to policy that decrease expected regret. Error bars indicate 95% confidence intervals.