Abstract
Exploring the effect of policies in many real world scenarios is difficult,unethical, or expensive. After all, doctor guidelines, tax codes, and pricelists can only be reprinted so often. We may thus want to only change a policywhen it is probable that the change is beneficial. Fortunately, thresholdsallow us to estimate treatment effects. Such estimates allows us to optimizethe threshold. Here, based on the theory of linear contextual bandits, wepresent a conservative policy updating procedure which updates a deterministicpolicy only when needed. We extend the theory of linear bandits to thisrarelyswitching case, proving such procedures share the same regret, up toconstant scaling, as the common LinUCB algorithm. However the algorithm makesfar fewer changes to its policy. We provide simulations and an analysis of aninfant health wellbeing causal inference dataset, showing the algorithmefficiently learns a good policy with few changes. Our approach allowsefficiently solving problems where changes are to be avoided, with potentialapplications in economics, medicine and beyond.
Quick Read (beta)
Rarelyswitching linear bandits: optimization of causal effects for the real world
Abstract
Exploring the effect of policies in many real world scenarios is difficult, unethical, or expensive. After all, doctor guidelines, tax codes, and price lists can only be reprinted so often. We may thus want to only change a policy when it is probable that the change is beneficial. Fortunately, thresholds allow us to estimate treatment effects. Such estimates allows us to optimize the threshold. Here, based on the theory of linear contextual bandits, we present a conservative policy updating procedure which updates a deterministic policy only when needed. We extend the theory of linear bandits to this rarelyswitching case, proving such procedures share the same regret, up to constant scaling, as the common LinUCB algorithm. However the algorithm makes far fewer changes to its policy. We provide simulations and an analysis of an infant health wellbeing causal inference dataset, showing the algorithm efficiently learns a good policy with few changes. Our approach allows efficiently solving problems where changes are to be avoided, with potential applications in economics, medicine and beyond.
Rarelyswitching linear bandits: optimization of causal effects for the real world
Benjamin J. Lansdell, Department of Bioengineering University of Pennsylvania Pennsylvania, PA 19104 [email protected] Sofia Triantafillou Department of Biomedical Informatics University of Pittsburgh School of Medicine Pittburgh, PA 15206 Konrad P. Kording Department of Bioengineering University of Pennsylvania Pennsylvania, PA 19104
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Many decisions in healthcare, economics, and beyond are based on thresholds Moscoe2015 ; Bor2014 . Scholarships are awarded when exam scores exceed a threshold Thistlewaite1960 , assistance is given to those below an income thresholds, and many medical treatments and diagnoses are based on thresholding biometrics (e.g. cholesterol Nayor2017 ; Pandya2015 , high blood pressure Dennisonhimmelfarb2014 , and diabetes/prediabetes Barry ). Such policies are often applied where it is difficult or unethical to experiment with the population and a policy should benefit the population as much as possible. Exploration is avoided: if a patient arrives with high blood pressure then they receive treatment – we can not explore with the aim of learning more about the treatment effect. Prior studies will typically have established treatment benefits. Thresholds are used when only the population above the threshold is expected to benefit from the treatment. While these policies are generally chosen based on domain knowledge, they could be further optimized.
Optimizing a policy while acting greedily is challenging. Particularly because, in cases where policies are health or economic recommendations, it may be difficult or costly to change a policy excessively. That is, there is an added need to find an optimal threshold in a way that minimizes both the number of erroneous changes and the number of total changes to the policy. In these cases changes to a policy should only be made when there is sufficient evidence it will improve outcomes – they should be rarely switching Abbasiyadkori2011 . There is a conflict between the greedy aims of such a policy and the need to explore to optimize it.
How can a policy be efficiently learned while maintaining confidence each update will improve utility? The effect of a change in policy is an extrapolation: the effect of treatment for those well below a current threshold can only be estimated from the population above that threshold, since they are the only subjects to receive treatment according to the policy. Yet there may be many reasons a treatment will effect these populations differently. However, the populations just above and below the threshold are similar – the only difference between them is that one group receives the treatment and the other does not. In fact this allows for a measure of the causal effect of the treatment for this subpopulation pearl2000causality . And this in turn provides a measure of the change in utility for a small change in policy.
The idea of measuring causal effects by comparing populations at a threshold is known as regression discontinuity design (RDD), and is commonly used in econometrics Imbens2017 . RDD is often used when a policy is determined by thresholding on a single covariate (Figure 1a). It has been used to estimate the effects of education programs Moss2014 , election outcomes Lee2008b , and policies determined by ageeligibility Carpenter2011 . It can also be extended to higher dimensions (Figure 1b; Bertanha2019 ; Cattaneo2019 ; Cheng2016 ; Wong2010 ; Cattaneo2016 ) or setting with multiple treatment options (Figure 1c; Papay2011 ). For instance, students may take tests with thresholds in several different subjects, and school districts may reward teachers for improving student test scores in both mathematics and English Papay2011 ; Wong2013 . When the conditions for their use are valid, RD designs are a reliable way to measure treatment effects. The above reasoning shows that causal effects as measured by RDD can be used to make small updates to a threshold with high confidence it will improve utility Dong2015a ; Marinescu . However, what schedule of updates to a threshold will converge on an optimal policy?
We propose to answer this question by framing the problem as a contextual multiarmed bandit. The balance between exploration and exploitation is well studied in multiarm bandits. In a bandit problem an agent must choose from a set of actions at each round and only observes reward for the action chosen Szepesvari2018 . A contextual bandit provides the agent with side information about the set of choices at each round, which can be used to make a more informed decision Zhou2015a . Contextual multiarmed bandits are used in personalized medicine, targeted advertising and website design. Much is known under a variety of assumptions about the relation between context and reward – e.g. using linear models Abbasiyadkori2011 ; Chu2011 ; Li2010a , Gaussian processes Krause2011 ; Srinivas2009 ; Guan2018 , and GLMs Filippi2010 . Further, there is an intimate relation between learning in a bandit setting and the defining problem of causal inference – that only one of multiple potential outcomes are ever observed – and thus there is a natural correspondence between multiarmed bandits and optimization in a setting where causal effects must also be learned. Many of the above motivating examples, once causal effects are identifiable, could be optimized using theory of multiarmed bandits. Here we draw from this theory to understand how to optimize rarely switching policies.
As machine learning and AI are increasingly used to aid decision making, there is a growing need for safety and performance guarantees to be understood and implemented Berkenkamp2015 ; Aboutalebi2019 . Thus a lot of recent work considers conservative or safe bandit algorithms. For instance, methods that do not perform below some baseline level Wu2016 ; Katariya2018 ; Kazerouni2017 ; Kakade ; Sun2016a ; Chow2019 , or that are riskaverse, avoiding large variance in outcomes Vakili2016 ; Sani2013 ; David2016 ; Zimin2014 , have been developed. In particular, recent work has considered explorationfree, or greedy bandit algorithms, in a linear contextual setting Bastani2017 ; Kannan2018 . Surprisingly, under certain distributional assumptions, noise alone provides the necessary exploration to allow learning for free Bastani2017 . In fact, a recent comparison between contextual bandit algorithms shows that greedy algorithms can do quite well in practice Bietti2018 ; Foster2018 . This suggests that sufficient conditions for a greedy algorithm to work well do occur: exploration bonuses can indeed be avoided in some settings. For this reason here we investigate rarelyswitching policy optimization by casting the problem as a linear contextual bandit.
Only limited previous work has considered rarelyswitching policies. For instance, the case where there is a cost to changing actions has received attention Guha2009 ; Koren2017 ; Koren2017a . AbbasiYadkori et al 2011 Abbasiyadkori2011 consider a rarelyswitching class of Linear upper confidence bound (LinUCB) algorithms. However their aim is to save computation, and not to minimize the number of changes made to avoid exploration with policy. Rather than considering cases where it is costly to change actions, we consider the case where it is costly to change policy.
Here we study how to find optimal thresholds while minimizing changes in policy as a linear contextual bandit. We consider a family of policies defined by a fixed threshold on a linear combination of covariates, that rarely switches, and that generalizes to higher dimensions and number of arms. We propose conservative policy updating algorithms that only update their threshold when justified. This extends the theory of the linear upper confidence bound (LinUCB) algorithm, and so the algorithms, up to constants, share its asymptotic regret behavior Abbasiyadkori2011 . However they result in fewer erroneous changes to the policy, and fewer overall changes to the policy. Empirically, we show that these methods compare favorably to previous a linear bandit algorithm that rarely switch its policy Abbasiyadkori2011 . Further, we show how such algorithms compare to methods that improve on a baseline policy, such as the conservative linear UCB (CLUCB) algorithm Kazerouni2017 , where with high probability the algorithm always performs better than a baseline, initial policy. We verify the method with simulated data, and show its utility on an infant health, causal inference dataset. This work thus shows how conservative bandit approaches can be extended to a common and important decision making framework.
1.1 Linear contextual bandits
We study the following version of the common stochastic linear bandit Tsitsiklis2010 ; Szepesvari2018 ; Kazerouni2017 ; Abbasiyadkori2011 . At each round, $t$, the agent observes a context ${\mathbf{s}}_{t}\in \mathcal{D}\subseteq {\mathbb{R}}^{d}$ and then selects an action ${a}_{t}\in \mathcal{A}$. We assume each context is drawn independently from a distribution $\rho $. The context may represent a feature vector from an underlying state: ${\mathbf{s}}_{t}=\varphi ({\mathbf{x}}_{t})$. Here we consider cases where each arm is presented with the same information (e.g. when ${\mathbf{s}}_{t}$ represents patient information and arms are a fixed set of treatment options – similar to the set up of Bastani2017 ). This can be considered a special case of the framework in which the function $\varphi $, and hence the context ${\mathbf{s}}_{t}$, also depend on action ${a}_{t}$ Abbasiyadkori2011 ; Szepesvari2018 . The agent observes a reward linearly dependent on the features
$${y}_{t}={\mathbf{s}}_{t}^{T}{\theta}^{{a}_{t}}+{\eta}_{t},$$ 
for ${\theta}^{{a}_{t}}\in {\mathbb{R}}^{d}$ the unknown reward parameters. Let $\theta ={({\theta}^{a})}_{a\in \mathcal{A}}$ be the set of all parameters. Let ${z}_{t}={\mathbf{s}}_{t}^{T}{\theta}^{{a}_{t}}$ denote the expected reward, and let ${\mathcal{F}}_{t}$ be the filtration capturing the history of the process up until observing reward at round $t$: ${\mathcal{F}}_{t}=\sigma ({a}_{1},\mathrm{\dots},{a}_{t},{\eta}_{1},\mathrm{\dots},{\eta}_{t1})$.
We make the following standard assumptions:
Assumption 1.
The noise ${\eta}_{t}$ is conditionally $\sigma $subgaussian:
$$\mathbb{E}\left({e}^{c{\eta}_{t}}{\mathcal{F}}_{t}\right)\le \mathrm{exp}(c{\sigma}^{2}/2),\forall c\in \mathbb{R}.$$ 
The problem is bounded in the following way:
Assumption 2.
There exist constants $B\mathrm{,}D\mathrm{\ge}\mathrm{0}$ such that $\mathrm{\parallel}{\theta}^{a}\mathrm{\parallel}\mathrm{\le}B$, $\mathrm{\parallel}{\mathrm{s}}_{t}\mathrm{\parallel}\mathrm{\le}D$ and $\mathrm{\parallel}{\mathrm{s}}_{t}^{T}\mathit{}{\theta}^{a}\mathrm{\parallel}\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}$ for all $t$ and $a\mathrm{\in}\mathrm{A}\mathrm{.}$
At each round, for the observed context, there is an optimal arm to pull: ${a}_{t}^{*}={\text{argmax}}_{a}{\mathbf{s}}_{t}^{T}{\theta}^{a}$, splitting ties arbitrarily. From this we can define the instantaneous regret:
$${r}_{t}={\mathbf{s}}_{t}^{T}({\theta}^{{a}_{t}^{*}}{\stackrel{~}{\theta}}_{t}^{{a}_{t}}).$$ 
We aim to find a policy $\pi :\mathcal{D}\to \mathcal{A}$ that minimizes the cumulative regret over $T$ rounds:
$${\widehat{R}}_{T}=\sum _{t=1}^{T}{r}_{t}.$$ 
Let ${R}_{T}=\mathbb{E}({\widehat{R}}_{T})$, where the expectation is taken over histories up to horizon $T$.
1.2 Rarelyswitching policies
The optimal policy is a multiclass classifier:
$${\pi}^{*}({\mathbf{s}}_{t})={\text{argmax}}_{a}{\mathbf{s}}_{t}^{T}{\theta}^{a}.$$  (1) 
Here we consider learning policies whose basic form is the following:
$$\pi ({\mathbf{s}}_{t})={\text{argmax}}_{a}{\mathbf{s}}_{t}^{T}{\stackrel{~}{\theta}}_{t}^{a},$$  (2) 
and thus that are parameterized by the set ${\stackrel{~}{\theta}}_{t}={\{{\stackrel{~}{\theta}}_{t}^{a}\}}_{a\in \mathcal{A}}$, which can be held constant between rounds. We call this family of policies rarelyswitching policies. They can be seen as policies in which decisions are made by thresholding on a linear combination of context variables, which is only updated rarely.
1.3 Confidence bounds
Rarelyswitching policies operate based on the provisionally optimal arm given a set of operational parameter values $\stackrel{~}{\theta}$. A policy needs to update these parameters on some schedule to be close to current estimate of the true parameters. We thus maintain the least squares estimate of parameters given the pulls of each particular arm, ${\widehat{\theta}}_{t}^{a}$. Let ${T}_{t}^{a}$ be the set of times for which arm $a$ was pulled up to and including round $t$. Then let the arm parameter estimates be given by the ridge regression solution
${V}_{t}^{a}$  $=\lambda I+{\displaystyle \sum _{s\in {T}_{t}^{a}}}{\mathbf{s}}_{s}{\mathbf{s}}_{s}^{T},$  
${\widehat{\theta}}_{t}^{a}$  $={({V}_{t}^{a})}^{1}{\displaystyle \sum _{s\in {T}_{t}^{a}}}{y}_{s}{\mathbf{s}}_{s},$ 
for regularization parameter $\lambda >0$. Let ${V}_{t}$ be the block diagonal matrix constructed from the matrices $\{{V}_{t}^{a}\}$. To construct working rarelyswitching bandit algorithms we also must construct confidence bounds for each arm. Care is needed in constructing confidence sets for these estimators because, unlike in standard least squares estimation, the sequence of observations ${\{{\mathbf{s}}_{s},{y}_{s}\}}_{s=1}^{t}$ are no longer independent and identically distributed – the action chosen depends on the history of the process, which induces correlations among ${\{{\mathbf{s}}_{\mathbf{t}}\}}_{t\in {T}_{t}^{a}}$. By using Martingale theory, we can find confidence intervals that apply for any policy Abbasiyadkori2011 ; Szepesvari2018 . Let ${\mathcal{C}}_{t}$ be the following set:
$${\mathcal{C}}_{t}=\{\mathbf{x}\in {\mathbb{R}}^{d}:{\parallel {\widehat{\theta}}_{t1}^{a}\mathbf{x}\parallel}_{{V}_{t1}}^{2}\le {\beta}_{t}\},$$ 
where ${\parallel \mathbf{x}\parallel}_{V}^{2}={\mathbf{x}}^{T}V\mathbf{x}$, for positive definite matrix $V$, and some bound ${\beta}_{t}$. We can place a lower bound on the probability that the true parameters lie in the set of confidence sets for all rounds. Let this be the event:
$${\mathcal{E}}_{t}={\cap}_{n=1}^{t}\{\theta \in {\mathcal{C}}_{n}\}.$$ 
Theorem 20.2 of Szepesvari2018 provides an explicit form of ${\beta}_{t}$ for which the event ${\mathcal{E}}_{t}$ occurs with probability at least $1\delta $ (provided as Theorem B1 in the supplementary material).
1.4 Feasible contextual linear bandits
Which rarelyswitching algorithms learn the optimal parameters? We answer this question by first studying a more general class of algorithms. Consider the general case of the $k$armed bandit, $\mathcal{A}=\{1,\mathrm{\dots},k\}$. For each observed context, ${\mathbf{s}}_{t}$, the confidence bounds define plausible intervals of payoffs for each arm. That is, assuming ${\mathcal{E}}_{t}$ occurs, the true expected reward for arm $a$ lies in the interval
$${\mathbf{s}}^{T}{\widehat{\theta}}_{t}^{a}\pm \sqrt{{\beta}_{t}}{\parallel \mathbf{s}\parallel}_{{({V}_{t}^{a})}^{1}}.$$ 
For each round, assuming the confidence bounds hold, the upper bound payoff of some arms may be below the lower bound payoff of another arm, excluding these arms from plausibly being the best choice. Call any algorithm that never plays these excluded arms a feasible algorithm. With this we can establish the following result:
Theorem 1.
With probability at least $\mathrm{1}\mathrm{}\delta $, the regret of any policy learned by a feasible algorithm is bounded by
$${\widehat{R}}_{T}\le \sqrt{32dT{\beta}_{T}\mathrm{log}\left(\frac{trace({V}_{0})+T{L}^{2}}{d{det}^{1/d}({V}_{0})}\right)}.$$ 
All proofs are provided as supplementary material. Corollary 19.3 from Szepesvari2018 can be applied and then provides:
Corollary 1.
Choosing $\delta \mathrm{=}\mathrm{1}\mathrm{/}T$, the expected regret obeys
$${R}_{T}\le Cd\sqrt{T}\mathrm{log}(TL),$$ 
for constant $C\mathrm{>}\mathrm{0}$.
The LinUCB algorithm and the greedy algorithm (${\stackrel{~}{\theta}}_{t}={\widehat{\theta}}_{t}$, see, for example, Bastani et al Bastani2017 ) are feasible algorithms.
2 Feasible, rarelyswitching bandits
We now consider feasible, rarelyswitching algorithms that only update their policy when justified. To derive them we first consider the geometry of the linear contextual bandit.
For each context, ${\mathbf{s}}_{t}$, policies of the form (1) are linear programs murty1983linear . That is, they are solutions to:
$$\underset{\mathbf{x}\in \text{conv}(\theta )}{\mathrm{max}}{\mathbf{s}}_{t}^{T}\mathbf{x},$$  (3) 
where $C\u2254\text{conv}(\theta )$ denotes the convex hull of the set of arm parameters ${\{{\theta}^{a}\}}_{a=1}^{k}$, and can equivalently be represented as a set of linear inequalities $A\mathbf{x}\le \mathbf{b}$. $C$ is a closed convex polytope in ${\mathbb{R}}^{d}$. For all ${\mathbf{s}}_{t}\in {\mathbb{R}}^{d}\setminus \{0\}$, a solution to (3) can found on one of the vertices of $\text{conv}(\theta )$ (murty1983linear , Theorem 3.3). Thus a solution to (1) is a solution to (3) and at least one solution of (3) is a solution to (1). This correspondence provides a lot of structure. For instance, it is clear that arms which are not extreme points of $C$ are never played. And, the decision boundaries of $\pi ({\mathbf{s}}_{t})$ can be seen as contexts ${\mathbf{s}}_{t}$ which define supporting hyperplanes which contain more than one extreme point (Figure 2a,b).
2.1 Uncertainty in parameters induces uncertainty in policy
The uncertainty in parameter estimates $\widehat{\theta}$ induces uncertainty in the best policy (Figure 2c). Such problems are studied in robust optimization BenTal2009 ; Bertsimas2011 ; Bertsimas2004 . The difference between a typical problem in robust optimization is that the goal here is not to find a policy that is robust to uncertainty in the constraints (e.g. that chooses the best arm assuming worstcase constraints), but rather the goal is to maintain a policy that is plausibly close to the true policy so that regret is minimized over many rounds.
An algorithm is feasible if it never plays an excluded arm. The confidence sets are constructed such that the decision boundaries lie in decision regions (Figure 2d) with high probability. For two arms, $i$ and $j$, let ${\stackrel{~}{\theta}}_{t}^{i}{\stackrel{~}{\theta}}_{t}^{j}\u2254{\psi}_{t}^{ij}$ and let ${\mathrm{\Psi}}_{t}^{ij}$ be the set difference ${\mathrm{\Psi}}_{t}^{ij}=\{\psi \in {\mathbb{R}}^{d}\psi ={\theta}^{i}{\theta}^{j},\theta \in {\mathcal{C}}_{t}\}$. With this, we can prove:
Theorem 2.
Any rarelyswitching algorithm that maintains $\mathrm{(}{\stackrel{\mathrm{~}}{\theta}}_{t}^{i}\mathrm{}{\stackrel{\mathrm{~}}{\theta}}_{t}^{j}\mathrm{)}\mathrm{\in}\text{\mathit{c}\mathit{o}\mathit{n}\mathit{e}}\mathit{}\mathrm{(}{\mathrm{\Psi}}_{t}^{i\mathit{}j}\mathrm{)}$ for all $i\mathrm{,}j\mathrm{\in}\mathrm{[}k\mathrm{]}$ and for all $t$ is a feasible algorithm.
2.2 The makenomistakes maxim
We seek rarelyswitching policies that only change when justified – when their behavior is plausibly suboptimal. The above considerations suggest that the policy need not be changed when ${\psi}_{t}^{ij}\in \text{cone}({\mathrm{\Psi}}_{t}^{ij}),\forall i,j\in [k]$. When this is not the case, how should the policy be updated? There are a few options based on the exact sense in which the policy should be conservative. We discuss two such ideas. One idea is that updates to the policy should be as small as possible to maintain feasibility, thus minimizing the chance that the update will in fact lead to higher expected regret. Such a policy seeks to minimize mistakes when changing policy. Since the magnitude of ${\psi}_{t}^{ij}$ does not matter, we can consider the smallest update to the policy to be the update that keeps ${\stackrel{~}{\theta}}_{t}\in {\mathcal{C}}_{t}$ and that minimizes the angle between ${\psi}_{t}^{ij}$ and ${\psi}_{t+1}^{ij}$.
2.3 High probability improvement in two dimensions
In the planar case ($d=2$), the idea of minimizing the change in angle of the decision boundaries can lead, in identifiable circumstances, to a decrease in the expected regret at each round (as always, with probability at least $1\delta $). The idea is that when decision boundaries are outside a plausible interval, moving the boundary to the edge of this interval only affects the behavior of the policy in a region where the arm to pull is clear – and thus the policy moves from definitely incurring regret to not necessarily incurring regret for contexts in this region (Figure 2d). The requisite assumptions and a formal statement of this result are provided as Theorem B2 in the supplementary material.
2.4 The conservative rarelyswitching bandit
Policy updates that definitely improve expected regret are harder to construct in higher dimensions ($d>2$). However we can use the same idea in a more general bandit algorithm. The idea of minimizing the angle between each decision boundary pair suggests updating ${\stackrel{~}{\theta}}_{t}$ as follows:
$${\stackrel{~}{\theta}}_{t+1}={\text{argmax}}_{\phi \in {\mathcal{C}}_{t}}\frac{{\stackrel{~}{\theta}}_{t}^{T}{F}^{T}F\phi}{\parallel F{\stackrel{~}{\theta}}_{t}\parallel \parallel F\phi \parallel},$$ 
where $F$ is the block matrix that computes the decision boundaries for each pair of arms. That is, $F\phi $ returns ${\psi}_{t}^{ij}$ for all pairs $$. We call this the conservative rarelyswitching bandit. This can be understood as a minimization of the cosine distance under inner product ${x}^{T}{F}^{T}Fy$ (see, for example, Wua ; Choi ). It could be solved with geodesicconvex routines Zhang2016c ; Ferreira2014 . Here we solve the problem using a projected gradientdescent method. The complete algorithm is detailed in the supplementary material. As a comparison, we also considered a simpler, alternative algorithm based on updating parameters simply whenever ${\stackrel{~}{\theta}}_{t}\notin {\mathcal{C}}_{t}$, instead of basing decisions on when the decision boundaries are not feasible. We compared this algorithm to the conservative rarelyswitching bandit in the supplementary material. We found that considering policies in terms of their decision boundaries does lead to fewer changes in policy, with comparable regret.
2.5 The greedy rarelyswitching bandit
An issue with the conservative update is that, by aiming to place the decision boundaries on the edge of what is feasible, they may need to be updated again in the next few rounds – contrary to the aim of having a policy that does not need to be updated excessively. An alternative is to update the parameter with greedy updates: whenever $T{\stackrel{~}{\theta}}_{t}\notin \text{cone}(T{\mathrm{\Psi}}_{t})$, then set ${\stackrel{~}{\theta}}_{t}={\widehat{\theta}}_{t}$. This modifies the algorithm in an obvious way (see supplementary material). We call this the greedy rarelyswitching bandit. It sacrifices confidence that a given update will definitely be an improvement for an algorithm that makes fewer updates to the policy.
3 Results
3.1 Simulations
To test the conservative and greedy rarelyswitching bandit algorithms we generate synthetic data from a set of randomly generated arms. Each arm parameter is sampled randomly from ${\mathbb{R}}^{d}$. Each algorithm is run for 10,000 rounds. Here we present results for $k=4$ arms and $d=5$ (additional experiments with other choices are provided in the supplementary material). At each round a context is sampled uniformly from $\mathcal{D}$, and reward is given according to the chosen arm’s mean plus Gaussian noise with standard deviation 0.1. We take $\delta =0.0001$. The bandit algorithms are compared to the LinUCB algorithm Abbasiyadkori2011 , the rarelyswitching variant of the LinUCB algorithm Abbasiyadkori2011 , the greedy least squares algorithm Bastani2017 , and the conservative LinUCB algorithm (CLUCB) Kazerouni2017 .
Both rarelyswitching (RS) algorithms perform well. First, the perstep regret $(\frac{{R}_{t}}{t})$ is lowest for the greedy algorithm, consistent with previous bandit comparisons Bietti2018 (Figure 3a). However the RS algorithms perform almost as well, followed by the LinUCB and RS LinUCB methods. The CLUCB algorithm is the slowest to learn: despite the fact that its baseline arm is set to the be the arm with the highest expected return, meaning it starts with an already good policy, ultimately the CLUCB method has the highest per step regret. Second, we see that the mean number of changes for both rarelyswitching algorithms is very low (Figure 3b). Indeed, the RSgreedy algorithm achieves low regret with as few as five average changes in policy. Both RS policies changes much less than the RS LinUCB algorithm ($\approx 250$). We also compare each algorithm’s performance relative to a policy that always chooses a fixed, ‘baseline’ arm (set to be the with the highest expected reward). By rarely updating their policy, the rarelyswitching algorithms seldom have cumulative reward below the baseline policy (Figure 3c). In contrast, the LinUCB algorithms perform below the baseline rate on average around 6% of the time, while the CLUCB algorithm, by its design, never performs below the baseline. Finally, we examine the proportion of changes which result in lower expected regret for each method (Figure 3d). Consistent with our reasoning, a higher proportion of policy changes made by the RSconservative algorithm do indeed lead to improved expected regret, compared with changes made by either RSgreedy or RSLinUCB (both around chance). Taken together, these result show that rarely updating a bandit policy can result in efficient and safe learning, with only a very small number of changes to the policy.
3.2 Real data
Next we apply the algorithms to a medical dataset: a study on the effect of highquality child care and home visits on future cognitive performance. By presenting subjects sequentially to each algorithm, we test if the algorithms can be used to allocate treatment to subjects such that as many benefit as possible – or least more than would benefit from a randomized control trial (RCT). We use a semisimulated infant health and development program (IDHP) dataset, in which counterfactuals are simulated from a learned model from the actual trial Hill2011b . This dataset has been used as a benchmark for causal learning (e.g. Shalit2016a ). The dataset consists of 747 subjects, and 100 simulated outcomes and counterfactuals. We observe that all algorithms learn an optimal treatment within the trial. However the rarelyswitching and greedy algorithms learn significantly faster than the LinUCB and rarelyswitching LinUCB (Figure 4a). Further, both rarelyswitching algorithms learns a good policy with approximately 35 changes in policy (Figure 4b). These results suggest that these algorithms can be used to efficiently learn good policies with minimal changes, even over a short trial – making them relevant for adaptive clinical trial designs.
4 Conclusion
Here we proposed an approach to optimize policies determined by a linear combination of covariates in a rarelyswitching manner. We proved that, in some cases, we can construct updates that are guaranteed to improve the expected reward. Both in simulation and real data, the methods learn as efficiently as other stateoftheart bandit algorithms. Further, the greedy rarelyswitching approach requires very few changes to reach a good policy. Though we do not perform a comprehensive comparison between other bandit algorithms, these findings are consistent with previous bandit studies. For instance, Foster et al 2018 Foster2018 finds greedy algorithms, and algorithms that maintain a set of feasible parameter regions perform well.
There are two large caveats to the rarelyswitching approach we developed. First, the algorithms are not guaranteed to converge with almost certainty. For applications where safety is important, this is an important consideration. However this shortcoming is shared by the other methods considered here. Second, in many of the cases considered, a linear model may not be reasonable. This could be addressed by considering GLMs, or by considering local regression models. Indeed, in measuring a causal effect at a threshold, RDD relies only on locally linear models. We may expect our results could be generalized to nonlinear settings by considering a local linear regression at the threshold. This is left for future work.
The rarelyswitching framework is particularly relevant to adaptive trial designs and cases where the number of arms (treatments) is small. In such cases there is a simultaneous goal of learning the treatment effect and maximizing positive outcomes. As discussed in the introduction, there is a close relationship between learning with bandit feedback and causal inference. Indeed, a number of recent studies have explored intersections of causal inference and bandit problems, considering cases where the actions of an agent are confounded Bareinboim2016 ; Forney2017 ; Zhang2017a ; Bareinboim2015 ; Lattimore2016 ; Lee2018 , or where methods from causal inference such as matching can prove useful Dimakopoulou2017 ; Offerwestort2018 . We complement these methods by showing how adaptive trials can be performed with few changes to policy.
The rarelyswitching framework is also relevant to any policy determined by a fixed threshold. A number of authors have noted that thresholds could be used a lot more in medicine to measure causal effects Moscoe2015 ; Marinescu2018 . Our work shows how these thresholds could also be optimized.
More generally, the rarelyswitching framework could be extended to policies in Markov Decision processes (MDPs). Safety in these settings is increasingly of interest Fan2019 ; Laroche2018 ; Komorowski2019 ; Carrara2019 ; Fulton2019 . We believe our approach method could be extended to make conservative changes to policies, with applications in particular to personalized medicine. Further, because a policy in deep reinforcement learning can be parameterized as a linear readout of a set of learned features Berg2016 , this approach could also be applied to deep learning and deep reinforcement learning Hechtlinger2019 . This is also left as future work.
In many real world settings, changing policies is expensive or difficult. As the use of machine learning to find optimal policies in medicine and economics increases, considering this inertia is increasingly important. Using the theory of contextual linear bandits, here we provided one of the first analyses of this problem.
References
 [1] Yasin Abbasiyadkori, David Pal, and Csaba Szepesvari. Improved Algorithms for Linear Stochastic Bandits. NIPS 2011, pages 1–19, 2011.
 [2] Hossein Aboutalebi, Doina Precup, and Tibor Schuster. Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials. ArXiv eprints, 2019.
 [3] Yizhuang Alden Cheng. Regression Discontinuity Designs with Multiple Assignment Variables *. Technical report, 2016.
 [4] Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with Unobserved Confounders: A Causal Approach. Advances in Neural Information Processing Systems, (November):1–9, 2015.
 [5] Elias Bareinboim and Judea Pearl. Causal inference and the datafusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
 [6] Eleanor Barry, Samantha Roberts, Jason Oke, Shanti Vijayaraghavan, Rebecca Normansell, and Trisha Greenhalgh. Efficacy and effectiveness of screen and treat policies in prevention of type 2 diabetes : systematic review and metaanalysis of screening tests and interventions. BMJ, 356, 2017.
 [7] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly ExplorationFree Algorithms for Contextual Bandits. ArXiv eprints, pages 1–58, 2017.
 [8] Aharon BenTal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, Princeton, 2009.
 [9] E Van Den Berg. Some Insights into the Geometry and Training of Neural Networks. ArXiv eprints, (2):1–33, 2016.
 [10] Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian Optimization with Safety Constraints : Safe and Automatic Parameter Tuning in Robotics. ArXiv eprints, 2016.
 [11] Marinho Bertanha. Regression Discontinuity Design with Many Thresholds. SSRN, pages 1–62, 2019.
 [12] Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and Applications of Robust Optimization. SIAM Review, 53(3):464–501, 2011.
 [13] Dimitris Bertsimas and Melvyn Sim. Robust Discrete Optimization Under Ellipsoidal Uncertainty Sets. Technical Report, pages 1–23, 2004.
 [14] Alberto Bietti, Alekh Agarwal, and John Langford. A Contextual Bandit Bakeoff. ArXiv eprints, 2018.
 [15] Jacob Bor, Ellen Moscoe, Portia Mutevidzi, MarieLouise Newell, and Till Barnighausen. Regression Discontinuity Designs in Epidemiology. Epidemiology, 25(5):729–737, 2014.
 [16] Christopher Carpenter and Carlos Dobkin. The minimum legal drinking age and public health. J Econ Perspect, 25(2):133–156, 2011.
 [17] Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalricambrym Maillard, Olivier Pietquin, and Renault Group. Scaling up Budgeted Reinforcement Learning. ArXiv eprints, 2019.
 [18] Matias D. Cattaneo, Luke Keele, Rocio Titiunik, and Gonzalo VazquezBare. Extrapolating Treatment Effects in MultiCutoff Regression Discontinuity Designs. ArXiv eprints, 2019.
 [19] Matis D Cattaneo, Luke Keele, Rocio Titiunik, and Gonzalo VazquezBare. Interpreting Regression Discontinuity Designs with Multiple Cutoffs. Journal of Politics, 78(4), 2016.
 [20] Jonghyun Choi and Larry S Davis. Toward Sparse Coding on Cosine Distance. International Conference on Pattern Recognition, 22, 2014.
 [21] Yinlam Chow, Ofir Nachum, Aleksandra Faust, and Edgar Duenezguzman Mohammad. Lyapunovbased Safe Policy Optimization for Continuous Control. ArXiv eprints, 2019.
 [22] W Chu and L Li. Contextual bandits with linear payoff functions. International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 [23] Yahel David, Balázs Szörényi, Mohammad Ghavamzadeh, Shie Mannor, and Nahum Shimkin. PAC Bandits with Risk Constraints. ISAIM, 2018.
 [24] Cheryl Dennisonhimmelfarb, Joel Handler, and Daniel T Lackland. 2014 EvidenceBased Guideline for the Management of High Blood Pressure in Adults. JAMA, 311(5):507–520, 2014.
 [25] Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Estimation Consideration in Contextual Bandits. ArXiv eprints, 2017.
 [26] Yingying Dong and Arthur Lewbel. Identifying the effect of changing the policy threshold in regression discontinuity models. The review of economics and statistics, 97(December):1081–1092, 2015.
 [27] Jiameng Fan and Wenchao Li. Safetyguided deep reinforcement learning via online Gaussian process estimation. ArXiv eprints, pages 1–13, 2019.
 [28] O P Ferreira, A N Iusem, and S Z Nemeth. Concepts and techniques of optimization on the sphere. TOP, 22:1148–1170, 2014.
 [29] Sarah Filippi, Olivier Capp, Aurelien Garivier, and Csaba Szepesvari. Parametric Bandits : The Generalized Linear Case. NIPS, pages 1–9, 2010.
 [30] Andrew Forney, Judea Pearl, Elias Bareinboim, and West Lafayette. Counterfactual DataFusion for Online Reinforcement Learners. Proceedings of the 34th International Conference on Machine Learning, 70(June):1156–1164, 2017.
 [31] Dylan J Foster, Alekh Agarwal, Miroslav Dud, and Luo Robert. Practical Contextual Bandits with Regression Oracles. ArXiv eprints, 2018.
 [32] Nathan Fulton and Andre Platzer. Verifiably Safe OffModel Reinforcement Learning. ArXiv eprints, 2019.
 [33] Melody Y Guan and Heinrich Jiang. Nonparametric Stochastic Contextual Bandits. AAAI, 2018.
 [34] Sudipto Guha and Kamesh Munagala. Multiarmed Bandits with Metric Switching Costs. In Proceedings of the 36th Internatilonal Collogquium on Automata, Languages and Programming: Part II, ICALP ’09, pages 496–507, Berlin, Heidelberg, 2009. SpringerVerlag.
 [35] Yotam Hechtlinger, Barnabás Póczos, and Larry Wasserman. Cautious Deep Learning. ArXiv eprints, 2019.
 [36] Jennifer L Hill. Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
 [37] Guido Imbens and Stefan Wager. Optimized Regression Discontinuity Designs. ArXiv eprints, pages 1–22, 2017.
 [38] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. ICML, 2002.
 [39] Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem. NIPS, 2018.
 [40] Sumeet Katariya, Branislav Kveton, Zheng Wen, and Vamsi K. Potluru. Conservative Exploration using Interleaving. PMLR, 89:1–16, 2019.
 [41] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin AbbasiYadkori, and Benjamin Van Roy. Conservative Contextual Linear Bandits. Advances in neural information processing, 2017.
 [42] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo. Understanding the Artificial Intelligence Clinician and optimal treatment strategies for sepsis in intensive care General comments on the safety of the AI Clinician. ArXiv eprints, pages 1–13, 2019.
 [43] Tomer Koren, Roi Livni, and Yishay Mansour. Bandits with Movement Costs and Adaptive Pricing. COLT, 2017.
 [44] Tomer Koren, Roi Livni, and Yishay Mansour. MultiArmed Bandits with Metric Movement Costs. Advances in Neural Information Processing Systems, 30, 2017.
 [45] Andreas Krause and Cheng Soon Ong. Contextual Gaussian Process Bandit Optimization. Advances in Neural Information Processing Systems, pages 1–9, 2011.
 [46] Romain Laroche and Orange Labs. Reinforcement learning algorithm selection. ICLR, 2018.
 [47] Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal Bandits: Learning Good Interventions via Causal Inference. Advances in Neural Information Processing Systems, 2016.
 [48] David S Lee. Randomized Experiments from Nonrandom Selection in US House Elections. Journal of Econometrics, 142(2), 2008.
 [49] Sanghack Lee and Elias Bareinboim. Structural Causal Bandits with Nonmanipulable Variables. AAAI, (November), 2018.
 [50] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A ContextualBandit Approach to Personalized News Article Recommendation. WWW, 2010.
 [51] Ioana E Marinescu, Patrick N Lawlor, and Konrad P Kording. Quasiexperimental causality in neuroscience and behavioral research. Nature Human Behaviour, 2(December):891–898, 2018.
 [52] Ioana Elena Marinescu, Konrad Paul Kording, Sofia Triantafillou, and Konrad Paul Kording. Regression Discontinuity Threshold Optimization. SSRN, (1):1–6, 2019.
 [53] Ellen Moscoe, Jacob Bor, and Till Bärnighausen. Regression discontinuity designs are underutilized in medicine, epidemiology, and public health: A review of current and best practice. Journal of Clinical Epidemiology, 68(2):122–133, 2015.
 [54] Brian G Moss, William H Yeaton, and Jane E Lloyd. Evaluating the Effectiveness of Developmental Mathematics by Embedding a Randomized Experiment Within a Regression Discontinuity Design. Educational Evaluation and Policy Analysis, 36(2):170–185, 2014.
 [55] Katta G Murty. Linear programming. Springer, 1983.
 [56] Matthew Nayor, Ramachandran S Vasan, and Heart Study. Recent update to the US Cholesterol Treatment Guidelines: A comparison with international guidelines. Circulation, 133(18):1795–1806, 2017.
 [57] Molly Offerwestort and Alexander Coppock. Adaptive Experimental Design : Prospects and Applications in Political Science. Annual Meeting of the American Political Science Association, 2018.
 [58] Ankur Pandya, Stephen Sy, Sylvia Cho, Milton C Weinstein, and Thomas A Gaziano. Costeffectiveness of 10year risk thresholds for initiation of statin therapy for primary prevention of cardiovascular disease. JAMA, 314(2):142–150, 2015.
 [59] John P Papay, John B Willett, and Richard J Murnane. Extending the regressiondiscontinuity approach to multiple assignment variables. Journal of Econometrics, 161(2):203–207, 2011.
 [60] Judea Pearl. Causality: models, reasoning and inference. Cambridge Univ Press, 2000.
 [61] Paat Rusmevichientong and John N Tsitsiklis. Linearly Parameterized Bandits. ArXiv eprints, 2010.
 [62] Amir Sani, Alessandro Lazaric, and Rémi Munos. RiskAversion in Multiarmed Bandits. Advances in neural information processing, pages 1–20, 2013.
 [63] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect : generalization bounds and algorithms. ArXiv eprints, 0:1–22, 2016.
 [64] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ArXiv eprints, 2009.
 [65] Wen Sun, Debadeepta Dey, and Ashish Kapoor. RiskAware Algorithms for Adversarial Contextual Bandits. ArXiv eprints, 4:1–28, 2016.
 [66] Csaba Szepesvari and Tor Lattimore. Bandit Algorithms. Cambridge University Press, page 542, 2018.
 [67] Donald L Thistlewaite, Donald T Campbell, Peter Aronow, Nicole Basta, Betz Halloran, Matias Cattaneo, Gonzalo VazquezBare, Guido Imbens, AlessanDra Mattei, Fabrizia Mealli, Jasjeet Sekhon, Rocío Titiunik, Vivian Wong, and Coady Wing. Regressiondiscontinuity Analysis: An alternative to the expost facto experiment. Journal of Educational Psychology, 51:309–317, 1960.
 [68] Sattar Vakili and Qing Zhao. RiskAverse MultiArmed Bandit Problems under MeanVariance Measure. IEEE Journal on Selected Topics in Signal Processing, 10(6):1093–1111, 2016.
 [69] Vivian Wong, Peter Steiner, and Thomas Cook. Analyzing RegressionDiscontinuity Designs With Multiple Assignment Variables : A Comparative Study of Four Estimation Methods. Journal of Educational and Behavioral Statistics, 38(2):107–141, 2013.
 [70] Vivian C Wong, Peter M Steiner, and Thomas D Cook. Analyzing RegressionDiscontinuity Designs with Multiple Assignment Variables : A Comparative Study of Four Estimation Methods. 2010.
 [71] Xiang Wu, Zhiguo Shi, and Lei Liu. Quasi Cosine Similarity Metric Learning. In C Jawahar and S Shan, editors, Lecture Notes in Computer Science, vol 9010. 2014.
 [72] Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative Bandits. ICML, 48, 2016.
 [73] Hongyi Zhang and Suvrit Sra. Firstorder Methods for Geodesically Convex Optimization. JMLR, 49(1):1–22, 2016.
 [74] Junzhe Zhang and Elias Bareinboim. Transfer learning in multiarmed bandits: A Causal approach. IJCAI International Joint Conference on Artificial Intelligence, pages 1340–1346, 2017.
 [75] Li Zhou. A Survey on Contextual Multiarmed Bandits. ArXiv eprints, 2015.
 [76] Alexander Zimin, Rasmus Ibsenjensen, Krishnendu Chatterjee, and L G May. Generalized RiskAversion in Stochastic MultiArmed Bandits. ArXiv eprints, 2014.
Supplementary material
Appendix A Algorithms
We implement a projected gradient method. To solve
$${\stackrel{~}{\theta}}_{t+1}={\text{argmax}}_{\phi \in {\mathcal{C}}_{t}}\frac{{\stackrel{~}{\theta}}_{t}^{T}{F}^{T}F\phi}{\parallel F{\stackrel{~}{\theta}}_{t}\parallel \parallel F\phi \parallel},$$ 
the method consists of alternating ${n}_{iter}$ times between gradient updates to maximize:
$$K(\phi )=\frac{{\stackrel{~}{\theta}}_{t}^{T}{F}^{T}F\phi}{\parallel F{\stackrel{~}{\theta}}_{t}\parallel \parallel F\phi \parallel},$$  (4) 
followed by projections onto the convex set ${\mathcal{C}}_{t}$. Let $J={F}^{T}F$. Then the gradient with respect to $\phi $ is:
$$\frac{\partial K}{\partial \phi}(\phi )=\frac{1}{\parallel F{\stackrel{~}{\theta}}_{t}\parallel {({\phi}^{T}J\phi )}^{3/2}}\left(J\phi ({\stackrel{~}{\theta}}^{T}J\phi )J\stackrel{~}{\theta}({\phi}^{T}J\phi )\right).$$ 
We decide to update the parameters if the smallest angle between ${\stackrel{~}{\theta}}_{t1}$ and $\phi \in {\mathcal{C}}_{t}$ is above some threshold, $\text{arccos}(\mathrm{\Delta})$, and if the parameters ${\stackrel{~}{\theta}}_{t1}$ are indeed not in the current confidence set ${\mathcal{C}}_{t}$. The algorithm has parameters, ${n}_{iter}$, step size $\u03f5$, and a tolerance at which we decide two vectors have the same angle, $\mathrm{\Delta}$. Here we use ${n}_{iter}=100$, $\u03f5=0.1$ and $\mathrm{\Delta}=0.01$. The algorithm was tested for robustness to the choice in $\mathrm{\Delta}$ – values of $\mathrm{\Delta}$ between ${10}^{1}$ and ${10}^{4}$ do not affect the regret behavior (Supplementary Figure 5).
{algorithmic} \Requireinitial policy ${\stackrel{~}{\theta}}_{0}$, regularization parameter $\lambda $, number of rounds $T$, initial gradient step size $\u03f5$, tolerance $\mathrm{\Delta}$ \StateInitialize ${V}_{0}\leftarrow \lambda I$ \For$t\in [0,T]$ \StateObserve context: ${\mathbf{s}}_{t}$ \StateChoose action: ${a}_{t}\leftarrow {\text{argmax}}_{a}{\mathbf{s}}_{t}^{T}{\stackrel{~}{\theta}}_{t1}$ and obtain reward ${y}_{t}$ \StateUpdate parameters: ${V}_{t}$, ${U}_{t}$, ${\widehat{\theta}}_{t}$, ${\beta}_{t}$ and bounds ${\mathcal{C}}_{t}$ \State${\phi}_{0}\leftarrow {\stackrel{~}{\theta}}_{t1}$ \For$i\in [1,{n}_{iter}]$ \StateSet step size: $\eta \leftarrow \u03f5/\sqrt{i}$ \StateGradient step, using Suppl. Eq. (4): ${\phi}^{\prime}\leftarrow {\phi}_{i1}+\eta \frac{\partial K}{\partial \phi}({\phi}_{i1})$ \StateProject onto ${\mathcal{C}}_{t}$: ${\phi}_{i}\leftarrow {\text{argmin}}_{\varphi \in {\mathcal{C}}_{t}}{\parallel \varphi {\phi}^{\prime}\parallel}_{{V}_{t}}^{2}$ \EndFor\Stateboundary_angles $\leftarrow \frac{{\stackrel{~}{\theta}}_{t1}^{T}J{\phi}_{{n}_{iter}}}{{\parallel {\stackrel{~}{\theta}}_{t1}\parallel}_{J}{\parallel {\phi}_{{n}_{iter}}\parallel}_{J}}$ \Ifboundary_angles $$ and ${\parallel {\stackrel{~}{\theta}}_{t1}{\widehat{\theta}}_{t}\parallel}_{{V}_{t}}>{\beta}_{t}$ \State${\stackrel{~}{\theta}}_{t}\leftarrow {\phi}_{{n}_{iter}}$ \Else\State${\stackrel{~}{\theta}}_{t}\leftarrow {\stackrel{~}{\theta}}_{t1}$ \EndIf\EndFor
The greedy algorithm only changes the last section of the algorithm. When an updated is required, it sets ${\stackrel{~}{\theta}}_{t}={\widehat{\theta}}_{t}$.
{algorithmic} \Requireinitial policy ${\stackrel{~}{\theta}}_{0}$, regularization parameter $\lambda $, number of rounds $T$, initial gradient step size $\u03f5$, tolerance $\mathrm{\Delta}$ \StateInitialize ${V}_{0}\leftarrow \lambda I$ \For$t\in [0,T]$ \StateObserve context: ${\mathbf{s}}_{t}$ \StateChoose action: ${a}_{t}\leftarrow {\text{argmax}}_{a}{\mathbf{s}}_{t}^{T}{\stackrel{~}{\theta}}_{t1}$ and obtain reward ${y}_{t}$ \StateUpdate parameters: ${V}_{t}$, ${U}_{t}$, ${\widehat{\theta}}_{t}$, ${\beta}_{t}$ and bounds ${\mathcal{C}}_{t}$ \State${\phi}_{0}\leftarrow {\stackrel{~}{\theta}}_{t1}$ \For$i\in [1,{n}_{iter}]$ \StateSet step size: $\eta \leftarrow \u03f5/\sqrt{i}$ \StateGradient step, using Equation (4): ${\phi}^{\prime}\leftarrow {\phi}_{i1}+\eta \frac{\partial K}{\partial \phi}({\phi}_{i1})$ \StateProject onto ${\mathcal{C}}_{t}$: ${\phi}_{i}\leftarrow {\text{argmin}}_{\varphi \in {\mathcal{C}}_{t}}{\parallel \varphi {\phi}^{\prime}\parallel}_{{V}_{t}}^{2}$ \EndFor\Stateboundary_angles $\leftarrow \frac{{\stackrel{~}{\theta}}_{t1}^{T}J{\phi}_{{n}_{iter}}}{{\parallel {\stackrel{~}{\theta}}_{t1}\parallel}_{J}{\parallel {\phi}_{{n}_{iter}}\parallel}_{J}}$ \Ifboundary_angles $$ and ${\parallel {\stackrel{~}{\theta}}_{t1}{\widehat{\theta}}_{t}\parallel}_{{V}_{t}}^{2}>{\beta}_{t}$ \State${\stackrel{~}{\theta}}_{t}\leftarrow {\widehat{\theta}}_{t}$ \Else\State${\stackrel{~}{\theta}}_{t}\leftarrow {\stackrel{~}{\theta}}_{t1}$ \EndIf\EndFor
Appendix B Proofs of Theorems
The following is a restatement of Theorem 20.2 from [66], and provides an explicit form for the size of the uncertainty sets. We omit the proof here.
Theorem B1.
For any $\delta \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$, with probability at least $\mathrm{1}\mathrm{}\delta $, it holds that for all $t\mathrm{\in}{\mathrm{N}}_{\mathrm{+}}$,
$$ 
Also, for $\mathrm{\parallel}\theta \mathrm{\parallel}\mathrm{\le}L$ then $\mathrm{P}\mathit{}\mathrm{(}{\mathrm{E}}_{t}\mathrm{)}\mathrm{\ge}\mathrm{1}\mathrm{}\delta $ with ${\mathrm{C}}_{t}$ defined using
$${\beta}_{t}=\sqrt{\lambda}L+\sqrt{2\mathrm{log}\left(\frac{1}{\delta}\right)+\mathrm{log}\left(\frac{det{V}_{t}}{{\lambda}^{d}}\right)}.$$ 
Theorem 1.
With probability at least $\mathrm{1}\mathrm{}\delta $, the regret of any policy learned by a feasible algorithm is bounded by
$${\widehat{R}}_{T}\le \sqrt{32dT{\beta}_{T}\mathrm{log}\left(\frac{trace({V}_{0})+T{L}^{2}}{d{det}^{1/d}({V}_{0})}\right)}.$$ 
Proof.
For a given context ${\mathbf{s}}_{t}$ let $\mathcal{J}({\mathbf{s}}_{t})\subseteq \mathcal{A}$ denote the set of valid arms that are not excluded from being played according to their confidence bounds. Assuming the confidence bounds hold, then if $\mathcal{J}({\mathbf{s}}_{t})$ contains only one arm then a feasible algorithm incurs no regret in that round. Otherwise, let ${i}^{*}$ and ${j}^{*}$ be the pair of valid arms with the largest difference in payout for ${\mathbf{s}}_{t}$:
$${i}^{*},{j}^{*}={\text{argmax}}_{i,j\in \mathcal{J}({\mathbf{s}}_{t})}{\mathbf{s}}_{t}^{T}({\theta}^{i}{\theta}^{j}).$$ 
The optimal arm is one of the arms in $\mathcal{J}({\mathbf{s}}_{t})$, as is the choice of a feasible algorithm, thus this is a bound for the regret in round $t$. Without loss of generality, let ${i}^{*}$ be the arm with the higher estimated payoff for context ${\mathbf{s}}_{t}$. Since ${j}^{*}$ is not excluded, the lower bound of arm ${i}^{*}$ is below the upper bound of arm ${j}^{*}$. This means that:
${\mathbf{s}}_{t}{\widehat{\theta}}_{t}^{{i}^{*}}\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{i}^{*}})}^{1}}$  $\le {\mathbf{s}}_{t}{\widehat{\theta}}_{t}^{{j}^{*}}+\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{j}^{*}})}^{1}}$  
$\Rightarrow \left{\mathbf{s}}_{t}({\widehat{\theta}}_{t}^{{i}^{*}}{\widehat{\theta}}_{t}^{{j}^{*}})\right$  $\le \sqrt{{\beta}_{t}^{{i}^{*}}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{i}^{*}})}^{1}}+\sqrt{{\beta}_{t}^{{j}^{*}}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{j}^{*}})}^{1}}$  (5) 
Thus:
${r}_{t}$  $\le {\mathbf{s}}_{t}^{T}({\theta}^{{i}^{*}}{\theta}^{{j}^{*}})$  
$\le {\mathbf{s}}_{t}^{T}({\theta}^{{i}^{*}}{\widehat{\theta}}_{t}^{i})+{\mathbf{s}}_{t}^{T}({\theta}^{{j}^{*}}{\widehat{\theta}}_{t}^{{j}^{*}})+{\mathbf{s}}_{t}^{T}({\widehat{\theta}}_{t}^{{i}^{*}}{\widehat{\theta}}_{t}^{{j}^{*}})$  
$\le {\mathbf{s}}_{t}^{T}({\theta}^{{i}^{*}}{\widehat{\theta}}_{t}^{{i}^{*}})+{\mathbf{s}}_{t}^{T}({\theta}^{{j}^{*}}{\widehat{\theta}}_{t}^{{j}^{*}})+\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{i}^{*}})}^{1}}+\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{j}^{*}})}^{1}}$  
$\le {\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{i}^{*}})}^{1}}{\parallel ({\theta}^{i}{\widehat{\theta}}_{t}^{{i}^{*}})\parallel}_{{V}_{t}^{{i}^{*}}}+{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{j}^{*}})}^{1}}{\parallel ({\theta}^{{j}^{*}}{\widehat{\theta}}_{t}^{{j}^{*}})\parallel}_{{V}_{t}^{{j}^{*}}}+\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{i}^{*}})}^{1}}+\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{j}^{*}})}^{1}}$  
$\le 2\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{i}^{*}})}^{1}}+2\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t}^{{j}^{*}})}^{1}}$  
$\le 4\sqrt{{\beta}_{t}}{\parallel {\mathbf{s}}_{t}\parallel}_{{({V}_{t})}^{1}},$ 
using: Equation (5); the CauchySchwarz inequality; the fact that $\theta \in {\mathcal{C}}_{t}$; and the fact that ${\parallel \mathbf{s}\parallel}_{{({V}_{t})}^{1}}^{2}={\sum}_{i=1}^{k}{\parallel \mathbf{s}\parallel}_{{({V}_{t}^{i})}^{1}}^{2}$, and thus that ${\parallel \mathbf{s}\parallel}_{{({V}_{t}^{i})}^{1}}\le {\parallel \mathbf{s}\parallel}_{{({V}_{t})}^{1}}$ for any $i$. Up to a factor of 2, this is exactly the form of the regret bound obtained in the proof of Theorem 19.2 [66]. Thus the rest of the proof proceeds identically from there and establishes the result. ∎
Theorem 2.
Any rarelyswitching algorithm that maintains $\mathrm{(}{\stackrel{\mathrm{~}}{\theta}}_{t}^{i}\mathrm{}{\stackrel{\mathrm{~}}{\theta}}_{t}^{j}\mathrm{)}\mathrm{\in}\text{\mathit{c}\mathit{o}\mathit{n}\mathit{e}}\mathit{}\mathrm{(}{\mathrm{\Psi}}_{t}^{i\mathit{}j}\mathrm{)}$ for all $i\mathrm{,}j\mathrm{\in}\mathrm{[}k\mathrm{]}$ and for all $t$ is a feasible algorithm.
Proof.
A feasible algorithm never plays an excluded arm. This can be judged on a pairwise basis: if the expected payoff for arm $j$ is lower than that of any other arm $i$, then $j$ is excluded. For a given context ${\mathbf{s}}_{t}$, all excluded arms can be determined by considering all pairs of arms. Thus we need only focus on characterizing the decision boundaries between pairs of arms. The confidence sets are constructed such that these decision boundaries lie in decision regions with high probability. For two arms, $i$ and $j$, let ${\mathrm{\Psi}}_{t}^{ij}$ be the set difference ${\mathrm{\Psi}}_{t}^{ij}=\{\psi \in {\mathbb{R}}^{d}\psi ={\theta}^{i}{\theta}^{j},\text{with}\theta \in {\mathcal{C}}_{t}\}$.
Then the decision boundary between arms $i$ and $j$ lies in the set:
$${\mathcal{D}}_{t}^{ij}=\{\mathbf{s}\in {\mathbb{R}}^{d}{\mathbf{s}}^{T}{\psi}^{ij}=0,{\psi}^{ij}\in {\mathrm{\Psi}}_{t}^{ij}\}.$$ 
This region can be understand as the complement of the union of two sets: the dual cone of ${\mathrm{\Psi}}_{t}^{ij}$ and the polar cone of ${\mathrm{\Psi}}_{t}^{ij}$. The dual cone corresponds to contexts in which, for any set of feasible parameters $\theta \in {\mathcal{C}}_{t}$, arm $i$ is better: ${C}_{t}^{*ij}=\{\mathbf{s}\in {\mathbb{R}}^{d}{\mathbf{s}}^{T}{\psi}^{ij}>0,\forall {\psi}^{ij}\in {\mathrm{\Psi}}_{t}^{ij}\}$. The polar cone, ${D}_{t}^{*ij}\u2254{C}_{t}^{*ij}$ is the opposite: it represents contexts for which, given any combination of feasible parameters, arm $j$ is better.
If ${\stackrel{~}{\theta}}^{i}{\stackrel{~}{\theta}}^{j}\in \text{cone}({\mathrm{\Psi}}_{t}^{ij})$ then $\exists \alpha >0$ such that
$${\stackrel{~}{\theta}}^{i}{\stackrel{~}{\theta}}^{j}=\alpha ({\theta}^{i}{\theta}^{j})$$ 
for some ${\theta}^{i},{\theta}^{j}\in {\mathcal{C}}_{t}$. This means there exists feasible parameters that have the same decision boundary as ${\stackrel{~}{\theta}}^{i},{\stackrel{~}{\theta}}^{j}$. This means that, for all $\mathbf{s}\in {C}_{t}^{*ij}$, where arm ${\theta}^{i}$ is better, the policy $\stackrel{~}{\theta}$ will not play arm $j$ (the excluded arm). Thus $\stackrel{~}{\theta}$ does not play any excluded arm, and is a feasible algorithm. Note that if $0\in {\mathrm{\Psi}}_{t}^{ij}$ then any decision boundary between $i$ and $j$ is plausible, and this pair of arms does not constrain if $\stackrel{~}{\theta}$ is feasible or not.
∎
We can prove in $d=2$ there are updates that always improve the expected regret. The intuition is that, in two dimensions, the policy ${\stackrel{~}{\theta}}_{t}$ defines intervals over angles in which each arm is played. The problem thus becomes one dimensional – it can be considered as a cyclic ordering of the arms based on angle, $\sigma =(\sigma ({\stackrel{~}{\theta}}_{1})\mathrm{\dots}\sigma ({\stackrel{~}{\theta}}_{k}))$, and a set of angles corresponding to the decision boundaries between the pair of arms ${\stackrel{~}{\theta}}^{{\sigma}_{i}}$ and ${\stackrel{~}{\theta}}^{{\sigma}_{i+1}}$. Then we have:
Theorem B2.
For $d\mathrm{=}\mathrm{2}$, if an update to the policy ${\stackrel{\mathrm{~}}{\theta}}_{t}$ satisfies:

1.
the order of the arms $\sigma $ is the same for ${\stackrel{~}{\theta}}_{t}$ and ${\stackrel{~}{\theta}}_{t+1}$,

2.
each new decision boundary ${\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}$ stays within the next and previous decision boundaries given by ${\stackrel{~}{\theta}}_{t}$:
$${\psi}_{t}^{{\sigma}_{i1},{\sigma}_{i}}\le {\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}\le {\psi}_{t}^{{\sigma}_{i+1},{\sigma}_{i+2}},$$ 
3.
each change ${\psi}_{t}^{ij}\to {\psi}_{t+1}^{ij}$ is the smallest such that ${\psi}_{t+1}^{ij}\in \text{\mathit{c}\mathit{o}\mathit{n}\mathit{e}}({\mathrm{\Psi}}_{t+1}^{ij})$. I.e. that
$${\psi}_{t+1}^{ij}={\text{\mathit{a}\mathit{r}\mathit{g}\mathit{m}\mathit{i}\mathit{n}}}_{\psi \in \text{\mathit{c}\mathit{o}\mathit{n}\mathit{e}}({\mathrm{\Psi}}_{t+1}^{ij})}\text{\mathit{a}\mathit{r}\mathit{c}\mathit{c}\mathit{o}\mathit{s}\mathit{d}\mathit{i}\mathit{s}\mathit{t}}(\psi ,{\psi}_{t}^{ij}),$$ 
4.
If ${\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}}\le {\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}$, then in each interval $[{\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}},{\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}]$, there are no plausible decision boundaries that involve ${\sigma}_{i}$ (and ${\sigma}_{i+1}$ if instead ${\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}}>{\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}$)
then it does not increase the expected instantaneous regret, for a fixed policy, in the sense that:
$${\mathbb{E}}_{\mathbf{s}\sim \rho}\left({\mathbf{s}}^{T}{\stackrel{~}{\theta}}_{t}^{{a}_{t}}\right)\le {\mathbb{E}}_{\mathbf{s}\sim \rho}\left({\mathbf{s}}^{T}{\stackrel{~}{\theta}}_{t+1}^{{a}_{t+1}}\right),$$ 
with probability at least $\mathrm{1}\mathrm{}\delta $.
Proof.
In the planar case, edges constitute the faces of the polytope, thus specifying the decision boundaries (edges) is the same as the Hrepresentation of the polytope – it uniquely specifies the polytope. Thus we can equally think of dealing with vertices (arm parameters), or decision boundaries. Here we deal entirely with the decision boundaries.
The assumptions on the change in policy implicitly assume that the number of playable arms stays the same between $t$ and $t+1$ – that is, an arm does not become dominated (stops being an extreme point of the polytope), or stop being dominated. Note also that if $0\in {\mathrm{\Psi}}_{t+1}^{ij}$ then any boundary ${\psi}_{t+1}^{ij}$ is feasible, and this does not contribute to changing the policy.
We prove the result by considering each interval $[{\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}},{\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}]$. By the assumptions of the update, no interval overlaps with any other interval. This is because, by the conditions of the theorem, the new border is the first boundary encountered. That is, we can speak about the new boundary ${\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}$ as being either before or after the prior boundary ${\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}}$. If after, then both of the following are true:
${\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}}$  $\le {\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}\le {\psi}_{t}^{{\sigma}_{i+1},{\sigma}_{i+2}}\mathit{\hspace{1em}}(\text{condition ii})$  
${\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}}$  $\le {\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}\le {\psi}_{t+1}^{{\sigma}_{i+1},{\sigma}_{i+2}}\mathit{\hspace{1em}}(\text{condition i})$ 
and if before, then
${\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}$  $\le {\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}}\le {\psi}_{t}^{{\sigma}_{i+1},{\sigma}_{i+2}}\mathit{\hspace{1em}}(\text{ordering})$  
${\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}$  $\le {\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}}\le {\psi}_{t+1}^{{\sigma}_{i+1},{\sigma}_{i+2}}\mathit{\hspace{1em}}(\text{condition ii})$ 
In either case, the next interval $[{\psi}_{t}^{{\sigma}_{i+1},{\sigma}_{i+2}},{\psi}_{t+1}^{{\sigma}_{i+1},{\sigma}_{i+2}}]$ does not overlap with the current interval $[{\psi}_{t}^{{\sigma}_{i},{\sigma}_{i+1}},{\psi}_{t+1}^{{\sigma}_{i},{\sigma}_{i+1}}]$. The same argument applies to the previous interval. Thus all such intervals do not overlap with one another. Since the behavior of ${\stackrel{~}{\theta}}_{t}$ and ${\stackrel{~}{\theta}}_{t+1}$ is the same outside of these intervals then to establish that the change in policy does not incur more regret we just need to study the behavior of the policy in each of these intervals separately.
So, for contexts in a given interval, between $t$ and $t+1$, the policy switches from playing ${\sigma}_{i+1}$ to ${\sigma}_{i}$ (or vice versa, depending on the ordering). Assume the policy switches from ${\sigma}_{i+1}$ to ${\sigma}_{i}$; the same argument applies to the opposite case. Either the interval is empty, and the policy behaves the same between $t$ and $t+1$, or by assumption the decision boundary is moved just to the edge of the feasible boundary between arms ${\sigma}_{i}$ and ${\sigma}_{i+1}$.
Assume the policy is moved just to the edge of the feasible boundary between the arms. We need to consider two scenarios, either it is plausible a true decision boundary between ${\sigma}_{i+1}$ and another arm lies in the interval, or it does not.
By condition iv, there cannot exist a plausible true decision boundary between ${\sigma}_{i+1}$ and another arm. Further, since the movement to the boundary of the feasible region means that the interval cannot contain the true decision boundary ${\sigma}_{i},{\sigma}_{i+1}$. Thus, by the conditions, in the interval there is no way that playing ${\sigma}_{i+1}$ was feasible. The old policy played ${\sigma}_{i+1}$ in this interval, and so only incurs regret. By changing the policy to play ${\sigma}_{i}$ in this interval, expected regret cannot increase. ∎
Remark: The assumptions are not likely to be true at the start of the trial, where the ordering of the arms is unclear and the uncertain policy regions are not isolated from one another. However these assumptions may be reasonable later, in which case they provide a guarantee that the bandit never decreases its performance. Each condition is easy to check, thus the result provides a makenomistakes algorithm in 2D.
Appendix C Extra simulation results
We perform the same simulations as in the main text, except vary the number of dimensions $d$ and the number of arms $k$. We find similar results for $d\ge 5$. For $d=2$, the rarely switching bandits do not perform as well as LinUCB (Supplementary Figure 6).
Appendix D Comparison to other possible feasible algorithms
A simpler algorithm worth considering is just to take:
$${\stackrel{~}{\theta}}_{t}={\text{argmin}}_{\phi \in {\mathcal{C}}_{t}}{\parallel \phi {\stackrel{~}{\theta}}_{t1}\parallel}_{{V}_{t}}.$$ 
That is, as soon as the parameters ${\stackrel{~}{\theta}}_{t}$ are not feasible then update $\stackrel{~}{\theta}$. This will result in more updates than the rarelyswitching algorithms we propose, which decide when to update based on the decision boundaries being not feasible, as opposed to the arm parameters. A greedy version of this would be to take:
$${\stackrel{~}{\theta}}_{t}=\{\begin{array}{cc}{\stackrel{~}{\theta}}_{t1},\hfill & {\stackrel{~}{\theta}}_{t1}\in {\mathcal{C}}_{t}\hfill \\ {\text{argmin}}_{\phi \in {\mathcal{C}}_{t}}{\parallel \phi {\stackrel{~}{\theta}}_{t1}\parallel}_{{V}_{t}},\hfill & \text{else}\hfill \end{array}$$ 
To distinguish these algorithms from the rarelyswitching versions, call these the feasible conservative, and feasible greedy algorithms, respectively.
We compare these algorithms to the rarelyswitching algorithms on the same synthetic set of linear bandits as in the main text and above. These feasible algorithms have good perstep regret on the same simulated tasks (Supplementary Figure 7a). However, the number of changes to the policy is greater than the rarelyswitching conservative and greedy versions (Supplementary Figure 7b). Thus considering only updating policy when the decision boundary becomes infeasible, as opposed to the arm parameters, is worthwhile if the goal is to minimize policy changes. Further, the feasibleconservative algorithm does not produce updates which have a good chance of leading to decreased regret, in contrast to the rarelyswitching conservative algorithm (Supplementary Figure 7d).