Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret

  • 2019-04-12 16:32:17
  • Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, Liwei Wang
  • 1


We study the communication complexity of distributed multi-armed bandits(MAB) and distributed linear bandits for regret minimization. We proposecommunication protocols that achieve near-optimal regret bounds and result inoptimal speed-up under mild conditions. We measure the communication cost ofprotocols by the total number of communicated numbers. For multi-armed bandits,we give two protocols that require little communication cost, one isindependent of the time horizon $ T $ and the other is independent of thenumber of arms $ K $. In particular, for a distributed $K$-armed bandit with$M$ agents, our protocols achieve near-optimal regret $O(\sqrt{MKT\log T})$with $O\left(M\log T\right)$ and $O\left(MK\log M\right)$ communication costrespectively. We also propose two protocols for $d$-dimensional distributedlinear bandits that achieve near-optimal regret with $O(M^{1.5}d^3)$ and$O\left((Md+d\log\log d)\log T\right)$ communication cost respectively. Thecommunication cost can be independent of $T$, or almost linear in $d$.


Quick Read (beta)

Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret

Yuanhao Wang These two authors contributed [email protected] Institute for Interdisciplinary Information Sciences, Tsinghua University Jiachen Hu* [email protected] School of Electronics Engineering and Computer Science, Peking University Xiaoyu Chen [email protected] Key Laboratory of Machine Perception, MOE, School of EECS, Peking University Liwei Wang [email protected]

We study the communication complexity of distributed multi-armed bandits (MAB) and distributed linear bandits for regret minimization. We propose communication protocols that achieve near-optimal regret bounds and result in optimal speed-up under mild conditions. We measure the communication cost of protocols by the total number of communicated numbers. For multi-armed bandits, we give two protocols that require little communication cost, one is independent of the time horizon T and the other is independent of the number of arms K. In particular, for a distributed K-armed bandit with M agents, our protocols achieve near-optimal regret O(MKTlogT) with O(MlogT) and O(MKlogM) communication cost respectively. We also propose two protocols for d-dimensional distributed linear bandits that achieve near-optimal regret with O(M1.5d3) and O((Md+dloglogd)logT) communication cost respectively. The communication cost can be independent of T, or almost linear in d.


ltexReferences \newcolumntype"@     

1 Introduction

Bandit learning is a central topic in online learning, and has been applied to various real-world tasks including clinical trials [28], model selection [21], recommendation systems [3, 19, 2], etc. In many tasks using bandits, it may be appealing to employ more agents to learn collaboratively and concurrently in order to speed up the learning process. In some other tasks, it is only natural to consider a distributed setting; for instance, multiple geographically separated labs may be working on a same clinical trial. In such distributed applications, communication between agents is critical, but may also be expensive or time-consuming. This motivates us to consider efficient protocols for distributed learning in bandit problems.

A straightforward communication protocol would be immediate sharing: broadcasting every new sample immediately. Under this scheme, agents can have performance close to that in a centralized setting. However in this case, the amount of communicated data is directly proportional to the total size of collected data. When the bandits is played for a long timescale, or when the size of a single sample is large, the cost of communication would render this scheme impractical. On the other hand, if agents do not communicate at all, their performance is guaranteed to be unsatisfactory. A natural question to ask is: how does the cost of communication and learning performance trade-off?

The bandit learning problems that we consider include stochastic multi-armed bandits (MAB) and stochastic linear bandits. In the multi-armed bandits setting, the agent is given K-arms, each of which associated with a fixed reward distribution. The goal of the agent is to obtain highest possible reward within T total pulls and thereby minimize regret. The linear bandits setting is an extension of MAB. Each arm is a vector xd, and the expected reward for arm x is linear with respect to x. We consider regret minimization in distributed MAB as regret is arguably the most common performance measure for bandit problems. On the other hand, regret bounds can be reduced to sample complexity bounds for identifying ϵ-optimal arms with high probability.

In this paper, we study the distributed learning of both stochastic multi-armed bandits and stochastic linear bandits. In particular, M agents will interact with the same bandit instance in a synchronized fashion. We use the worst-case total regret of M agents in T timesteps as the performance criterion, which we compare with that of a centralized algorithm running for MT timesteps. In particular, for each task, we propose and analyze two protocols with different styles. Protocol 1 and Protocol 3 deal with distributed MAB, while Protocol 2 and Protocol 4 are designed for linear bandits. Protocol 1 and Protocol 2 are inspired by the UCB algorithm [6], and more generally, the optimism in the face of uncertainty principle. Protocol 3 and 4 are based on the phased-elimination algorithm for MAB [7] and for linear bandits [18].

We find that in both tasks, agents can achieve regret bounds that differ from the optimum by only logarithmic factors using very little communication. More specifically, for a K-armed bandits, only O(Mmin{logT,KlogM}) numbers need to be communicated to achieve near-optimal regret. For a d-dimensional linear bandit, O(min{M1.5d3,MdloglogdlogT}) numbers need to be communicated. This suggest that agents may not need to trade performance for communication-efficiency: communication is quite efficient even when regret is near-optimal. In fact, we also provide a communication lower bound for both problems, which suggests that even if we are willing to accept a slightly worse regret, not much can be gained in communication complexity, especially in the MAB problem.

1.1 Problem Setting

Communication Model

The communication model we employ here consists of a server and several agents. The agents can communicate with the server via uploading or receiving data packets, but they are not allowed to boardcast messages directly to all other agents. We assume that the communication between the agents and the server have zero latency; that is, agents and the server can perform arbitrary amount of communication between time steps. Note that protocols in our model can be easily adopted to a serverless network, by designating an agent as the server.

The data packets communicated between the server and agents contain integers or real numbers. We define the cost of sending an integer or a real number to be O(1). The communication complexity of a protocol is defined as the total cost throughout its execution (i.e. total number of communicated integers and real numbers). We expect that for our protocols, communication complexity would only differ from the total number of communicated bits by a at most logarithmic factor, as O(log(KMT)) (or O(log(dMT))) bits for each real number should provide enough precision.11 1 We believe this is reasonable since the communicated real numbers are continuous functions of local variables, and are only used to evaluate continuous functions. Informally speaking, we do not communicate “unnatural” real numbers, such as encoding several numbers into one.

Distributed Multi-armed Bandits

In distributed multi-armed bandits, there are M agents, labelled by 1,…,M. Each agent is given access to the same stochastic K-armed bandits instance. Each arm i in the instance is related to a distribution 𝒫i. 𝒫i is supported on [0,1] with mean μ(i). At any time step t=1,2,3,,T, each agent chooses an arm at,i and pulls this arm, then he would immediately obtain a reward rt,i independently sampled from 𝒫at,i. The goal of each agent is to minimize their cumulative regret, which is defined as

Distributed Linear Bandits

In distributed linear bandits, the agents are given access to the same d-dimension stochastic linear bandits instance. In particular, agents are given the same action set 𝒟d from time to time. At time step t, agent i may choose action xt,i𝒟 and observe reward yt,i. We assume that the mean of his reward is decided by an unknown parameter θ*d: yt,i=xt,iTθ*+ηt,i, where ηt,i[-1,1] have zero mean and are independent with each other. θ* is the same for all agents. We assume θ*2S. For distributed linear bandits, cumulative regret is defined as the sum of individual agent’s regrets:


In both distributed multi-armed bandits and distributed linear bandits task, our goal is to achieve an upper bound on the minimax regret that is close to that of a centralized algorithm (which is clearly the optimum) using as little communication as possible.

We are mainly interested in the case where T is the dominant factor (compared to M). Unless otherwise stated, we assume that T>MlogM+K in the multi-armed bandits case and that T>M in the linear bandits case.

1.2 Our Results

We now briefly discuss our protocols and our algorithmic results. We wish to achieve near-optimal regret guarantee with as little communication as possible. We consider a naive baseline solution first, namely immediate sharing: each agent broadcasts the taken arm and the reward immediately. This solution achieves near-optimal regret in both the MAB and the linear bandits task (O~(MKT) and O~(dMT) respectively), at the cost of high communication complexities (O(MT) and O(MTd) respectively). Our goal is to achieve near-optimal regret with low communication complexity: O~(MKT) regret in distributed MAB and O~(dMT) regret in distributed linear bandits. 22 2 Another trivial protocol is to run independent single agent algorithms with no communication at all. This approach has regret linear in M (Ω(MTK) and Ω(dMT) for MAB and linear bandits respectively).

We summarize our results in Table 1, that compares different algorithms in terms of regret and communication. Our protocols can be separated into two categories: optimism-based protocols and elimination-based protocols.

Setting Algorithm Regret Communication
Multi-armed bandits Immediate Sharing O(MKTlogT) O(MT)
Optimism (Sec. 3.1) O(MKTlogT) O(MKlogM)
Elimination (Sec. 4.1) O(MKTlogT) O(MlogT)
Lower bound (Sec. 5.1)33 3 Here, by “o(MKT) regret and Ω(M) communication”, we mean that for some fixed c, if communication complexity is less than M/c, total regret is necessarily Ω(MKT). See Theorem 5 for details. o(MKT) Ω(M)
Linear bandits Immediate Sharing O(dMTlogT) O(MTd)
Optimism (Sec. 3.2) O(dMTlog2T) O(M1.5d3)
Elimination (Sec. 4.2) O(dTMlogT) O((Md+dloglogd)logT)
Table 1: Summary of baseline approach and our results
Optimism-based Protocols:

The optimism in the face of uncertainty principle is widely implemented in the design of online learning algorithm. UCB algorithm [6] for multi-armed bandits and OFUL algorithm [1] for linear bandits are two celebrated implementations. We extend these algorithms to distributed setting, and achieve near optimal regret for both multi-armed bandits and linear bandits. A surprising result is that our communication complexity in both tasks is independent of the time horizon T.

Elimination-based Protocols:

Phased elimination algorithms [18] is another category of methods for best arm identification and regret minimization problem. They run in episodes (each episode consists of multiple time steps): In each episode, the agent pulls each arm from the living action set (i.e. actions that are not yet eliminated) for even times, and eliminate actions with low estimated rewards at the end of the episode. We extend phased elimination algorithm to distributed setting. Though with additional logT factor, the dependence of other parameters (M,K,d) in communication complexity is better than that of optimism-based protocols in both the MAB setting and the linear bandits setting. In particular, in the MAB setting, the dependence on K is removed.

Lower Bound for Multi-armed Bandits:

We also provide a lower bound of communication complexity in order to achieve non-trivial regret (i.e. o(MKT)) regret for distributed MAB. We show that an expected communication complexity of Ω(M) is necessary. This matches the communication upper bound of Protocol 3 except for a logT factor.

2 Related Work

Multi-armed Bandits

In multi-armed bandits, there are two main categories of algorithms: regret minimization and pure exploration. In the regret minimization setting [6, 9], the agent aims to maximize its cumulative reward in T steps (or, equivalently, minimize regret), for which it is necessary to balance exploration and exploitation. For K-armed bandits, known algorithms have regret O(KT) [4], which matches the lower bound. In the literature of pure exploration [5, 15], the goal of the agent is to identify the best arm (or ϵ-optimal arms) with high probability using as few samples as possible.

Linear Bandits

A natural extension of multi-armed bandits is linear bandits. The action set may be either stationary [13] or time-dependent [1, 12, 20]. When the action set has size K, [20] proved a regret upper bound of poly(loglogKT)O(dTlogTlogK), and a lower bound of Ω(dTlogTlogK), assuming K2d/2. The authors of [1] considered the case where the action set may be infinite; they proved a regret upperbound of O(dlogTT); a corresponding regret lowerbound is Ω(dTlogT), due to [20].

Distributed Bandits

Multi-agent bandit problems of various kinds have been considered in previous works [17, 14, 11, 26, 10, 25]; most of them have a setting different from ours. [14] considers a best arm identification model in multi-agent multi-armed bandits problem. They proposed a phased-elimination algorithm which achieves near-optimal sample complexity to identify ϵ-optimal arms using communication logarithmic in 1/ϵ. [25] employs a P2P network for communication, in which an agent can communicate with only two other agents at a time and their ϵ-greedy based strategy incurs linear communication cost with time horizon T. [10] also proposed an ϵ-greedy based policy with communication linear in logT/Δ2 where Δ is the optimality gap (i.e. the difference between the mean reward of the best arm and the mean reward of the second best one). They claimed that their protocol achieves near-optimal asymptotic regret bounds.

Our work differ from most previous works in two important ways. First, we measure the performance of protocols by worst-case regret; in contrast, many previous works [11, 10] consider asymptotic regret bounds, while others [14, 26] consider pure exploration. Second, we employ a communication model with a central server, while communications in previous works are done via broadcasting [10, 26, 14]. The server could coordinate the exploration of agents based on all the information he received. However, it should be noted that a physical server in the network is not necessary, since an arbitrary agent can be designated as the server.

We noticed there is a concurrent work of Tao et al. [26] which considered distributed multi-armed bandits in a pure exploration setting. They considered two variants of best arm identification: fixed-time (maximize the probability to find the best arm when the time budget is limited) and fixed-confidence (minimize the time steps to find the best arm under fixed confidence). Another difference lies in their message-passing model: in each communication round, an agent is allowed to broadcast one message. The number of communication rounds is then considered as the measure for communication cost.

As a comparison, although it’s not clear how their fixed-time protocol can be related to ours, we note that our protocols for regret minimization can be used to identify ϵ-optimal arms in their fixed-confidence setting. If we set Δ=ϵ, this becomes best arm identification. If we run our algorithms for T=O~(1/Δ2) steps, we are able to identify the best arm with high probability and optimal speed-up (under mild conditions 44 4 e.g. all the agents runs almost equally fast). In particular, our elimination-based protocol gives a distributed best arm identification algorithm with O(logT)=O~(log(1/Δ)) rounds of communication, which is comparable to their results since their protocol also requires O(log(1/Δ)) rounds of communication. Further, our optimism-based protocol induces an algorithm with constant communication cost (i.e. independent of Δ).

3 Optimism-based Protocols

In this section, we propose our optimism-based protocols for multi-armed bandits and linear bandits. Our protocol for multi-armed bandits is based on UCB algorithm. For linear bandits, we extend OFUL algorithm [1] to distributed case.

3.1 A UCB-based Protocol for Multi-armed Bandits

In classic multi-armed bandit problems, upper confidence bound (UCB) algorithms are very efficient in solving regret minimization problem. We show that a natural extension of UCB works well on distributed MAB. In particular, this UCB-based protocol achieves near-optimal regret bound O(MKTlogT) with communication complexity O(MKlogM)–a surprising result that communication complexity is independent of the time horizon T.

3.1.1 Protocol Sketch

We first take a glance at the UCB algorithm in single-agent setting [6]. The reason why this algorithm works well is that the agent form tighter estimates about the mean reward of each arm as he obtains more samples about that arm. The regret incurred by pulling an arm is bounded by the level of confidence of that arm. In particular, if the agent has N independent samples for some arm, then he can form a high confidence interval of length O~(1/N) for it, so he can safely pull this arm for another N times without any additional information, without hurting the regret too much.

We take use of the same idea in distributed setting: the server doesn’t have to initiate a communication about an arm until the total number of pulls for this arm among all agents is doubled. However, the problem is that the server cannot keep track of each agent online in order to know the actual pulls of this arm, since the communication complexity will be linear in MT if so. Therefore, we propose a protocol to enable the server keeping track of the lower bounds of the actual pulls using much less communication.

Concretely, for a fixed arm k, suppose now arm k has been pulled for N times among all agents, so the server needs to initiate a communication after arm k is pulled for another N times. Let l:=log2(N/M),r:=log2N, any agent will send a short message to the server (constant bits) whenever he has pulled arm k for another 2i(lir) times. The server can thereby maintain a lower bound on the number of pulls for arm k, and initiates a communication as long as this lower bound exceeds N. On the other hand, the total number of pulls will also be O(N) when the communication starts if the agents communicate in this way. It’s easy to observe that the total communication cost in this communication round is at most O(M(r-l))=O(MlogM), and further more, we can show that the communication in this episode is actually O(M). We will show afterwards that the total communication rounds will be O(logM), so the total communication complexity will be O(MKlogM) since there are K arms.

3.1.2 Protocol Description

First of all, we will introduce some notations.

  • Np(k): the number of times arm k is actually taken by agents that has been shared before episode p.

  • vp(k): empirical average of the Np(k) observations.

  • Ni(k): the number of times k is taken not yet forwarded to the server by agent i.

  • vi(k): empirical average corresponding to the Ni(k) observations.

We use superscript (p) to denote the value of a variable at the p-th episode. Define


We define


For arm k in episode pk, if the server finds that the total number of pulls among all agents in this episode is equal to or greater than Nmax(pk), then all agents synchronize information of arm k and pk=pk+1. D is a parameter that will be determined in the analysis. {algorithm}[H] Protoco l : Server \[email protected]@algorithmic[5] \Fork=1,,K \Stateset pk=0 \Stateset Ni(k)=0 for i=1,,M \EndFor\WhileTrue \StateReceive Ni(k) from agent i. \StateCalculate the sum of Ni(k) for all agents as N^(k) \IfN^(k)Nmax(pk) \StateSend signal k to agents, \Statepk=pk+1 \WhileNot all agents send their local estimates to the server \StateUpon receiving (k,Ni(k),Vi(k)), update Npk(k) and vpk(k) accordingly \EndWhile\Statesend Npk(k) and vpk(k) to agents. \EndIf\EndWhile



Protocol 1: Agent i \[email protected]@algorithmic[5] \Fork=1,,K \Stateset pk=0 \EndFor\WhileTrue \StateTake action k with largest UCB(k) \StateObserve reward, update Ni(k) and vi(k) \IfNi(k){2j:j} and Ni(k)Nmax(pk)/M for some k \StateSend Ni(k) to server \EndIf\Ifreceive signal (k) from server \StateSend (k,Ni(k),vi(k)) to server, set Ni(k) and vi(k) to 0. \Statepk=pk+1 \StateUpon receiving (k,N(k),V(k)), update local Npk(k) and vpk(k) accordingly \EndIf\EndWhile

3.1.3 Analysis

We will state some useful lemmas first.

Lemma 3.1.

For any p0, and any arm k[K],


and for any p1,




Note that Np(k)-Np-1(k) is the actual number of pulls for arm k among all agents in episode p-1 (the episodes are numbered from 0). It’s easy to observe Np(k)-Np-1(k)Nmax(p-1) by our protocol, so Np(k)i=1p-1Nmax(i)=Nmax(p).

To upper bound Np(k)-Np-1(k), we divide the agents into two groups:

The first group contains agents that do not send any notifications to the server before the server tells him to communicate in episode p-1. Let P0 be the total number of pulls for arm k by the first group. Obviously each agent in the first group pulls arm k for at most 2Nmax(p-1)/M times, so P02Nmax(p-1). The second group, on the other hand, contains agents that send messages to the server at least once. Let P1 be the total number of pulls for arm k by this group. Consider the last communication in episode p-1 before the server decides to initiate a communication round. Denote the value of N^(k) before this communication by C0, and the value of N^(k) after this communication by C1. Observe that C12C0<2Nmax(p-1) and P12C1. Therefore, P14Nmax(p-1).

Finally, we have Np(k)-Np-1(k)=P0+P16Nmax(p-1), which implies Np(k)6Nmax(p-1).

Using Azuma-Hoeffding’s inequality and union bound, we can prove the following lemma.

Lemma 3.2.

Throughout the algorithm, if there are m samples for arm k, with probability at least 1-M/T3,


where c is some constant.

Furthermore, we can bound the communication cost in a single episode.

Lemma 3.3.

The communication complexity in one episode is O(M).


For any arm k, suppose the server needs to initiate a communication after this arm get pulled for another N times. It’s easy to observe that the total communication in updating Np(k) and vp(k) is O(M). Now we turn to the communication used for the server to keep track of N^(k).

Let l:=log2(N/M),r:=log2N, then the agents will send a notification to the server if he has pulled arm k for 2i times where i=l,l+1,,r. The server maintains a counter N^(k), which calculates the total number of pulls that the agents have reported to him. If server finds that N^(k)N, he will initiate a communication.

If the server received a notification claiming some agent has pulled arm k for 2i times, then N^(k) will increase by 2i-1 (if i>l) or 2i (if i=l). This means whenever the server received a message, N^(k) will increase by at least 2l. Therefore, the communication cost in this episode is at most N/2l+O(M)=O(M). ∎

Theorem 1.

We can achieve expected regret O(KMTlogT) with communication complexity O(MKlogM).


Regret: Let Nk be the total number of pulls for arm k in MT steps. We use L~ to hide constants and logT.

The expected regret caused by the failure of lemma 3.2 is at most MT1/(MT3)=O(1), thus we assume lemma 3.2 holds throughout our algorithm.

We divide REG(T) into two parts REG(T)=REG1(T)+REG2(T). REG1(T) denotes the regret incurred by all the arms when they are in episode 0, REG2(T) denotes the regret incurred in episodes >0.

Observe that for any arm, the regret incurred in episode 0 is at most O(MD). The worst case is that the O(D) pulls (at most 6D by lemma 3.1) in episode 0 are distributed evenly among all agents. Thus, REG1(T)O(KMD).

For any arm k with Nk>D, the total number of episodes for this arm will be log2(Nk/D). According to our upper confidence bound nature, we can bound REG2(T) as follows:

RET2(T) k=1Kp=1log2(Nk/D)L~Np(k)(Np+1(k)-Np(k))

The first equality is by lemma 3.1.

Therefore, we have


Choose D=T/K. Regret is O(KMTlogT).

Communication: Using lemma 3.3, communication complexity can be bounded by

O(Mk=1KlogNkD) O(MKlog(kNk/KD))

3.2 An OFUL-based Protocol for Linear Bandits

In this subsection, we propose a distributed algorithm based on OFUL algorithm [1]. We show that our protocol can achieve O(dMTlog2(T)) regret with communication complexity O(M1.5d3). Similar to that of our UCB-based protocol for multi-armed bandits, the communication complexity of this protocol is independent of the time horizon T. Note that though action set 𝒟 does not change with time in our basic setting, our results can be directly applied to the case with time-variant action set 𝒟t.

3.2.1 Protocol Sketch


[H] OFUL algorithm

\[email protected]@algorithmic

[1] \Fort:=1,2, \State(xt,θ~t)=argmax(x,θ)𝒟×𝒞t-1x,θ \StatePlay xt and observe reward yt \Stateupdate 𝒞t \EndFor

To begin with, we review the OFUL algorithm [1], which is based on the optimism in the face of uncertainty (OFU) principle. The core idea is to maintain a confidence set 𝒞t-1d for the parameter θ*. In each step, the algorithm chooses an optimistic estimate θ~t=argmaxθ𝒞t-1(maxx𝒟x,θ) and then chooses action Xt=argmaxx𝒟x,θ~t, which maximizes the reward according to the estimate θ~t. 𝒞t is calculated from previous actions (x1,x2,,xt) and rewards (y1,y2,,yt):

𝒞t={θd:θ^-θV¯t2log(det(V¯t)1/2det(λI)-1/2δ)+λ1/2S}, (1)

where V¯t=λI+τ=1txτxτT and θ^=(λI+τ=1txτxτT)-1τ=1txτyτ.

Our observation is that the volume of the confidence ellipsoid depends on det(V¯t). If det(V¯t) does not vary greatly, it will not influence the confidence guarantee even if the confidence ellipsoid is not updated. We divide all steps into several epochs and only synchronize samples at the end of each epoch. As det(V¯T) is upper bounded, the number of those bad epochs, in which det(V¯t) varies greatly, is limited. This inspires our analysis. Meanwhile, in our algorithm, we also use det(V¯t) as part of our synchronization condition.

3.2.2 Protocol Description

In our distributed algorithm, each agent uses all samples available for him to compute the confidence ellipsoid. These samples are composed of two parts: samples obtained before last synchronization and samples that haven’t been uploaded. Once the condition of synchronization is satisfied for at least one agent, all agents send data to the server, and the server returns processed data.

We denote τxτxτT and τxτyτ as W and U in our algorithm respectively. For the sake of communication and computation efficiency, we only transmit W and U in communication rounds.



Protocol 2: For Server \[email protected]@algorithmic[5] \StateWsyn=0, Usyn=0 \Statetlast=0 \WhileTrue \StateWait until receiving synchronization signal from an agent \StateSend uploading signals to all agents \StateReceive Wnew,Unew from agent i as Wnew,i,Unew,i. \StateCalculate Wsyn=Wsyn+i=1MWnew,i, Usyn=Usyn+i=1MUnew,i \StateSend Wsyn and Usyn to all agents. \EndWhile



Protocol 2: For Agent i \[email protected]@algorithmic[5] \StateWsyn=0, Usyn=0 \StateWnew=0, Unew=0 \Statetlast=0, Vlast=λI \Fort=1,,T \StateComputes the confidence ellipsoid using Eq.1 based on Wsyn+Wnew and Usyn+Unew. Denote his results as Vt,i, θ^t,i and 𝒞t,i \State(xt,i,θ~)=argmax(x,θ)𝒟×𝒞x,θ \Stateagent i plays xt,i and gets the result yt,i \StateWnew=Wnew+xt,ixt,iT, Unew=Unew+xt,iyt,i. \Iflog(detVt,i/detVlast)(t-tlast)>D \StateSend a synchronization signal to the server to initiate a communication round. \EndIf\Ifreceiving uploading signal from the server \StateUpload Wnew,Unew \StateReceive Wsyn,Usyn from the server \StateWnew=0, Unew=0 \Statetlast=t, Vlast=λI+Wsyn \EndIf\EndFor D is a parameter that will be determined in our analysis.

3.2.3 Analysis

First of all, we state lemmas from previous results that will be useful in our proof.

Lemma 3.4.

(Theorem 2 in [1]) With high probability, θ* always lies in the constructed Ct,i for all t and all i.

Lemma 3.5.

(Lemma 11 in [1]) Let {Xt}t=1 be a sequence in Rd, V is a d×d positive definite matrix and define V¯t=V+s=1tXsXs. Then we have that


. Further, if Xt2L for all t, then t=1nmin{1,XtV¯t-1-12}2(logdet(V¯n)-logdetV)2(dlog((trace(V)+nL2)/d)-logdetV), and finally, if λmin(V)max(1,L2) then


Using Lemma 3.4, we can bound single step pseudo-regret rt,i.

Lemma 3.6.

With high probability, single step pseudo-regret rt,i=θ*,x*-xt,i is bounded by


Assuming θ*𝒞t,i,

rt,i =θ*,x*-θ*,xt,i

Theorem 2.

Our protocol can achieve a regret of O(dMTlog2(T)) with O(M1.5d3) communication complexity.


Regret: In our protocol, there will be a number of epochs divided by communication rounds. We denote Vlast in epoch p as Vp. Suppose that there are P epochs, then VP will be the matrix with all samples included.

Observe that detV0=det(λI)=λd. det(VP)(tr(VP)d)d(λ+MTL/d)d. Therefore


Let R=dlog(1+MTLλd). It follows that for all but R epochs, we have


We call these epochs good epochs. In these epochs, we can use the argument for theorem 4 in [1]. First, we imagine the MT pulls are all made by one agent in a round-robin fashion. I.e., he takes x1,1, x1,2, …, x1,M, x2,1, …, x2,M, …, xT,M. We use V~t,i to denote the V¯ this imaginary agent calculates when he gets to xt,i. If xt,i is in one of those good epochs(say the p-th epoch), then we can see that



rt,i O(dlogTδ)xt,iTV¯t,i-1xt,i

We can then use the argument for the single agent regret bound and prove regret in these good epochs.

For epoch p, we denote regret in all good epochs as REGgood. Suppose p means the set of (t,i) pairs that belong to epoch p, and Pgood means the set of good epochs, using lemma 3.5, we have

REGgood =tirt,i

Now we focus on epochs that are not good. Consider such a bad epoch where we reindex the matrices for convenience. Suppose at the start we have a Vlast. Suppose that the length of the epoch is n. Then agent i proceeds as V1(i),,Vn(i). Our argument above tells us that


Now, for all but 1 agent, nlogdetVn(i)detVlast<D. Therefore we can show that


We also know that the number of such epochs are rare. (Less than R=O(dlogMT)). Therefore the second part of the regret is


If we choose D=(TlogMTdM), then REG(T)=O(dMTlog2(MT)). Since T>M, we have


Communication: Let C=(DTR)0.5. There could be at most T/C such epochs that contains more than C rounds. For those containing less than C steps, log(detVj+1detVj)>DC. There could be at most RD/C=RCD such epochs. Therefore, the total number of epochs is bounded by


With our choice of D, this is


At each communication round, we need to pass O(Md2) amount of data. Hence total communication complexity should be


4 Elimination-based Protocols

In this section, we discuss elimination-based protocols for MAB and linear bandits. Compared with that of optimism-based protocols, the communication complexities of elimination-based protocols enjoy a better dependence on K (MAB setting) or M and d (linear bandits setting); however, in both cases, an additional logT factor is introduced.

4.1 An Elimination-based Protocol for Multi-armed Bandits

In this subsection, we propose a distributed version of the phased-elimination algorithm [7]. We show that it achieves O(MKTlogT) regret with communication complexity of O(MlogMKT). Note that under our usual assumption of T>M+K, the dependence on K is removed.

4.1.1 Protocol Sketch

The phased-elimination algorithm [7] works with the following principles: 1. it maintains a set of good arms A, which is initially set to be [K]; 2. at every phase, each arm within the set is pulled for the same number of times; 3. at the end of each phase, bad-performing arms are eliminated, with a margin that halves each phase.

It can be noted that the phased-elimination algorithm seems well-suited for distributed applications: the task of sampling each arm can be easily parallelized. If the number of remaining arms is greater than M, we assign arms to agents (each agent gets multiple arms); if the number of remaining arms is less than M, we assign agents to arms (each agent is assigned to one arm). We design our distributed elimination algorithm based on the following additional observations:

  1. 1.

    If the number of arms is less than M, we can easily simulate phased elimination with O(M) communication;

  2. 2.

    If the number of arms is greater than M, but at the end of a phase, the number of remaining arms assigned to each agent remains near-uniform55 5 By sayiung a sequence of positive numbers is near-uniform, we mean that the maximum is at most 2 times the minimum., we only need to compute the best arm in this phase (O(M) communication) and proceed to the next phase;

  3. 3.

    By assigning arms to agents randomly at initialization, the number of remaining arms is likely to remain near-uniform, until the number of remaining arms drops to O(MlogM).

4.1.2 Protocol Description



Protocol 3: For Server \[email protected]@algorithmic[1] \StateInput: K,M,T \StateGenerate K random numbers from uniform distribution on [M]; send K/M of these random numbers to each agent (Share Randomness) \Statek0=K, kmax=K/M, A= \Forl=1,2,3, \StateSend kmax to each agent\Ifkl-1<C0M \StatePartition arms in Al to agents evenly; schedule ml=4l+2lnMKT pulls for each arm \StateWait for all results to return \StateFor each iAl, calculate μ^i,l: the empirical mean of reward associated to arm i \StateEliminate low-rewarding arms:


kl=kl-1 \Else\StateWait for each agent to report (aj,l*,μ^j,l*) in this phase \StateCalculate ul*=maxj[M]μ^j,l*, and send to every agent \StateWait for each agent to report their remaining number of arms nl(j) \Ifmaxj[M]nl(i)>2minj[M]nl(i) \StateRearrange arms to make number of arms near-uniform \EndIf\StateUpdate kmax to be the maximum number of arms a agent holds; update kl to be the total number of remaining arms \Ifkl<C0M \StateAsk agents to report all remaining arms, gather them in Al+1 \EndIf\EndIf\StateNotify agents to proceed \EndFor



Protocol 3: For Agent i \[email protected]@algorithmic[1] \StateInput: K,M,T \StateA(i) is given by the set of random numbers received at the beginning (Shared Randomness) \Statek0=K \Forl=1,2,3, \StateReceive kmax \Ifkl-1<C0M \StateReceive (Ai,l,m) from the server \ForjAi,l \StatePull arm j for m times \EndFor\StateUpload the average reward to the server; kl=kl-1 \Else\StatePull each arm ml=4l+2lnMKT times and store the average for arm a as u(a,l) \StatePull arms arbitrarily to wait to make total pulls ml×kmax \StateSend (argmaxu(,l),maxu(,l)) to server, wait for server to calculate maximum ul* \StatePerform elimination on local arm set


report remaining number of arms nl(i)=|Al+1(i)| to server, wait for instruction \StateUpdate kl \EndIf\StateWait for notification from server \EndFor

Note: By “allocate arms to agents evenly” and “schedule m pulls”, we mean the following. If |Al|M,

  • The server partitions Al into M disjoint sets: A1,lA2,lAM,l such that

  • mi,l is set to be ml for all i

  • If |Al|/M is not an integer, we set flagi to 1 if and only if


    This way, within each l, the total number of pulls for any agent will be the same.

If |Al|<M,

  • The server partitions [M] into n=|Al| disjoint sets B1Bn such that

  • If agent i is in Bj, the instruction will be


When timestep T+1 is reached, we automatically terminate the protocol.

4.1.3 Analysis

First of all, we state the following facts regarding the execution of the protocol.

Fact 1.

We use Al to refer to the set of remaining arms at beginning of phase l (it is either Al or iAl(i) in the code). Then Al behaves as if a centralized phased-elimination algorithm is acting on it.

Fact 2.

For a fixed arm i, in a given phase l, i is pulled (by all agents) at most max{2ml,M/K+1} times.

Fact 3.

Let L=lnMT. Then the largest reached l when executing the protocol is at most L.


If the protocol terminates at the l-th phase, then the l-1-th phase must be terminated. Therefore 4l+1lnMKTMTexp{L}. We conclude that l<L. ∎

For convenience, if the protocol is terminated during the l-th phase (l<L), we define Al+1,…,AL to be Al.

Next, we use arguments similar to those used in the single agent phased-elimination algorithm, and prove the following useful lemmas. With out loss of generality, we assume that arm 1 is the optimal arm. Let Δk=μ(1)-μ(k)0.

Lemma 4.1.

With probability 1-2/(MKT), for all phases l, all arms iAl,


For any fixed l, fixed iAl, μ^i,l is the empirical average of ml samples. Therefore by Hoeffding’s inequality,

Pr[|μ^i,l-μi,l|>2-l-1] 2exp{-ml2-2l8}

The number of phases is at most LMT. Taking a union bound over all i and l proves the theorem. ∎

Lemma 4.2.

With probability 1-2/(MKT), the following holds: 1. With probability , 1Al for all l.
2. For i[K] with Δi>0, define li=log2Δi-1+1. Then iAli+1.


We condition on the event described in lemma 5.1. Observe that


Therefore arm 1 will never be eliminated. This proves the first assertion. On the other hand, at the end of round li,

μ^i,l μ(i)+2-li-1=μ(1)-Δi+2-li-1μ(1)-2-li-1.

Therefore even if i remains to Ali, it will be eliminated at that round. This proves the second assertion. ∎

We are now ready to state and prove our main results for the elimination-based algorithm for distributed multi-armed bandit.

Theorem 3.

Protocol 3 achieves expected regret O(MKTlogT). It has communication complexity bounded by O(MlogT+K) (worst case), and expected communication complexity O(MlogMKT).


Regret: Let nT(i) be the number of times arm i is pulled upon termination, which is a random variable. We know that total expected regret is


We also know that in phase l, either (1) the phase only lasts for 1 step (this happens when kl-1mlM) or (2) any arm is pulled for at most 2ml times. If the phase only lasts for one step, then it holds that (i) there would be at most M/K+12M/K pulls for an arm; and that (ii) 4lln(MKT)mlM.

So by the lemmas above,

𝔼[nT(i)] l=1li2ml+2MKTMT+2MlnMK

where C1 and C2 are universal constants. Assume that T>MlnM+K. Then

REG(T) i:Δi>ϵC11ΔilnMKT+i:Δiϵ𝔼[nT(i)]Δi+C2MlnM

Choose ϵ=KlnT/(MT). Then we get REG(T)C5KMTlnT. Herer C3, C4 and C5 are all universal constants.

Communication: First, consider the protocol running with kl-1<C0M. Then, sending instructions involve sending at most 2C0M numbers; similarly receiving averages involve communication O(M) of numbers. Hence, after a protocol has kl-1<C0M, total communication is at most O(ML)=O(MlogT).

Now, consider the case with kl-1>C0M. Suppose that at the end of phase l, if no reassignment is needed, then communication in a phase is still O(M). If reassignment is needed, the number of communicated numbers is at most the number of arms eliminated between that phase and the last reassignment (or assignment). Therefore, when running with kl-1>C0M for several phases, total communication is bounded (up to a universal constant) by MlogT plus the number of eliminated arms. Therefore the total communication in the worst case is O(MlogT+K).

Now we focus on expected communication complexity. First, observe that the elimination process and the allocation of arms is completely independent. Therefore, we can apply concentration inequality for the set of remaining arms (although this set of arms is random, independence enables us to not use a union bound). More specifically, let

Xj,l=I[jAl],Yj,i=I[arm j is allocated to agent i initially].

Then Xj,l and Yj,i are independent. Now suppose that no reassignment happened until phase l-1. In phase l, after elimination, the number of arms remaining for agent i is nl(i)=j=1KXj,lYj,i. We have that


By Chernoff’s bound,

Pr[nl(i)>1.2𝔼[kl]M]+Pr[nl(i)<0.8𝔼[kl]M] <δ,

as long as


(C6 is some universal constant.) Here, we only need δ to be (MKL)-1. After taking a union bound on l and i, we arrive at the following statement: for any l, if


then with probability 1-1/K, there will be no reassignment until phase l.

On the other hand, if


we can see that remaining total communication is


by our worst case argument above. Therefore, total expected communication is upperbounded by


Under our usual assumption that T>MlnM+K, this can be simplified to O(MlogT). ∎

One caveat in this protocol is that it uses KlogM bits of public randomness. However, using Newman’s theorem [22], we can remove this usage of public randomness at the cost of O(Mlog(MT)) bits of additional communication. Since the task is not evaluating a function here, the argument is a simple modification of that of Newman’s theorem (see Appendix. B).

4.2 An Elimination-based Protocol for Linear Bandits

This protocol is based on a single agent algorithm (algorithm 12 in [18]), which also iteratively eliminate arms from the initial action set. Similar to Protocol 3, we parallelize the data collection part of each phase, and send instructions in a communication-efficient way.

4.2.1 Protocol Description

For convenience, we define V(π)=a𝒟π(a)aaT, g(π)=maxa𝒟aTV(π)-1a. {algorithm}[H] Protocol 4: For Server \[email protected]@algorithmic[1] \StateA1=𝒟 \Forl=1,2,3, \StateFind distribution πl() over Al such that: 1. its support has size at most ξ=32dlnlnd; 2. g(π)2d. \StateSchedule ml(a) pulls for each arm aAl; ml(a)=C14ld2πl(a)lnMT \StateReceive rewards for each arm aAl reported by agents \StateFor each arm in the support of πl(), calculate the average reward μ(a) \StateCompute


Send θ^ to all agents \StateEliminate low rewarding arms:



Protocol 4: For Agent i \[email protected]@algorithmic[1] \StateA1=𝒟 \Forl=1, \StateFind distribution πl() over Al such that: 1. its support has size at most ξ=32dlnlnd; 2. g(π)2d. \StateReceive a set of arms, and numbers to pull each of them \StatePull the arms according to instruction \StateReport the average reward for each arm \StateReceive θ^ from server \StateEliminate low rewarding arms on local copy:


As above, if the protocol reaches timestep T+1, it is automatically terminated. The specific meaning of “schedule ml(a) pulls for arm a” will be explained in the proof of Theorem 4.

The task in line 3 of the protocol is closely related to finding a G-optimal experiment design [23]. In the G-optimal design task, we require g(π) to be actually minimized, and the minimum value is d [16]. For the exact G-optimal design task, it is known that there is a solution π* such that the support of π* (also known as core set in literature) has size at most ξ=d(d+1)/2 [27]. Here, we only need a 2-approximation to g(π), but require the solution to have a small support size S.

This task can be shown to be feasible. In particular, when 𝒟 is finite, the Frank-Wolfe algorithm under appropriate initialization can find such an approximate solution (see Proposition 3.19 [27]). When 𝒟 is infinite, we may replace 𝒟 with an epsilon-net of 𝒟 (and only take actions in the ϵ-net). If ϵ<1/T, then our regret bound when the action set is the ϵ-net implies the regret bound when the action set is 𝒟. This approach, albeit feasible, may not be efficient.

4.2.2 Analysis

First, we consider some properties of the above protocol.

Fact 4.

The length of the l-th phase is

Lemma 4.3.

At phase l, with probability 1-1/TM, for any aD,


First, construct an ϵ-covering of 𝒟 with ϵl=2-l-2. Denote the center of the covering as X={x1,,xQ}. Here Q(C2)d(l+2).

For fixed a𝒟, we know that with probability 1-2δ,


In our case,


Therefore with probability 1-2δ,


Choose δ=1/(2TMQ). We can see that that there exists a C1 such that with probability 1-1/(TM), for all aX


Now, consider an arbitrary a𝒟. There exists xX such that a-x2-l-1. Therefore with probability 1-1/TM, for any a𝒟,

|θ^-θ*,a| |θ^-θ*,x|+|θ^-θ*,a-x|

Lemma 4.4.

Let a*=argmaxaDθ*,a be the optimal arm. Then with probability 1-log(MT)/(TM), a* will not be eliminated until the protocol terminates.


If a* is eliminated at the end of round l, one of the following must happen: either (1) |θ^-θ*,a|>2-l; or (2) there exists aa*, |θ^-θ*,a|>2-l. Therefore the probability for a* to be eliminated at a particular round is less than 1-1/(TM). The total number of phases is at most logMT. Hence a union bound proves the proposition. ∎

Lemma 4.5.

For suboptimal aD, define la=inf{l:42-lΔa}. Then with probability 1-δ, for any suboptimal a, aAla. δ=2log(TM)/TM.


First, let us only consider the case where a* is not eliminated. That is,

Pr[a𝒟:aAla] Pr[a* eliminated]+Pr[a:aAla-1,aAla|a*Ala].

Note that conditioned on a*Ala, {aAla-1aAla} implies that at phase la, either |θ^-θ*,a|>2-la or |θ^-θ*,a*|>2-la. Therefore the probability that there exists such a is less than log(TM)/TM. Hence, with probability 1-2log(TM)/TM, a will be eliminated before phase la. ∎

We are now ready to state and prove our main result for Protocol 4.

Theorem 4.

Protocol 4 has expected regret O(dTMlogT), and has communication complexity O((Md+dloglogd)logT).


Regret: We note that at the start of round l, the remaining arms have suboptimality gap at most 42-l. Suppose that the last finished phase is L. Therefore total regret is

REG(T) l=1LC14ld2logMT42-l+δ2MT

Apparently C14Ld2logTMTM. Therefore


Under our usual assumption that T>M, this can be simplified to O(dTMlogT).

Communication Complexity: Observe that at the start of each phase, each agent has the same Al as the server. Therefore, at line 3 they obtain the same πl. In that case, when allocating the arms, the server only needs to send a index (log2ξ bits), instead of a vector (Ω(d) bits), to identify an arm.

Now we specifically state how to schedule ml(a) pulls for each arm. Let us use pl=aml(a)/M to denote the average pulls each agent needs to perform. Starting from the arm with the largest ml(a), we schedule pulls in full blocks whenever possible. For instance if ml(a)=4.1pl, we give assign four agents with arm a only; and for the fifth agent, we assign him with 0.1p of load. Suppose that this scheduling is done one arm by another with descending load. Continuing the above example, if for the next arm ml(a)=3.5pl, we assign 0.9pl to the fifth agent, and the rest to 3 agents. Observe that for each arm, the number of agents that it is assigned to is at most 1+ml(a)/pl agents. Therefore, total communication for scheduling is at most


Similarly, total communication for reporting averages is the same. The cost for sending θ^ is Md. Hence, communication cost per phase is O(Md+dloglogd). On the other hand, total number of phases is apparently O(logTM). Hence total communication is


Under the assumption that T>M, this can be simplified to O((Md+dloglogd)logT). ∎

Note that the replicated calculation of πl() on the agent’s side is a trick purposed to reduce the number of bits/packets required to designate an arm. If this trick is not used, the server may need d numbers, instead of one, to identify an arm; therefore the total communication complexity will become O((Md+d2loglogd)logT).

5 Discussion

5.1 Lower Bound for Multi-armed Bandits

Intuitively, in order to avoid a Ω(MKT) scaling of regret, O(M) amount of communication is necessary: otherwise most of the agents can hardly do better than a single-agent algorithm. We prove this intuition in the following lower bound(which applies to the expected communication complexity of randomized protocols).

Theorem 5.

For protocols with expected communication complexity less than M/c, there exists multi-armed bandits instances such that total regret scales as Ω(MKT). Here c=3000.

The proof of this lower bound is a simple reduction from single-agent bandits to multi-agent bandits, mapping protocols to single-agent algorithms (see Appendix. C). Since one can simulate MAB with linear bandits by setting an action set of K orthogonal vectors, this directly induces a Ω(MdT) lower bound for linear bandits when communication complexity is less than M/c.

We note that our results in distributed multi-armed bandits match this lower bound except for a logT factor. In this sense our protocol for multi-armed bandits is indeed communication-efficient. Moreover, since Ω(MKT) is achievable with no communication at all, it is essentially the worst regret for a communication protocol. Therefore, this lower bound suggests that the communication-regret trade-off for distributed MAB is a steep one: with O(MlogT) communication, regret can be near-optimal; with slightly less communication (M/c), regret necessarily deteriorates to the worst case.

5.2 Comparison of Optimism-based and Elimination-based Protocols

In the distributed multi-armed bandit task and linear bandit task, optimism-based protocols (Protocol 1 and 2) require O(MKlogM) and O(M1.5d3) communication respectively. Elimination-based protocols, however, require only (MlogT) and O(MdloglogdlogT). In the regime where T is not exponentially large, elimination-based protocols seem to be more communication-efficient than optimism-based protocols. In particular, the dependence on K or d is significantly lower; this is mostly because elimination-based protocols allocate arms to agents.

On the other hand, we also note that optimism-based protocols (Protocol 1 and 2), compared to elimination-based ones (Protocol 3 and 4), seem somewhat more natural and practical. For instance, the optimization involved in Protocol 4 is potentially much harder than the optimization used in Protocol 2.

5.3 Future Work

In this work, we propose communication-efficient protocols with near-optimal worst-case regret bounds. However, this may not guarantee close-to-centralized performance for a particular instance. Adopting our results to the case where instance-dependent regret is the performance measure may be an interesting problem. Another interesting problem is providing a better communication lower bound for distributed linear bandits (in order to achieve near-optimal worst-case regret). We conjecture that close-to-Md amount of communication may be necessary, which is the case in offline distributed linear regression [29, 8] in the context of achieving an optimal risk rate.

It will also be great if the two design principles, namely optimism and elimination, can be combined to get the best of both worlds. Communication complexity of such protocols potentially may enjoy small dependencies on timescale T and size of action set (K or d) simultaneously.

We also realize that real-world distributed systems typically suffer communication latency, and computation of the agents may be asynchronous. Therefore it would be interesting to consider protocols that tolerates latency and delays. However, we note that to some extent, the synchronous case is the hard case for regret minimization, as there are M agents working in concurrent throughout the execution. If we can keep most agents and only actively run 1 agent at a time, optimal regret can be easily achieved by simulating a single-agent algorithm, using communication less than M times the space complexity of that single-agent algorithm.

6 Acknowledgments

The authors thank Chi Jin, Chicheng Zhang and Nan Jiang for helpful discussions.


  • [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [2] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
  • [3] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems, pages 17–24, 2009.
  • [4] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
  • [5] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010, pages 13–p, 2010.
  • [6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • [7] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
  • [8] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011–1020. ACM, 2016.
  • [9] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • [10] Swapna Buccapatnam, Jian Tan, and Li Zhang. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 2605–2613. IEEE, 2015.
  • [11] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In IJCAI, pages 164–170. ijcai.org, 2017.
  • [12] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • [13] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.
  • [14] Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854–862, 2013.
  • [15] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
  • [16] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
  • [17] Nathan Korda, Balázs Szörényi, and Li Shuai. Distributed clustering of linear bandits in peer to peer networks. In Journal of machine learning research workshop and conference proceedings, volume 48, pages 1301–1309. International Machine Learning Societ, 2016.
  • [18] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2019.
  • [19] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
  • [20] Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. arXiv preprint arXiv:1904.00242, 2019.
  • [21] Oded Maron and Andrew W Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in neural information processing systems, pages 59–66, 1994.
  • [22] Ilan Newman. Private vs. common random bits in communication complexity. Information processing letters, 39(2):67–71, 1991.
  • [23] Friedrich Pukelsheim. Optimal design of experiments. SIAM, 2006.
  • [24] Alistair Sinclair. Lecture 11: February 20. https://people.eecs.berkeley.edu/~sinclair/cs271/n11.pdf.
  • [25] Balázs Szörényi, Róbert Busa-Fekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, pages 1056–1064. International Machine Learning Societ, 2013.
  • [26] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. arXiv preprint:1904.03293, 2019.
  • [27] Michael J Todd. Minimum-volume ellipsoids: Theory and algorithms, volume 23. SIAM, 2016.
  • [28] You-Gan Wang. Sequential allocation in clinical trials. Communications in Statistics-Theory and Methods, 20(3):791–805, 1991.
  • [29] Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.

Appendix A A A Reduction from Regret Minimization Protocols to Best Arm Identification Protocols

We show how to produce a communication protocol identifying the best arm with high probability using our regret minimization protocols.

Theorem 6.

For any multi-armed bandits instance ν, suppose arm 1 is the best arm. Let Δi(i>1):=μ(1)-μ(i),Δmin:=mink>1Δk,Δmax:=maxk>1Δk. We assume that Δmax/Δmin<cd for some constant cd>0.

Assume that we have a a regret minimization protocol A(M,K,T), in which the server is able to calculate the number of pulls among all agents for each arm at the end of the protocol (our protocol 1 and protocol 3 both satisfy this property). Moreover, A(M,K,T) achieves near-optimal minimax regret bound O(MKTlogT).

Given A(M,K,T) running on ν, there exists a new protocol A(M,K,ϵ) that identifies the best arm as long as Δmin/2<ϵ<Δmin with probability at least 1-δ for any δ>0. Furthermore, A(M,K,ϵ) has Ω~(M) speed-up, with communication cost at most CA(M,K,O~(K/(Mϵ2))).


𝒜(M,K,ϵ) runs on ν as follows:

  1. 1.

    Let T1 be some integer that T1=O~(K/(Mϵ2)). Run 𝒜(M,K,T1).

  2. 2.

    Suppose the global pulls for arm k among all agents at the end of 𝒜 is Nk (maintained by the server), the server randomly choose an arm k* with probability Nk/k=1KNk for arm k.

  3. 3.

    returns arm k*.

Let’s analyze protocol 𝒜 now.

Recall that


so we have


By choosing proper T1=O~(K/(Mϵ2)), we have


Therefore, by Markov’s inequality,


Note that by median trick [24], in order to bound the error probability by δ, the server only needs to sample k* independently for O(log(1/δ)) times, and return the majority vote of them.

It’s easy to observe the communication cost of 𝒜 by C𝒜(M,K,T1)=C𝒜(M,K,O~(K/(Mϵ2))).

According to [5], at least Ω~(log(1/δi=1K1/Δi2) samples are required by any single-agent algorithm to identify the best arm with probability 1-δ. By our assumption we know that K/ϵ2=O(i=1K1/Δi2), therefore the speed-up is Ω~(M).

Note that if 𝒜 is Protocol 3, we are able to identify the best arm using O~(log(K/MΔmin)) communication rounds. Further more, if 𝒜 is our Protocol 1, we can identify the best arm with communication complexity independent of Δmin; hence it is a very efficient protocol when Δmin is small.

Appendix B B Removing Public Randomness

Protocol 3 exploits a certain amount of shared randomness. We now discuss how to remove the usage of public randomness with little increase in communication complexity.

The role of shared randomness in communication complexity has already been investigated; it is known that we can efficiently “privatize” shared randomness, as stated by the Newman’s Theorem [22]. In our case, the arugment is slightly different: we are considering regret instead of evaluating a function. In particular, we show the following theorem.

Theorem 7.

There exists a protocol that does not use public randomness with expected regret O(MKTlogT). It has communication complexity bounded by O(MlogT+K) (worst case), and expected communication complexity O(MlogT).


We make the following modifications to Protocol 3. Instead of using a public random bit string r, we predetermine B strings r1,…,rB, and randomly choose from them. That is, the server will generate a random number in [B], and broadcast it to everyone. The communication cost of doing so will be MlogB. We now analyze how the choice of r1,,rB affects our protocols performance.

Now define f(X,r) to be the expected regret for our protocol uses the public random string r and interacts with the multiarmed bandit X. Our analysis above tells us that X,


Therefore, if we draw i.i.d. r1,,rB,


We say that a set of random strings is bad for a bandit X if


We know that 0f(X,ri)MT. Therefore, by Hoeffding’s inequality,


In other words, for fixed X, the probability that we will draw a bad set {r1,,rB} is exponentially small. Therefore, for a family of bandits with size Q, the probability for drawing a set of r1,,rB that is bad for some bandit is at most Qexp{-2BC12KlnTMT}. If we can show that this is smaller than 1, it would follow that there exists r1,…,rB such that it is good for every bandit in the set.

Now, we consider the following family 𝒳 of bandits. For each arm, the expected reward could be a/Δ, where 0aΔ-1. The size of this family is Q=Δ-K. Now, consider any other bandit X. Apparently we can find a bandit X𝒳 such that their expected rewards are Δ-close in . Then, by applying information-theoretic arguments used in the proof for regret lower bounds [18], we can show that for any policy, their expected rewards differ by at most O(MTMTΔ2). Note that a communication protocol is just a special form a single-agent policy on a MT-length trial; so this bound applies just as well. Therefore, if Δ=(MT)-1.5, it suffices to consider bandits in 𝒳. In this case, Q=(MT)1.5K.

Therefore, we only need guarantee that BKlogTMT>CKlogMT, where C is a universal constant. This can be met by setting B=CMT(log(M)+1). In this case, the additional communication overhead is


Therefore, under our usual assumption of T>M+K, total communication is bounded by O(MlogT). ∎

Appendix C C Proof for Theorem 5


First of all, we list two lemmas that will be used in our proof.

Lemma C.1.

(Theorem 9.1 [18]) For K-armed bandits, there is an algorithm with expected regret

Lemma C.2.

(Theorem 15.2 [18]) For K-armed bandits, we can prove a minimax regret lower bound of


The original lower bound is proved for Gaussian bandits, which doesn’t fit exactly in our setting. we modified the proof to work for Bernoulli bandits, which resulted in a different constant.

We now prove the theorem’s statement via a reduction from single agent armed bandit to multi-agent armed bandit. That is, we map communication protocols to single-agent algorithms in the following way. Consider a communication protocol with communication complexity B(M). We denote Xi (i[M]) to be the indicator function for agent i’s sending or receiving a data packet throughout a run. Xi is a random variable. Since expected communication complexity is less than M/c,


Now consider the M/2 agents with smallest 𝔼Xi. Apparently all of them have 𝔼Xi2/c. That is, for any of these agents, the probability of either speaking or hearing from someone is less than 2/c. Suppose that agent j is such a agent. Then, we can map the communication protocol to a single-agent algorithm by simulating agent j.

The simulation is as follows. Facing single agent bandit with time T, we run the code for agent i in the protocol. When no communication is needed, we are fine (continue to the next line of code). When the code mentions anything related with communication (e.g. send a message, wait for a message, etc.), we terminate the code. For the rest of the timesteps, we run a single-agent optimal algorithm (the one used by lemma C.1).

Then, if agent j’s code has δ probability for involving in communication, and agent j’s regret REGj(T)A, via this reduction, we can obtain an algorithm with expected regret


By lemma 2, REG(T) cannot have a regret upperbound better than T(K-1)/75. Therefore


If 38δ<1/75, we can show that A=Ω(KT). In our case, let c=3000 will suffice. Then, since we can show this for M/2 agents, we can show that total regret is Ω(MKT). ∎