Abstract
We study the communication complexity of distributed multiarmed bandits(MAB) and distributed linear bandits for regret minimization. We proposecommunication protocols that achieve nearoptimal regret bounds and result inoptimal speedup under mild conditions. We measure the communication cost ofprotocols by the total number of communicated numbers. For multiarmed bandits,we give two protocols that require little communication cost, one isindependent of the time horizon $ T $ and the other is independent of thenumber of arms $ K $. In particular, for a distributed $K$armed bandit with$M$ agents, our protocols achieve nearoptimal regret $O(\sqrt{MKT\log T})$with $O\left(M\log T\right)$ and $O\left(MK\log M\right)$ communication costrespectively. We also propose two protocols for $d$dimensional distributedlinear bandits that achieve nearoptimal regret with $O(M^{1.5}d^3)$ and$O\left((Md+d\log\log d)\log T\right)$ communication cost respectively. Thecommunication cost can be independent of $T$, or almost linear in $d$.
Quick Read (beta)
Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret
Abstract
We study the communication complexity of distributed multiarmed bandits (MAB) and distributed linear bandits for regret minimization. We propose communication protocols that achieve nearoptimal regret bounds and result in optimal speedup under mild conditions. We measure the communication cost of protocols by the total number of communicated numbers. For multiarmed bandits, we give two protocols that require little communication cost, one is independent of the time horizon $T$ and the other is independent of the number of arms $K$. In particular, for a distributed $K$armed bandit with $M$ agents, our protocols achieve nearoptimal regret $O(\sqrt{MKT\mathrm{log}T})$ with $O\left(M\mathrm{log}T\right)$ and $O\left(MK\mathrm{log}M\right)$ communication cost respectively. We also propose two protocols for $d$dimensional distributed linear bandits that achieve nearoptimal regret with $O({M}^{1.5}{d}^{3})$ and $O\left((Md+d\mathrm{log}\mathrm{log}d)\mathrm{log}T\right)$ communication cost respectively. The communication cost can be independent of $T$, or almost linear in $d$.
ltexReferences \newcolumntype"@
1 Introduction
Bandit learning is a central topic in online learning, and has been applied to various realworld tasks including clinical trials [28], model selection [21], recommendation systems [3, 19, 2], etc. In many tasks using bandits, it may be appealing to employ more agents to learn collaboratively and concurrently in order to speed up the learning process. In some other tasks, it is only natural to consider a distributed setting; for instance, multiple geographically separated labs may be working on a same clinical trial. In such distributed applications, communication between agents is critical, but may also be expensive or timeconsuming. This motivates us to consider efficient protocols for distributed learning in bandit problems.
A straightforward communication protocol would be immediate sharing: broadcasting every new sample immediately. Under this scheme, agents can have performance close to that in a centralized setting. However in this case, the amount of communicated data is directly proportional to the total size of collected data. When the bandits is played for a long timescale, or when the size of a single sample is large, the cost of communication would render this scheme impractical. On the other hand, if agents do not communicate at all, their performance is guaranteed to be unsatisfactory. A natural question to ask is: how does the cost of communication and learning performance tradeoff?
The bandit learning problems that we consider include stochastic multiarmed bandits (MAB) and stochastic linear bandits. In the multiarmed bandits setting, the agent is given $K$arms, each of which associated with a fixed reward distribution. The goal of the agent is to obtain highest possible reward within $T$ total pulls and thereby minimize regret. The linear bandits setting is an extension of MAB. Each arm is a vector $x\in {\mathbb{R}}^{d}$, and the expected reward for arm $x$ is linear with respect to $x$. We consider regret minimization in distributed MAB as regret is arguably the most common performance measure for bandit problems. On the other hand, regret bounds can be reduced to sample complexity bounds for identifying $\u03f5$optimal arms with high probability.
In this paper, we study the distributed learning of both stochastic multiarmed bandits and stochastic linear bandits. In particular, $M$ agents will interact with the same bandit instance in a synchronized fashion. We use the worstcase total regret of $M$ agents in $T$ timesteps as the performance criterion, which we compare with that of a centralized algorithm running for $MT$ timesteps. In particular, for each task, we propose and analyze two protocols with different styles. Protocol 1 and Protocol 3 deal with distributed MAB, while Protocol 2 and Protocol 4 are designed for linear bandits. Protocol 1 and Protocol 2 are inspired by the UCB algorithm [6], and more generally, the optimism in the face of uncertainty principle. Protocol 3 and 4 are based on the phasedelimination algorithm for MAB [7] and for linear bandits [18].
We find that in both tasks, agents can achieve regret bounds that differ from the optimum by only logarithmic factors using very little communication. More specifically, for a $K$armed bandits, only $O(M\mathrm{min}\{\mathrm{log}T,K\mathrm{log}M\})$ numbers need to be communicated to achieve nearoptimal regret. For a $d$dimensional linear bandit, $O(\mathrm{min}\{{M}^{1.5}{d}^{3},Md\mathrm{log}\mathrm{log}d\cdot \mathrm{log}T\})$ numbers need to be communicated. This suggest that agents may not need to trade performance for communicationefficiency: communication is quite efficient even when regret is nearoptimal. In fact, we also provide a communication lower bound for both problems, which suggests that even if we are willing to accept a slightly worse regret, not much can be gained in communication complexity, especially in the MAB problem.
1.1 Problem Setting
Communication Model
The communication model we employ here consists of a server and several agents. The agents can communicate with the server via uploading or receiving data packets, but they are not allowed to boardcast messages directly to all other agents. We assume that the communication between the agents and the server have zero latency; that is, agents and the server can perform arbitrary amount of communication between time steps. Note that protocols in our model can be easily adopted to a serverless network, by designating an agent as the server.
The data packets communicated between the server and agents contain integers or real numbers. We define the cost of sending an integer or a real number to be $O(1)$. The communication complexity of a protocol is defined as the total cost throughout its execution (i.e. total number of communicated integers and real numbers). We expect that for our protocols, communication complexity would only differ from the total number of communicated bits by a at most logarithmic factor, as $O\left(\mathrm{log}(KMT)\right)$ (or $O\left(\mathrm{log}(dMT)\right)$) bits for each real number should provide enough precision.^{1}^{1} 1 We believe this is reasonable since the communicated real numbers are continuous functions of local variables, and are only used to evaluate continuous functions. Informally speaking, we do not communicate “unnatural” real numbers, such as encoding several numbers into one.
Distributed Multiarmed Bandits
In distributed multiarmed bandits, there are $M$ agents, labelled by $1$,…,$M$. Each agent is given access to the same stochastic $K$armed bandits instance. Each arm $i$ in the instance is related to a distribution ${\mathcal{P}}_{i}$. ${\mathcal{P}}_{i}$ is supported on $[0,1]$ with mean $\mu (i)$. At any time step $t=1,2,3,\mathrm{\dots},T$, each agent chooses an arm ${a}_{t,i}$ and pulls this arm, then he would immediately obtain a reward ${r}_{t,i}$ independently sampled from ${\mathcal{P}}_{{a}_{t,i}}$. The goal of each agent is to minimize their cumulative regret, which is defined as
$$REG(T)=\sum _{t=1}^{T}\sum _{i=1}^{M}\left(\underset{j}{\mathrm{max}}\mu (j)\mu ({a}_{i,t})\right).$$ 
Distributed Linear Bandits
In distributed linear bandits, the agents are given access to the same $d$dimension stochastic linear bandits instance. In particular, agents are given the same action set $\mathcal{D}\subseteq {\mathbb{R}}^{d}$ from time to time. At time step $t$, agent $i$ may choose action ${x}_{t,i}\in \mathcal{D}$ and observe reward ${y}_{t,i}$. We assume that the mean of his reward is decided by an unknown parameter ${\theta}^{*}\in {\mathbb{R}}^{d}$: ${y}_{t,i}={x}_{t,i}^{T}{\theta}^{*}+{\eta}_{t,i}$, where ${\eta}_{t,i}\in [1,1]$ have zero mean and are independent with each other. ${\theta}^{*}$ is the same for all agents. We assume ${\parallel {\theta}^{*}\parallel}_{2}\le S$. For distributed linear bandits, cumulative regret is defined as the sum of individual agent’s regrets:
$$REG(T)=\sum _{t=1}^{T}\sum _{i=1}^{M}\left(\underset{x\in \mathcal{D}}{\mathrm{max}}{x}^{T}{\theta}^{*}{x}_{t,i}^{T}{\theta}^{*}\right).$$ 
In both distributed multiarmed bandits and distributed linear bandits task, our goal is to achieve an upper bound on the minimax regret that is close to that of a centralized algorithm (which is clearly the optimum) using as little communication as possible.
We are mainly interested in the case where $T$ is the dominant factor (compared to $M$). Unless otherwise stated, we assume that $T>M\mathrm{log}M+K$ in the multiarmed bandits case and that $T>M$ in the linear bandits case.
1.2 Our Results
We now briefly discuss our protocols and our algorithmic results. We wish to achieve nearoptimal regret guarantee with as little communication as possible. We consider a naive baseline solution first, namely immediate sharing: each agent broadcasts the taken arm and the reward immediately. This solution achieves nearoptimal regret in both the MAB and the linear bandits task ($\stackrel{~}{O}(\sqrt{MKT})$ and $\stackrel{~}{O}(d\sqrt{MT})$ respectively), at the cost of high communication complexities ($O\left(MT\right)$ and $O\left(MTd\right)$ respectively). Our goal is to achieve nearoptimal regret with low communication complexity: $\stackrel{~}{O}(\sqrt{MKT})$ regret in distributed MAB and $\stackrel{~}{O}\left(d\sqrt{MT}\right)$ regret in distributed linear bandits. ^{2}^{2} 2 Another trivial protocol is to run independent single agent algorithms with no communication at all. This approach has regret linear in $M$ ($\mathrm{\Omega}\left(M\sqrt{TK}\right)$ and $\mathrm{\Omega}\left(dM\sqrt{T}\right)$ for MAB and linear bandits respectively).
We summarize our results in Table 1, that compares different algorithms in terms of regret and communication. Our protocols can be separated into two categories: optimismbased protocols and eliminationbased protocols.
Setting  Algorithm  Regret  Communication 

Multiarmed bandits  Immediate Sharing  $O\left(\sqrt{MKT\mathrm{log}T}\right)$  $O(MT)$ 
Optimism (Sec. 3.1)  $O\left(\sqrt{MKT\mathrm{log}T}\right)$  $O(MK\mathrm{log}M)$  
Elimination (Sec. 4.1)  $O\left(\sqrt{MKT\mathrm{log}T}\right)$  $O(M\mathrm{log}T)$  
Lower bound (Sec. 5.1)^{3}^{3} 3 Here, by “$o\left(M\sqrt{KT}\right)$ regret and $\mathrm{\Omega}(M)$ communication”, we mean that for some fixed $c$, if communication complexity is less than $M/c$, total regret is necessarily $\mathrm{\Omega}\left(M\sqrt{KT}\right)$. See Theorem 5 for details.  $o\left(M\sqrt{KT}\right)$  $\mathrm{\Omega}(M)$  
Linear bandits  Immediate Sharing  $O\left(d\sqrt{MT\mathrm{log}T}\right)$  $O(MTd)$ 
Optimism (Sec. 3.2)  $O\left(d\sqrt{MT}{\mathrm{log}}^{2}T\right)$  $O\left({M}^{1.5}{d}^{3}\right)$  
Elimination (Sec. 4.2)  $O\left(d\sqrt{TM\mathrm{log}T}\right)$  $O\left(\left(Md+d\mathrm{log}\mathrm{log}d\right)\mathrm{log}T\right)$  
Optimismbased Protocols:
The optimism in the face of uncertainty principle is widely implemented in the design of online learning algorithm. UCB algorithm [6] for multiarmed bandits and OFUL algorithm [1] for linear bandits are two celebrated implementations. We extend these algorithms to distributed setting, and achieve near optimal regret for both multiarmed bandits and linear bandits. A surprising result is that our communication complexity in both tasks is independent of the time horizon $T$.
Eliminationbased Protocols:
Phased elimination algorithms [18] is another category of methods for best arm identification and regret minimization problem. They run in episodes (each episode consists of multiple time steps): In each episode, the agent pulls each arm from the living action set (i.e. actions that are not yet eliminated) for even times, and eliminate actions with low estimated rewards at the end of the episode. We extend phased elimination algorithm to distributed setting. Though with additional $\mathrm{log}T$ factor, the dependence of other parameters ($M,K,d$) in communication complexity is better than that of optimismbased protocols in both the MAB setting and the linear bandits setting. In particular, in the MAB setting, the dependence on $K$ is removed.
Lower Bound for Multiarmed Bandits:
We also provide a lower bound of communication complexity in order to achieve nontrivial regret (i.e. $o\left(M\sqrt{KT}\right)$) regret for distributed MAB. We show that an expected communication complexity of $\mathrm{\Omega}(M)$ is necessary. This matches the communication upper bound of Protocol 3 except for a $\mathrm{log}T$ factor.
2 Related Work
Multiarmed Bandits
In multiarmed bandits, there are two main categories of algorithms: regret minimization and pure exploration. In the regret minimization setting [6, 9], the agent aims to maximize its cumulative reward in $T$ steps (or, equivalently, minimize regret), for which it is necessary to balance exploration and exploitation. For $K$armed bandits, known algorithms have regret $O\left(\sqrt{KT}\right)$ [4], which matches the lower bound. In the literature of pure exploration [5, 15], the goal of the agent is to identify the best arm (or $\u03f5$optimal arms) with high probability using as few samples as possible.
Linear Bandits
A natural extension of multiarmed bandits is linear bandits. The action set may be either stationary [13] or timedependent [1, 12, 20]. When the action set has size $K$, [20] proved a regret upper bound of $poly(\mathrm{log}\mathrm{log}KT)O\left(dT\mathrm{log}T\mathrm{log}K\right)$, and a lower bound of $\mathrm{\Omega}\left(dT\mathrm{log}T\mathrm{log}K\right)$, assuming $K\le {2}^{d/2}$. The authors of [1] considered the case where the action set may be infinite; they proved a regret upperbound of $O\left(d\mathrm{log}T\sqrt{T}\right)$; a corresponding regret lowerbound is $\mathrm{\Omega}\left(d\sqrt{T\mathrm{log}T}\right)$, due to [20].
Distributed Bandits
Multiagent bandit problems of various kinds have been considered in previous works [17, 14, 11, 26, 10, 25]; most of them have a setting different from ours. [14] considers a best arm identification model in multiagent multiarmed bandits problem. They proposed a phasedelimination algorithm which achieves nearoptimal sample complexity to identify $\u03f5$optimal arms using communication logarithmic in $1/\u03f5$. [25] employs a P2P network for communication, in which an agent can communicate with only two other agents at a time and their $\u03f5$greedy based strategy incurs linear communication cost with time horizon $T$. [10] also proposed an $\u03f5$greedy based policy with communication linear in $\mathrm{log}T/{\mathrm{\Delta}}^{2}$ where $\mathrm{\Delta}$ is the optimality gap (i.e. the difference between the mean reward of the best arm and the mean reward of the second best one). They claimed that their protocol achieves nearoptimal asymptotic regret bounds.
Our work differ from most previous works in two important ways. First, we measure the performance of protocols by worstcase regret; in contrast, many previous works [11, 10] consider asymptotic regret bounds, while others [14, 26] consider pure exploration. Second, we employ a communication model with a central server, while communications in previous works are done via broadcasting [10, 26, 14]. The server could coordinate the exploration of agents based on all the information he received. However, it should be noted that a physical server in the network is not necessary, since an arbitrary agent can be designated as the server.
We noticed there is a concurrent work of Tao et al. [26] which considered distributed multiarmed bandits in a pure exploration setting. They considered two variants of best arm identification: fixedtime (maximize the probability to find the best arm when the time budget is limited) and fixedconfidence (minimize the time steps to find the best arm under fixed confidence). Another difference lies in their messagepassing model: in each communication round, an agent is allowed to broadcast one message. The number of communication rounds is then considered as the measure for communication cost.
As a comparison, although it’s not clear how their fixedtime protocol can be related to ours, we note that our protocols for regret minimization can be used to identify $\u03f5$optimal arms in their fixedconfidence setting. If we set $\mathrm{\Delta}=\u03f5$, this becomes best arm identification. If we run our algorithms for $T=\stackrel{~}{O}(1/{\mathrm{\Delta}}^{2})$ steps, we are able to identify the best arm with high probability and optimal speedup (under mild conditions ^{4}^{4} 4 e.g. all the agents runs almost equally fast). In particular, our eliminationbased protocol gives a distributed best arm identification algorithm with $O(\mathrm{log}T)=\stackrel{~}{O}(\mathrm{log}(1/\mathrm{\Delta}))$ rounds of communication, which is comparable to their results since their protocol also requires $O(\mathrm{log}(1/\mathrm{\Delta}))$ rounds of communication. Further, our optimismbased protocol induces an algorithm with constant communication cost (i.e. independent of $\mathrm{\Delta}$).
3 Optimismbased Protocols
In this section, we propose our optimismbased protocols for multiarmed bandits and linear bandits. Our protocol for multiarmed bandits is based on UCB algorithm. For linear bandits, we extend OFUL algorithm [1] to distributed case.
3.1 A UCBbased Protocol for Multiarmed Bandits
In classic multiarmed bandit problems, upper confidence bound (UCB) algorithms are very efficient in solving regret minimization problem. We show that a natural extension of UCB works well on distributed MAB. In particular, this UCBbased protocol achieves nearoptimal regret bound $O(\sqrt{MKT\mathrm{log}T})$ with communication complexity $O(MK\mathrm{log}M)$–a surprising result that communication complexity is independent of the time horizon $T$.
3.1.1 Protocol Sketch
We first take a glance at the UCB algorithm in singleagent setting [6]. The reason why this algorithm works well is that the agent form tighter estimates about the mean reward of each arm as he obtains more samples about that arm. The regret incurred by pulling an arm is bounded by the level of confidence of that arm. In particular, if the agent has $N$ independent samples for some arm, then he can form a high confidence interval of length $\stackrel{~}{O}\left(\sqrt{1/N}\right)$ for it, so he can safely pull this arm for another $N$ times without any additional information, without hurting the regret too much.
We take use of the same idea in distributed setting: the server doesn’t have to initiate a communication about an arm until the total number of pulls for this arm among all agents is doubled. However, the problem is that the server cannot keep track of each agent online in order to know the actual pulls of this arm, since the communication complexity will be linear in $MT$ if so. Therefore, we propose a protocol to enable the server keeping track of the lower bounds of the actual pulls using much less communication.
Concretely, for a fixed arm $k$, suppose now arm $k$ has been pulled for $N$ times among all agents, so the server needs to initiate a communication after arm $k$ is pulled for another $N$ times. Let $l:=\lceil {\mathrm{log}}_{2}(N/M)\rceil ,r:=\lceil {\mathrm{log}}_{2}N\rceil $, any agent will send a short message to the server (constant bits) whenever he has pulled arm $k$ for another ${2}^{i}(l\le i\le r)$ times. The server can thereby maintain a lower bound on the number of pulls for arm $k$, and initiates a communication as long as this lower bound exceeds $N$. On the other hand, the total number of pulls will also be $O(N)$ when the communication starts if the agents communicate in this way. It’s easy to observe that the total communication cost in this communication round is at most $O(M(rl))=O(M\mathrm{log}M)$, and further more, we can show that the communication in this episode is actually $O(M)$. We will show afterwards that the total communication rounds will be $O(\mathrm{log}M)$, so the total communication complexity will be $O(MK\mathrm{log}M)$ since there are $K$ arms.
3.1.2 Protocol Description
First of all, we will introduce some notations.

•
${N}^{p}(k)$: the number of times arm $k$ is actually taken by agents that has been shared before episode $p$.

•
${v}^{p}(k)$: empirical average of the ${N}^{p}(k)$ observations.

•
${N}_{i}(k)$: the number of times $k$ is taken not yet forwarded to the server by agent $i$.

•
${v}_{i}(k)$: empirical average corresponding to the ${N}_{i}(k)$ observations.
We use superscript $(p)$ to denote the value of a variable at the $p$th episode. Define
$$UCB(k)=\frac{{N}^{p}(k){v}^{p}(k)+{N}_{i}(k){v}_{i}(k)}{{N}^{p}(k)+{N}_{i}(k)}+\sqrt{\frac{16\mathrm{log}T}{{N}^{p}(k)+{N}_{i}(k)}}$$ 
We define
$${N}_{max}(p)=\{\begin{array}{cc}D\hfill & p=0\hfill \\ D\cdot {2}^{p1}\hfill & p\ge 1\hfill \end{array}$$ 
For arm $k$ in episode ${p}_{k}$, if the server finds that the total number of pulls among all agents in this episode is equal to or greater than ${N}_{max}({p}_{k})$, then all agents synchronize information of arm $k$ and ${p}_{k}={p}_{k}+1$. $D$ is a parameter that will be determined in the analysis. {algorithm}[H] \[email protected]@algorithmic[5] \For$k=1,\mathrm{\dots},K$ \Stateset ${p}_{k}=0$ \Stateset ${N}_{i}(k)=0$ for $i=1,\mathrm{\dots},M$ \EndFor\WhileTrue \StateReceive ${N}_{i}(k)$ from agent i. \StateCalculate the sum of ${N}_{i}(k)$ for all agents as $\widehat{N}(k)$ \If$\widehat{N}(k)\ge {N}_{max}({p}_{k})$ \StateSend signal $k$ to agents, \State${p}_{k}={p}_{k}+1$ \WhileNot all agents send their local estimates to the server \StateUpon receiving $(k,{N}_{i}(k),{V}_{i}(k))$, update ${N}^{{p}_{k}}(k)$ and ${v}^{{p}_{k}}(k)$ accordingly \EndWhile\Statesend ${N}^{{p}_{k}}(k)$ and ${v}^{{p}_{k}}(k)$ to agents. \EndIf\EndWhile
[H]
\[email protected]@algorithmic[5] \For$k=1,\mathrm{\dots},K$ \Stateset ${p}_{k}=0$ \EndFor\WhileTrue \StateTake action $k$ with largest $UCB(k)$ \StateObserve reward, update ${N}_{i}(k)$ and ${v}_{i}(k)$ \If${N}_{i}(k)\in \{{2}^{j}:j\in \mathbb{N}\}$ and ${N}_{i}(k)\ge {N}_{max}({p}_{k})/M$ for some $k$ \StateSend ${N}_{i}(k)$ to server \EndIf\Ifreceive signal $(k)$ from server \StateSend $(k,{N}_{i}(k),{v}_{i}(k))$ to server, set ${N}_{i}(k)$ and ${v}_{i}(k)$ to $0$. \State${p}_{k}={p}_{k}+1$ \StateUpon receiving $(k,N(k),V(k))$, update local ${N}^{{p}_{k}}(k)$ and ${v}^{{p}_{k}}(k)$ accordingly \EndIf\EndWhile
3.1.3 Analysis
We will state some useful lemmas first.
Lemma 3.1.
For any $p\mathrm{\ge}\mathrm{0}$, and any arm $k\mathrm{\in}\mathrm{[}K\mathrm{]}$,
$${N}_{max}(p)\le {N}^{p}(k)$$ 
and for any $p\mathrm{\ge}\mathrm{1}$,
$${N}^{p}(k){N}^{p1}(k)\le 6{N}_{max}(p1)$$ 
Therefore,
$${N}^{p}(k)\le 6{N}_{max}(p)$$ 
Proof.
Note that ${N}^{p}(k){N}^{p1}(k)$ is the actual number of pulls for arm $k$ among all agents in episode $p1$ (the episodes are numbered from 0). It’s easy to observe ${N}^{p}(k){N}^{p1}(k)\ge {N}_{max}(p1)$ by our protocol, so ${N}^{p}(k)\ge {\sum}_{i=1}^{p1}{N}_{max}(i)={N}_{max}(p)$.
To upper bound ${N}^{p}(k){N}^{p1}(k)$, we divide the agents into two groups:
The first group contains agents that do not send any notifications to the server before the server tells him to communicate in episode $p1$. Let ${P}_{0}$ be the total number of pulls for arm $k$ by the first group. Obviously each agent in the first group pulls arm $k$ for at most $2{N}_{max}(p1)/M$ times, so ${P}_{0}\le 2{N}_{max}(p1)$. The second group, on the other hand, contains agents that send messages to the server at least once. Let ${P}_{1}$ be the total number of pulls for arm $k$ by this group. Consider the last communication in episode $p1$ before the server decides to initiate a communication round. Denote the value of $\widehat{N}(k)$ before this communication by ${C}_{0}$, and the value of $\widehat{N}(k)$ after this communication by ${C}_{1}$. Observe that $$ and ${P}_{1}\le 2{C}_{1}$. Therefore, ${P}_{1}\le 4{N}_{max}(p1)$.
Finally, we have ${N}^{p}(k){N}^{p1}(k)={P}_{0}+{P}_{1}\le 6{N}_{max}(p1)$, which implies ${N}^{p}(k)\le 6{N}_{max}(p1)$.
∎
Using AzumaHoeffding’s inequality and union bound, we can prove the following lemma.
Lemma 3.2.
Throughout the algorithm, if there are $m$ samples for arm $k$, with probability at least $\mathrm{1}\mathrm{}M\mathrm{/}{T}^{\mathrm{3}}$,
$$ 
where $c$ is some constant.
Furthermore, we can bound the communication cost in a single episode.
Lemma 3.3.
The communication complexity in one episode is $O\mathit{}\mathrm{(}M\mathrm{)}$.
Proof.
For any arm $k$, suppose the server needs to initiate a communication after this arm get pulled for another $N$ times. It’s easy to observe that the total communication in updating ${N}^{p}(k)$ and ${v}^{p}(k)$ is $O(M)$. Now we turn to the communication used for the server to keep track of $\widehat{N}(k)$.
Let $l:=\lceil {\mathrm{log}}_{2}(N/M)\rceil ,r:=\lceil {\mathrm{log}}_{2}N\rceil $, then the agents will send a notification to the server if he has pulled arm $k$ for ${2}^{i}$ times where $i=l,l+1,\mathrm{\dots},r$. The server maintains a counter $\widehat{N}(k)$, which calculates the total number of pulls that the agents have reported to him. If server finds that $\widehat{N}(k)\ge N$, he will initiate a communication.
If the server received a notification claiming some agent has pulled arm $k$ for ${2}^{i}$ times, then $\widehat{N}(k)$ will increase by ${2}^{i1}$ (if $i>l$) or ${2}^{i}$ (if $i=l$). This means whenever the server received a message, $\widehat{N}(k)$ will increase by at least ${2}^{l}$. Therefore, the communication cost in this episode is at most $N/{2}^{l}+O(M)=O(M)$. ∎
Theorem 1.
We can achieve expected regret $O\mathit{}\mathrm{\left(}\sqrt{K\mathit{}M\mathit{}T\mathit{}\mathrm{log}\mathit{}T}\mathrm{\right)}$ with communication complexity $O\mathit{}\mathrm{(}M\mathit{}K\mathit{}\mathrm{log}\mathit{}M\mathrm{)}$.
Proof.
Regret: Let ${N}_{k}$ be the total number of pulls for arm $k$ in $MT$ steps. We use $\stackrel{~}{L}$ to hide constants and $\mathrm{log}T$.
The expected regret caused by the failure of lemma 3.2 is at most $MT\cdot 1/(M{T}^{3})=O(1)$, thus we assume lemma 3.2 holds throughout our algorithm.
We divide $REG(T)$ into two parts $REG(T)=RE{G}_{1}(T)+RE{G}_{2}(T)$. $RE{G}_{1}(T)$ denotes the regret incurred by all the arms when they are in episode $0$, $RE{G}_{2}(T)$ denotes the regret incurred in episodes $>0$.
Observe that for any arm, the regret incurred in episode 0 is at most $O(\sqrt{MD})$. The worst case is that the $O(D)$ pulls (at most $6D$ by lemma 3.1) in episode $0$ are distributed evenly among all agents. Thus, $RE{G}_{1}(T)\le O(K\sqrt{MD})$.
For any arm $k$ with ${N}_{k}>D$, the total number of episodes for this arm will be $\lceil {\mathrm{log}}_{2}({N}_{k}/D)\rceil $. According to our upper confidence bound nature, we can bound $RE{G}_{2}(T)$ as follows:
$RE{T}_{2}(T)$  $\le {\displaystyle \sum _{k=1}^{K}}{\displaystyle \sum _{p=1}^{\lceil {\mathrm{log}}_{2}({N}_{k}/D)\rceil}}\sqrt{{\displaystyle \frac{\stackrel{~}{L}}{{N}^{p}(k)}}}({N}^{p+1}(k){N}^{p}(k))$  
$=O({\displaystyle \sum _{k=1}^{K}}{\displaystyle \sum _{p=1}^{\lceil {\mathrm{log}}_{2}({N}_{k}/D)\rceil}}\sqrt{D\cdot {2}^{p}\stackrel{~}{L}})$  
$=O({\displaystyle \sum _{k=1}^{K}}\sqrt{{N}_{k}\mathrm{log}T})$  
$\le O(\sqrt{KMT\mathrm{log}T)}$ 
The first equality is by lemma 3.1.
Therefore, we have
$$REG(T)=RE{G}_{1}(T)+RE{G}_{2}(T)\le O(K\sqrt{MD})+O(\sqrt{KMT\mathrm{log}T)}$$ 
Choose $D=T/K$. Regret is $O(\sqrt{KMT\mathrm{log}T})$.
Communication: Using lemma 3.3, communication complexity can be bounded by
$O(M{\displaystyle \sum _{k=1}^{K}}\mathrm{log}{\displaystyle \frac{{N}_{k}}{D}})$  $\le O(MK\mathrm{log}({\displaystyle \sum _{k}}{N}_{k}/KD))$  
$=O(MK\mathrm{log}M)$ 
∎
3.2 An OFULbased Protocol for Linear Bandits
In this subsection, we propose a distributed algorithm based on OFUL algorithm [1]. We show that our protocol can achieve $O(d\sqrt{MT}{\mathrm{log}}^{2}\left(T\right))$ regret with communication complexity $O({M}^{1.5}{d}^{3})$. Similar to that of our UCBbased protocol for multiarmed bandits, the communication complexity of this protocol is independent of the time horizon $T$. Note that though action set $\mathcal{D}$ does not change with time in our basic setting, our results can be directly applied to the case with timevariant action set ${\mathcal{D}}_{t}$.
3.2.1 Protocol Sketch
{algorithm}[H]
[1] \For$t:=1,2,\mathrm{\dots}$ \State$({x}_{t},{\stackrel{~}{\theta}}_{t})={\mathrm{argmax}}_{(x,\theta )\in \mathcal{D}\times {\mathcal{C}}_{t1}}\u27e8x,\theta \u27e9$ \StatePlay ${x}_{t}$ and observe reward ${y}_{t}$ \Stateupdate ${\mathcal{C}}_{t}$ \EndFor
To begin with, we review the OFUL algorithm [1], which is based on the optimism in the face of uncertainty (OFU) principle. The core idea is to maintain a confidence set ${\mathcal{C}}_{t1}\subseteq {\mathbb{R}}^{d}$ for the parameter ${\theta}^{*}$. In each step, the algorithm chooses an optimistic estimate ${\stackrel{~}{\theta}}_{t}={\mathrm{argmax}}_{\theta \in {\mathcal{C}}_{t1}}\left({\mathrm{max}}_{x\in \mathcal{D}}\u27e8x,\theta \u27e9\right)$ and then chooses action ${X}_{t}={\mathrm{argmax}}_{x\in \mathcal{D}}\u27e8x,{\stackrel{~}{\theta}}_{t}\u27e9$, which maximizes the reward according to the estimate ${\stackrel{~}{\theta}}_{t}$. ${\mathcal{C}}_{t}$ is calculated from previous actions (${x}_{1},{x}_{2},\mathrm{\dots},{x}_{t}$) and rewards (${y}_{1},{y}_{2},\mathrm{\dots},{y}_{t}$):
$${\mathcal{C}}_{t}=\{\theta \in {\mathbb{R}}^{d}:{\parallel \widehat{\theta}\theta \parallel}_{{\overline{V}}_{t}}\le \sqrt{2\mathrm{log}\left(\frac{\mathrm{det}{\left({\overline{V}}_{t}\right)}^{1/2}\mathrm{det}{(\lambda I)}^{1/2}}{\delta}\right)}+{\lambda}^{1/2}S\},$$  (1) 
where ${\overline{V}}_{t}=\lambda I+{\sum}_{\tau =1}^{t}{x}_{\tau}{x}_{\tau}^{T}$ and $\widehat{\theta}={\left(\lambda I+{\sum}_{\tau =1}^{t}{x}_{\tau}{x}_{\tau}^{T}\right)}^{1}{\sum}_{\tau =1}^{t}{x}_{\tau}{y}_{\tau}$.
Our observation is that the volume of the confidence ellipsoid depends on $det({\overline{V}}_{t})$. If $det({\overline{V}}_{t})$ does not vary greatly, it will not influence the confidence guarantee even if the confidence ellipsoid is not updated. We divide all steps into several epochs and only synchronize samples at the end of each epoch. As $det({\overline{V}}_{T})$ is upper bounded, the number of those bad epochs, in which $det({\overline{V}}_{t})$ varies greatly, is limited. This inspires our analysis. Meanwhile, in our algorithm, we also use $det({\overline{V}}_{t})$ as part of our synchronization condition.
3.2.2 Protocol Description
In our distributed algorithm, each agent uses all samples available for him to compute the confidence ellipsoid. These samples are composed of two parts: samples obtained before last synchronization and samples that haven’t been uploaded. Once the condition of synchronization is satisfied for at least one agent, all agents send data to the server, and the server returns processed data.
We denote ${\sum}_{\tau}{x}_{\tau}{x}_{\tau}^{T}$ and ${\sum}_{\tau}{x}_{\tau}{y}_{\tau}$ as $W$ and $U$ in our algorithm respectively. For the sake of communication and computation efficiency, we only transmit $W$ and $U$ in communication rounds.
[H]
\[email protected]@algorithmic[5] \State${W}_{syn}=0$, ${U}_{syn}=0$ \State${t}_{last}=0$ \WhileTrue \StateWait until receiving synchronization signal from an agent \StateSend uploading signals to all agents \StateReceive ${W}_{new},{U}_{new}$ from agent $i$ as ${W}_{new,i},{U}_{new,i}$. \StateCalculate ${W}_{syn}={W}_{syn}+{\sum}_{i=1}^{M}{W}_{new,i}$, ${U}_{syn}={U}_{syn}+{\sum}_{i=1}^{M}{U}_{new,i}$ \StateSend ${W}_{syn}$ and ${U}_{syn}$ to all agents. \EndWhile
[H]
\[email protected]@algorithmic[5] \State${W}_{syn}=0$, ${U}_{syn}=0$ \State${W}_{new}=0$, ${U}_{new}=0$ \State${t}_{last}=0$, ${V}_{last}=\lambda I$ \For$t=1,\mathrm{\dots},T$ \StateComputes the confidence ellipsoid using Eq.1 based on ${W}_{syn}+{W}_{new}$ and ${U}_{syn}+{U}_{new}$. Denote his results as ${V}_{t,i}$, ${\widehat{\theta}}_{t,i}$ and ${\mathcal{C}}_{t,i}$ \State$({x}_{t,i},\stackrel{~}{\theta})=\mathrm{arg}{\mathrm{max}}_{(x,\theta )\in \mathcal{D}\times \mathcal{C}}\u27e8x,\theta \u27e9$ \Stateagent $i$ plays ${x}_{t,i}$ and gets the result ${y}_{t,i}$ \State${W}_{new}={W}_{new}+{x}_{t,i}{x}_{t,i}^{T}$, ${U}_{new}={U}_{new}+{x}_{t,i}{y}_{t,i}$. \If$\mathrm{log}\left(det{V}_{t,i}/det{V}_{last}\right)\cdot (t{t}_{last})>D$ \StateSend a synchronization signal to the server to initiate a communication round. \EndIf\Ifreceiving uploading signal from the server \StateUpload ${W}_{new},{U}_{new}$ \StateReceive ${W}_{syn},{U}_{syn}$ from the server \State${W}_{new}=0$, ${U}_{new}=0$ \State${t}_{last}=t$, ${V}_{last}=\lambda I+{W}_{syn}$ \EndIf\EndFor $D$ is a parameter that will be determined in our analysis.
3.2.3 Analysis
First of all, we state lemmas from previous results that will be useful in our proof.
Lemma 3.4.
(Theorem 2 in [1]) With high probability, ${\theta}^{\mathrm{*}}$ always lies in the constructed ${\mathrm{C}}_{t\mathrm{,}i}$ for all $t$ and all $i$.
Lemma 3.5.
(Lemma 11 in [1]) Let ${\mathrm{\left\{}{X}_{t}\mathrm{\right\}}}_{t\mathrm{=}\mathrm{1}}^{\mathrm{\infty}}$ be a sequence in ${\mathrm{R}}^{d}$, $V$ is a $d\mathrm{\times}d$ positive definite matrix and define ${\overline{V}}_{t}\mathrm{=}V\mathrm{+}{\mathrm{\sum}}_{s\mathrm{=}\mathrm{1}}^{t}{X}_{s}\mathit{}{X}_{s}^{\mathrm{\top}}$. Then we have that
$$\mathrm{log}\left(\frac{\mathrm{det}\left({\overline{V}}_{n}\right)}{\mathrm{det}(V)}\right)\le \sum _{t=1}^{n}{\parallel {X}_{t}\parallel}_{{\overline{V}}_{t1}^{1}}^{2}$$ 
. Further, if ${\mathrm{\parallel}{X}_{t}\mathrm{\parallel}}_{\mathrm{2}}\mathrm{\le}L$ for all $t$, then ${\mathrm{\sum}}_{t\mathrm{=}\mathrm{1}}^{n}\mathrm{min}\mathit{}\mathrm{\{}\mathrm{1}\mathrm{,}{\mathrm{\parallel}{X}_{t}\mathrm{\parallel}}_{{\overline{V}}_{t\mathrm{}\mathrm{1}}^{\mathrm{}\mathrm{1}}}^{\mathrm{2}}\mathrm{\}}\mathrm{\le}\mathrm{2}\mathit{}\mathrm{\left(}\mathrm{log}\mathit{}\mathrm{det}\mathit{}\mathrm{\left(}{\overline{V}}_{n}\mathrm{\right)}\mathrm{}\mathrm{log}\mathit{}\mathrm{det}\mathit{}V\mathrm{\right)}\mathrm{\le}\mathrm{2}\mathit{}\mathrm{\left(}d\mathit{}\mathrm{log}\mathit{}\mathrm{\left(}\mathrm{\left(}\mathrm{trace}\mathit{}\mathrm{(}V\mathrm{)}\mathrm{+}n\mathit{}{L}^{\mathrm{2}}\mathrm{\right)}\mathrm{/}d\mathrm{\right)}\mathrm{}\mathrm{log}\mathit{}\mathrm{det}\mathit{}V\mathrm{\right)}$, and finally, if ${\lambda}_{\mathrm{min}}\mathit{}\mathrm{(}V\mathrm{)}\mathrm{\ge}\mathrm{max}\mathit{}\mathrm{(}\mathrm{1}\mathrm{,}{L}^{\mathrm{2}}\mathrm{)}$ then
$$\sum _{t=1}^{n}{\parallel {X}_{t}\parallel}_{{\overline{V}}_{t1}^{1}}^{2}\le 2\mathrm{log}\frac{\mathrm{det}\left({\overline{V}}_{n}\right)}{\mathrm{det}(V)}$$ 
Using Lemma 3.4, we can bound single step pseudoregret ${r}_{t,i}$.
Lemma 3.6.
With high probability, single step pseudoregret ${r}_{t\mathrm{,}i}\mathrm{=}\mathrm{\u27e8}{\theta}^{\mathrm{*}}\mathrm{,}{x}^{\mathrm{*}}\mathrm{}{x}_{t\mathrm{,}i}\mathrm{\u27e9}$ is bounded by
$${r}_{t,i}\le 2\left(\sqrt{2\mathrm{log}\left(\frac{det{({\overline{V}}_{t,i})}^{1/2}det{(\lambda I)}^{1/2}}{\delta}\right)}+{\lambda}^{1/2}S\right){\parallel {x}_{t,i}\parallel}_{{\widehat{V}}_{t,i}^{1}}=O\left(\sqrt{d\mathrm{log}\frac{T}{\delta}}\right){\parallel {x}_{t,i}\parallel}_{{\overline{V}}_{t,i}^{1}}.$$ 
Proof.
Assuming ${\theta}^{*}\in {\mathcal{C}}_{t,i}$,
${r}_{t,i}$  $=\u27e8{\theta}^{*},{x}^{*}\u27e9\u27e8{\theta}^{*},{x}_{t,i}\u27e9$  
$\le \u27e8\stackrel{~}{\theta},{x}_{t,i}\u27e9\u27e8{\theta}^{*},{x}_{t,i}\u27e9$  
$=\u27e8\stackrel{~}{\theta}{\theta}^{*},{x}_{t,i}\u27e9$  
$=\u27e8\stackrel{~}{\theta}{\widehat{\theta}}_{t,i},{x}_{t,i}\u27e9+\u27e8{\widehat{\theta}}_{t,i}{\theta}^{*},{x}_{t,i}\u27e9$  
$\le {\parallel \stackrel{~}{\theta}{\widehat{\theta}}_{t,i}\parallel}_{{\overline{V}}_{t,i}}{\parallel {x}_{t,i}\parallel}_{{\overline{V}}_{t,i}^{1}}+{\parallel {\widehat{\theta}}_{t,i}{\theta}^{*}\parallel}_{{\overline{V}}_{t,i}}{\parallel {x}_{t,i}\parallel}_{{\overline{V}}_{t,i}^{1}}$  
$\le 2\left(\sqrt{2\mathrm{log}\left({\displaystyle \frac{det{({\overline{V}}_{t,i})}^{1/2}det{(\lambda I)}^{1/2}}{\delta}}\right)}+{\lambda}^{1/2}S\right){\parallel {x}_{t,i}\parallel}_{{\overline{V}}_{t,i}^{1}}$  
$=O\left(\sqrt{d\mathrm{log}{\displaystyle \frac{T}{\delta}}}\right){\parallel {x}_{t,i}\parallel}_{{\overline{V}}_{t,i}^{1}}.$ 
∎
Theorem 2.
Our protocol can achieve a regret of $O\mathit{}\mathrm{\left(}d\mathit{}\sqrt{M\mathit{}T}\mathit{}{\mathrm{log}}^{\mathrm{2}}\mathit{}\mathrm{(}T\mathrm{)}\mathrm{\right)}$ with $O\mathit{}\mathrm{\left(}{M}^{\mathrm{1.5}}\mathit{}{d}^{\mathrm{3}}\mathrm{\right)}$ communication complexity.
Proof.
Regret: In our protocol, there will be a number of epochs divided by communication rounds. We denote ${V}_{last}$ in epoch $p$ as ${V}_{p}$. Suppose that there are $P$ epochs, then ${V}_{P}$ will be the matrix with all samples included.
Observe that $det{V}_{0}=det(\lambda I)={\lambda}^{d}$. $det({V}_{P})\le {\left(\frac{tr({V}_{P})}{d}\right)}^{d}\le {\left(\lambda +MTL/d\right)}^{d}$. Therefore
$$\mathrm{log}\frac{det({V}_{P})}{det({V}_{0})}\le d\mathrm{log}\left(1+\frac{MTL}{\lambda d}\right).$$ 
Let $R=\lceil d\mathrm{log}\left(1+\frac{MTL}{\lambda d}\right)\rceil $. It follows that for all but $R$ epochs, we have
$$1\le \frac{det{V}_{j+1}}{det{V}_{j}}\le 2.$$ 
We call these epochs good epochs. In these epochs, we can use the argument for theorem 4 in [1]. First, we imagine the $MT$ pulls are all made by one agent in a roundrobin fashion. I.e., he takes ${x}_{1,1}$, ${x}_{1,2}$, …, ${x}_{1,M}$, ${x}_{2,1}$, …, ${x}_{2,M}$, …, ${x}_{T,M}$. We use ${\stackrel{~}{V}}_{t,i}$ to denote the $\overline{V}$ this imaginary agent calculates when he gets to ${x}_{t,i}$. If ${x}_{t,i}$ is in one of those good epochs(say the $p$th epoch), then we can see that
$$1\le \frac{det{\stackrel{~}{V}}_{t,i}}{det{\overline{V}}_{t,i}}\le \frac{det{V}_{j}}{det{V}_{j1}}\le 2.$$ 
Therefore
${r}_{t,i}$  $\le O\left(\sqrt{d\mathrm{log}{\displaystyle \frac{T}{\delta}}}\right)\sqrt{{x}_{t,i}^{T}{\overline{V}}_{t,i}^{1}{x}_{t,i}}$  
$\le O\left(\sqrt{d\mathrm{log}{\displaystyle \frac{T}{\delta}}}\right)\sqrt{{x}_{t,i}^{T}{\stackrel{~}{V}}_{t,i}^{1}{x}_{t,i}\cdot {\displaystyle \frac{det{\stackrel{~}{V}}_{t,i}}{det{\overline{V}}_{t,i}}}}$  
$\le O\left(\sqrt{d\mathrm{log}{\displaystyle \frac{T}{\delta}}}\right)\sqrt{{x}_{t,i}^{T}{\stackrel{~}{V}}_{t,i}^{1}{x}_{t,i}}.$ 
We can then use the argument for the single agent regret bound and prove regret in these good epochs.
For epoch $p$, we denote regret in all good epochs as $RE{G}_{good}$. Suppose ${\mathcal{B}}_{p}$ means the set of $(t,i)$ pairs that belong to epoch $p$, and ${P}_{good}$ means the set of good epochs, using lemma 3.5, we have
$RE{G}_{good}$  $={\displaystyle \sum _{t}}{\displaystyle \sum _{i}}{r}_{t,i}$  
$\le \sqrt{MT{\displaystyle \sum _{p\in {P}_{good}}}{\displaystyle \sum _{(t,i)\in {\mathcal{B}}_{p}}}{r}_{t,i}^{2}}$  
$\le O\left(\sqrt{dMT\mathrm{log}({\displaystyle \frac{T}{\delta}}){\displaystyle \sum _{p\in {P}_{good}}}{\displaystyle \sum _{(t,i)\in {\mathcal{B}}_{p}}}\mathrm{min}({\parallel {x}_{t,i}\parallel}_{{\stackrel{~}{V}}_{t,i}^{1}}^{2},1)}\right)$  
$\le O\left(\sqrt{dMT\mathrm{log}({\displaystyle \frac{T}{\delta}}){\displaystyle \sum _{p\in {P}_{good}}}\mathrm{log}\left({\displaystyle \frac{\mathrm{det}\left({V}_{p}\right)}{\mathrm{det}\left({V}_{p1}\right)}}\right)}\right)$  
$\le O\left(\sqrt{dMT\mathrm{log}({\displaystyle \frac{T}{\delta}})MT\mathrm{log}\left({\displaystyle \frac{\mathrm{det}\left({V}_{P}\right)}{\mathrm{det}\left({V}_{0}\right)}}\right)}\right)$  
$\le O\left(d\sqrt{MT}\mathrm{log}(MT)\right).$ 
Now we focus on epochs that are not good. Consider such a bad epoch where we reindex the matrices for convenience. Suppose at the start we have a ${V}_{last}$. Suppose that the length of the epoch is $n$. Then agent $i$ proceeds as ${V}_{1}^{(i)},\mathrm{\dots},{V}_{n}^{(i)}$. Our argument above tells us that
$$REG(n)\le O\left(\sqrt{d\mathrm{log}T/\delta}\right)\cdot \sum _{i=1}^{M}\sqrt{n\mathrm{log}\frac{det{V}_{n}^{(i)}}{det{V}_{last}}}.$$ 
Now, for all but 1 agent, $$. Therefore we can show that
$$REG(n)\le O\left(\sqrt{d\mathrm{log}T/\delta}\right)\cdot M\sqrt{D}.$$ 
We also know that the number of such epochs are rare. (Less than $R=O(d\mathrm{log}MT)$). Therefore the second part of the regret is
$$RE{G}_{bad}\le O\left(M{d}^{1.5}{\mathrm{log}}^{1.5}MT\right)\cdot {D}^{1/2}.$$ 
If we choose $D=\left(\frac{T\mathrm{log}MT}{dM}\right),$ then $REG(T)=O\left(d\sqrt{MT}{\mathrm{log}}^{2}(MT)\right).$ Since $T>M$, we have
$$REG(T)=O\left(d\sqrt{MT}{\mathrm{log}}^{2}(T)\right).$$ 
Communication: Let $C={\left(\frac{DT}{R}\right)}^{0.5}.$ There could be at most $\lceil T/C\rceil $ such epochs that contains more than $C$ rounds. For those containing less than $C$ steps, $\mathrm{log}\left(\frac{det{V}_{j+1}}{det{V}_{j}}\right)>\frac{D}{C}$. There could be at most $\lceil \frac{R}{D/C}\rceil =\lceil \frac{RC}{D}\rceil $ such epochs. Therefore, the total number of epochs is bounded by
$$\lceil \frac{T}{C}\rceil +\lceil \frac{RC}{D}\rceil =O\left(\sqrt{\frac{TR}{D}}\right).$$ 
With our choice of $D$, this is
$O\left({M}^{0.5}d\right).$ 
At each communication round, we need to pass $O\left(M{d}^{2}\right)$ amount of data. Hence total communication complexity should be
$$O\left({M}^{1.5}{d}^{3}\right).$$ 
∎
4 Eliminationbased Protocols
In this section, we discuss eliminationbased protocols for MAB and linear bandits. Compared with that of optimismbased protocols, the communication complexities of eliminationbased protocols enjoy a better dependence on $K$ (MAB setting) or $M$ and $d$ (linear bandits setting); however, in both cases, an additional $\mathrm{log}T$ factor is introduced.
4.1 An Eliminationbased Protocol for Multiarmed Bandits
In this subsection, we propose a distributed version of the phasedelimination algorithm [7]. We show that it achieves $O\left(\sqrt{MKT\mathrm{log}T}\right)$ regret with communication complexity of $O\left(M\mathrm{log}MKT\right)$. Note that under our usual assumption of $T>M+K$, the dependence on $K$ is removed.
4.1.1 Protocol Sketch
The phasedelimination algorithm [7] works with the following principles: 1. it maintains a set of good arms $A$, which is initially set to be $[K]$; 2. at every phase, each arm within the set is pulled for the same number of times; 3. at the end of each phase, badperforming arms are eliminated, with a margin that halves each phase.
It can be noted that the phasedelimination algorithm seems wellsuited for distributed applications: the task of sampling each arm can be easily parallelized. If the number of remaining arms is greater than $M$, we assign arms to agents (each agent gets multiple arms); if the number of remaining arms is less than $M$, we assign agents to arms (each agent is assigned to one arm). We design our distributed elimination algorithm based on the following additional observations:

1.
If the number of arms is less than $M$, we can easily simulate phased elimination with $O(M)$ communication;

2.
If the number of arms is greater than $M$, but at the end of a phase, the number of remaining arms assigned to each agent remains nearuniform^{5}^{5} 5 By sayiung a sequence of positive numbers is nearuniform, we mean that the maximum is at most $2$ times the minimum., we only need to compute the best arm in this phase ($O(M)$ communication) and proceed to the next phase;

3.
By assigning arms to agents randomly at initialization, the number of remaining arms is likely to remain nearuniform, until the number of remaining arms drops to $O(M\mathrm{log}M)$.
4.1.2 Protocol Description
{algorithm}[H]
\[email protected]@algorithmic[1] \StateInput: $K,M,T$ \StateGenerate $K$ random numbers from uniform distribution on $[M]$; send $K/M$ of these random numbers to each agent (Share Randomness) \State${k}_{0}=K$, ${k}_{max}=\lceil K/M\rceil $, $A=\mathrm{\varnothing}$ \For$l=1,2,3,\mathrm{\dots}$ \StateSend ${k}_{max}$ to each agent\If$$ \StatePartition arms in ${A}_{l}$ to agents evenly; schedule ${m}_{l}=\lceil {4}^{l+2}\mathrm{ln}MKT\rceil $ pulls for each arm \StateWait for all results to return \StateFor each $i\in {A}_{l}$, calculate ${\widehat{\mu}}_{i,l}$: the empirical mean of reward associated to arm $i$ \StateEliminate lowrewarding arms:
$${A}_{l+1}=\{i\in {A}_{l}:{\widehat{\mu}}_{i,l}+{2}^{l}\ge \underset{j\in {A}_{l}}{\mathrm{max}}{\widehat{\mu}}_{j,l}\}$$ 
${k}_{l}={k}_{l1}$ \Else\StateWait for each agent to report $({a}_{j,l}^{*},{\widehat{\mu}}_{j,l}^{*})$ in this phase \StateCalculate ${u}_{l}^{*}={\mathrm{max}}_{j\in [M]}{\widehat{\mu}}_{j,l}^{*}$, and send to every agent \StateWait for each agent to report their remaining number of arms ${n}_{l}(j)$ \If${\mathrm{max}}_{j\in [M]}{n}_{l}(i)>2{\mathrm{min}}_{j\in [M]}{n}_{l}(i)$ \StateRearrange arms to make number of arms nearuniform \EndIf\StateUpdate ${k}_{max}$ to be the maximum number of arms a agent holds; update ${k}_{l}$ to be the total number of remaining arms \If$$ \StateAsk agents to report all remaining arms, gather them in ${A}_{l+1}$ \EndIf\EndIf\StateNotify agents to proceed \EndFor
[H]
\[email protected]@algorithmic[1] \StateInput: $K,M,T$ \State${A}^{(i)}$ is given by the set of random numbers received at the beginning (Shared Randomness) \State${k}_{0}=K$ \For$l=1,2,3,\mathrm{\dots}$ \StateReceive ${k}_{max}$ \If$$ \StateReceive $({A}_{i,l},m)$ from the server \For$j\in {A}_{i,l}$ \StatePull arm $j$ for $m$ times \EndFor\StateUpload the average reward to the server; ${k}_{l}={k}_{l1}$ \Else\StatePull each arm ${m}_{l}=\lceil {4}^{l+2}\mathrm{ln}MKT\rceil $ times and store the average for arm $a$ as $u(a,l)$ \StatePull arms arbitrarily to wait to make total pulls ${m}_{l}\times {k}_{max}$ \StateSend $(\mathrm{arg}\mathrm{max}u(\cdot ,l),\mathrm{max}u(\cdot ,l))$ to server, wait for server to calculate maximum ${u}_{l}^{*}$ \StatePerform elimination on local arm set
$${A}_{l+1}^{(i)}=\{i\in {A}_{l}^{(i)}:{\widehat{\mu}}_{i,l}+{2}^{l}\ge {u}_{l}^{*}\}$$ 
report remaining number of arms ${n}_{l}(i)=\left{A}_{l+1}^{(i)}\right$ to server, wait for instruction \StateUpdate ${k}_{l}$ \EndIf\StateWait for notification from server \EndFor
Note: By “allocate arms to agents evenly” and “schedule $m$ pulls”, we mean the following. If ${A}_{l}\ge M$,

•
The server partitions ${A}_{l}$ into $M$ disjoint sets: ${A}_{1,l}\cup {A}_{2,l}\cup \mathrm{\dots}\cup {A}_{M,l}$ such that
$$\lfloor \frac{{A}_{l}}{M}\rfloor \le {A}_{i,l}\le \lceil \frac{{A}_{l}}{M}\rceil $$ 
•
${m}_{i,l}$ is set to be ${m}_{l}$ for all $i$

•
If ${A}_{l}/M$ is not an integer, we set $fla{g}_{i}$ to $1$ if and only if
$${A}_{i,l}=\lfloor \frac{{A}_{l}}{M}\rfloor .$$ This way, within each $l$, the total number of pulls for any agent will be the same.
If $$,

•
The server partitions $[M]$ into $n={A}_{l}$ disjoint sets ${B}_{1}\cup \mathrm{\dots}\cup {B}_{n}$ such that
$$b=\lfloor \frac{M}{{A}_{l}}\rfloor \le {B}_{i}\le \lceil \frac{M}{{A}_{l}}\rceil .$$ 
•
If agent $i$ is in ${B}_{j}$, the instruction will be
$$(\{j\},\lceil \frac{{m}_{l}}{b}\rceil ,0)$$
When timestep $T+1$ is reached, we automatically terminate the protocol.
4.1.3 Analysis
First of all, we state the following facts regarding the execution of the protocol.
Fact 1.
We use ${A}_{l}$ to refer to the set of remaining arms at beginning of phase $l$ (it is either ${A}_{l}$ or ${\mathrm{\bigcup}}_{i}{A}_{l}^{\mathrm{(}i\mathrm{)}}$ in the code). Then ${A}_{l}$ behaves as if a centralized phasedelimination algorithm is acting on it.
Fact 2.
For a fixed arm $i$, in a given phase $l$, $i$ is pulled (by all agents) at most $\mathrm{max}\mathit{}\mathrm{\{}\mathrm{2}\mathit{}{m}_{l}\mathrm{,}M\mathrm{/}K\mathrm{+}\mathrm{1}\mathrm{\}}$ times.
Fact 3.
Let $L\mathrm{=}\mathrm{\lceil}\mathrm{ln}\mathit{}M\mathit{}T\mathrm{\rceil}$. Then the largest reached $l$ when executing the protocol is at most $L$.
Proof.
If the protocol terminates at the $l$th phase, then the $l1$th phase must be terminated. Therefore $\lceil {4}^{l+1}\mathrm{ln}MKT\rceil \le MT\le \mathrm{exp}\{L\}.$ We conclude that $$. ∎
For convenience, if the protocol is terminated during the $l$th phase ($$), we define ${A}_{l+1}$,…,${A}_{L}$ to be ${A}_{l}$.
Next, we use arguments similar to those used in the single agent phasedelimination algorithm, and prove the following useful lemmas. With out loss of generality, we assume that arm $1$ is the optimal arm. Let ${\mathrm{\Delta}}_{k}=\mu (1)\mu (k)\ge 0$.
Lemma 4.1.
With probability $\mathrm{1}\mathrm{}\mathrm{2}\mathrm{/}\mathrm{(}M\mathit{}K\mathit{}T\mathrm{)}$, for all phases $l$, all arms $i\mathrm{\in}{A}_{l}$,
$$\left{\widehat{\mu}}_{i,l}{\mu}_{i,l}\right\le {2}^{l1}$$ 
Proof.
For any fixed $l$, fixed $i\in {A}_{l}$, ${\widehat{\mu}}_{i,l}$ is the empirical average of ${m}_{l}$ samples. Therefore by Hoeffding’s inequality,
$\mathrm{Pr}\left[\left{\widehat{\mu}}_{i,l}{\mu}_{i,l}\right>{2}^{l1}\right]$  $\le 2\mathrm{exp}\left\{{\displaystyle \frac{{m}_{l}{2}^{2l}}{8}}\right\}$  
$\le {\displaystyle \frac{2}{{M}^{2}{K}^{2}{T}^{2}}}$ 
The number of phases is at most $L\le MT$. Taking a union bound over all $i$ and $l$ proves the theorem. ∎
Lemma 4.2.
With probability $\mathrm{1}\mathrm{}\mathrm{2}\mathrm{/}\mathrm{(}M\mathit{}K\mathit{}T\mathrm{)}$, the following holds:
1. With probability , $\mathrm{1}\mathrm{\in}{A}_{l}$ for all $l$.
2. For $i\mathrm{\in}\mathrm{[}K\mathrm{]}$ with ${\mathrm{\Delta}}_{i}\mathrm{>}\mathrm{0}$, define ${l}_{i}\mathrm{=}\mathrm{\lceil}{\mathrm{log}}_{\mathrm{2}}\mathit{}{\mathrm{\Delta}}_{i}^{\mathrm{}\mathrm{1}}\mathrm{\rceil}\mathrm{+}\mathrm{1}$. Then $i\mathrm{\notin}{A}_{{l}_{i}\mathrm{+}\mathrm{1}}$.
Proof.
We condition on the event described in lemma 5.1. Observe that
$${\widehat{\mu}}_{1,l}\ge \mu (1){2}^{l1}\ge \mu (i){2}^{l1}\ge {\widehat{\mu}}_{i,l}{2}^{l}.$$ 
Therefore arm $1$ will never be eliminated. This proves the first assertion. On the other hand, at the end of round ${l}_{i}$,
${\widehat{\mu}}_{i,l}$  $\le \mu (i)+{2}^{{l}_{i}1}=\mu (1){\mathrm{\Delta}}_{i}+{2}^{{l}_{i}1}\le \mu (1){2}^{{l}_{i}1}.$ 
Therefore even if $i$ remains to ${A}_{{l}_{i}}$, it will be eliminated at that round. This proves the second assertion. ∎
We are now ready to state and prove our main results for the eliminationbased algorithm for distributed multiarmed bandit.
Theorem 3.
Protocol 3 achieves expected regret $O\mathit{}\mathrm{(}\sqrt{M\mathit{}K\mathit{}T\mathit{}\mathrm{log}\mathit{}T}\mathrm{)}$. It has communication complexity bounded by $O\mathit{}\mathrm{(}M\mathit{}\mathrm{log}\mathit{}T\mathrm{+}K\mathrm{)}$ (worst case), and expected communication complexity $O\mathit{}\mathrm{(}M\mathit{}\mathrm{log}\mathit{}M\mathit{}K\mathit{}T\mathrm{)}$.
Proof.
Regret: Let ${n}_{T}(i)$ be the number of times arm $i$ is pulled upon termination, which is a random variable. We know that total expected regret is
$$REG(T)=\sum _{i:{\mathrm{\Delta}}_{i}>0}\mathbb{E}[{n}_{T}(i)]\cdot {\mathrm{\Delta}}_{i}.$$ 
We also know that in phase $l$, either (1) the phase only lasts for $1$ step (this happens when ${k}_{l1}\cdot {m}_{l}\le M$) or (2) any arm is pulled for at most $2{m}_{l}$ times. If the phase only lasts for one step, then it holds that (i) there would be at most $M/K+1\le 2M/K$ pulls for an arm; and that (ii) ${4}^{l}\mathrm{ln}(MKT)\le {m}_{l}\le M$.
So by the lemmas above,
$\mathbb{E}[{n}_{T}(i)]$  $\le {\displaystyle \sum _{l=1}^{{l}_{i}}}2{m}_{l}+{\displaystyle \frac{2}{MKT}}\cdot MT+{\displaystyle \frac{2M\mathrm{ln}M}{K}}$  
$\le {\displaystyle \frac{{C}_{1}}{{\mathrm{\Delta}}_{i}^{2}}}\mathrm{ln}(MTK)+{\displaystyle \frac{{C}_{2}M\mathrm{ln}M}{K}},$ 
where ${C}_{1}$ and ${C}_{2}$ are universal constants. Assume that $T>M\mathrm{ln}M+K$. Then
$REG(T)$  $\le {\displaystyle \sum _{i:{\mathrm{\Delta}}_{i}>\u03f5}}{C}_{1}\cdot {\displaystyle \frac{1}{{\mathrm{\Delta}}_{i}}}\mathrm{ln}MKT+{\displaystyle \sum _{i:{\mathrm{\Delta}}_{i}\le \u03f5}}\mathbb{E}[{n}_{T}(i)]\cdot {\mathrm{\Delta}}_{i}+{C}_{2}M\mathrm{ln}M$  
$\le {\displaystyle \sum _{i:{\mathrm{\Delta}}_{i}>\u03f5}}{\displaystyle \frac{{C}_{1}}{\u03f5}}\mathrm{ln}MKT+\u03f5MT+{C}_{2}M\mathrm{ln}M$  
$\le {\displaystyle \frac{{C}_{3}K\mathrm{ln}T}{\u03f5}}+\u03f5MT+{C}_{4}\sqrt{KMT\mathrm{ln}T}.$ 
Choose $\u03f5=\sqrt{K\mathrm{ln}T/(MT)}$. Then we get $REG(T)\le {C}_{5}\sqrt{KMT\mathrm{ln}T}$. Herer ${C}_{3}$, ${C}_{4}$ and ${C}_{5}$ are all universal constants.
Communication: First, consider the protocol running with $$. Then, sending instructions involve sending at most $2{C}_{0}M$ numbers; similarly receiving averages involve communication $O(M)$ of numbers. Hence, after a protocol has $$, total communication is at most $O(ML)=O(M\mathrm{log}T)$.
Now, consider the case with ${k}_{l1}>{C}_{0}M$. Suppose that at the end of phase $l$, if no reassignment is needed, then communication in a phase is still $O\left(M\right).$ If reassignment is needed, the number of communicated numbers is at most the number of arms eliminated between that phase and the last reassignment (or assignment). Therefore, when running with ${k}_{l1}>{C}_{0}M$ for several phases, total communication is bounded (up to a universal constant) by $M\mathrm{log}T$ plus the number of eliminated arms. Therefore the total communication in the worst case is $O\left(M\mathrm{log}T+K\right).$
Now we focus on expected communication complexity. First, observe that the elimination process and the allocation of arms is completely independent. Therefore, we can apply concentration inequality for the set of remaining arms (although this set of arms is random, independence enables us to not use a union bound). More specifically, let
$${X}_{j,l}=I[j\in {A}_{l}],{Y}_{j,i}=I[\text{arm}j\text{is allocated to agent}i\text{initially}].$$ 
Then ${X}_{j,l}$ and ${Y}_{{j}^{\prime},i}$ are independent. Now suppose that no reassignment happened until phase $l1$. In phase $l$, after elimination, the number of arms remaining for agent $i$ is ${n}_{l}(i)={\sum}_{j=1}^{K}{X}_{j,l}{Y}_{j,i}$. We have that
$$\mathbb{E}\left[{n}_{l}(i)\right]=\sum _{j=1}^{K}\mathbb{E}{X}_{j,l}\mathbb{E}{Y}_{j,i}=\frac{1}{M}\mathbb{E}\left[{k}_{l}\right].$$ 
By Chernoff’s bound,
$$  $$ 
as long as
$$\mathbb{E}\left[{k}_{l}\right]>{C}_{6}M\mathrm{log}\frac{1}{\delta}.$$ 
(${C}_{6}$ is some universal constant.) Here, we only need $\delta $ to be ${\left(MKL\right)}^{1}$. After taking a union bound on $l$ and $i$, we arrive at the following statement: for any $l$, if
$$\mathbb{E}\left[{k}_{l}\right]>2{C}_{6}M\mathrm{log}\left(MKL\right),$$ 
then with probability $11/K$, there will be no reassignment until phase $l$.
On the other hand, if
$$ 
we can see that remaining total communication is
$$O\left(M\mathrm{log}T+M\mathrm{log}\left(MK\mathrm{log}T\right)\right)$$ 
by our worst case argument above. Therefore, total expected communication is upperbounded by
$${C}_{2}\left(M\mathrm{log}T+M\mathrm{log}MK+M\mathrm{log}\mathrm{log}T+\frac{1}{K}\left(M\mathrm{log}T+K\right)\right)=O\left(M\mathrm{log}MTK\right)$$ 
Under our usual assumption that $T>M\mathrm{ln}M+K$, this can be simplified to $O\left(M\mathrm{log}T\right)$. ∎
One caveat in this protocol is that it uses $K\mathrm{log}M$ bits of public randomness. However, using Newman’s theorem [22], we can remove this usage of public randomness at the cost of $O(M\mathrm{log}(MT))$ bits of additional communication. Since the task is not evaluating a function here, the argument is a simple modification of that of Newman’s theorem (see Appendix. B).
4.2 An Eliminationbased Protocol for Linear Bandits
This protocol is based on a single agent algorithm (algorithm 12 in [18]), which also iteratively eliminate arms from the initial action set. Similar to Protocol 3, we parallelize the data collection part of each phase, and send instructions in a communicationefficient way.
4.2.1 Protocol Description
For convenience, we define $V(\pi )={\sum}_{a\in \mathcal{D}}\pi (a)a{a}^{T},$ $g(\pi )={\mathrm{max}}_{a\in \mathcal{D}}{a}^{T}V{(\pi )}^{1}a.$ {algorithm}[H] \[email protected]@algorithmic[1] \State${A}_{1}=\mathcal{D}$ \For$l=1,2,3,\mathrm{\dots}$ \StateFind distribution ${\pi}_{l}(\cdot )$ over ${A}_{l}$ such that: 1. its support has size at most $\xi =32d\mathrm{ln}\mathrm{ln}d$; 2. $g(\pi )\le 2d$. \StateSchedule ${m}_{l}(a)$ pulls for each arm $a\in {A}_{l}$; ${m}_{l}(a)=\lceil {C}_{1}{4}^{l}{d}^{2}{\pi}_{l}(a)\mathrm{ln}MT\rceil $ \StateReceive rewards for each arm $a\in {A}_{l}$ reported by agents \StateFor each arm in the support of ${\pi}_{l}(\cdot )$, calculate the average reward $\mu (a)$ \StateCompute
$$X=\sum _{a\in {A}_{l}}{m}_{l}(a)\mu (a)a,{V}_{l}=\sum _{a\in {A}_{l}}{m}_{l}(a)a{a}^{T},\widehat{\theta}={V}_{l}^{1}X.$$ 
Send $\widehat{\theta}$ to all agents \StateEliminate low rewarding arms:
$${A}_{l+1}=\{a\in {A}_{l}:\underset{b\in {A}_{l}}{\mathrm{max}}\u27e8\widehat{\theta},ba\u27e9\le {2}^{l+1}\}.$$ 
[H]
\[email protected]@algorithmic[1] \State${A}_{1}=\mathcal{D}$ \For$l=1,\mathrm{\dots}$ \StateFind distribution ${\pi}_{l}(\cdot )$ over ${A}_{l}$ such that: 1. its support has size at most $\xi =32d\mathrm{ln}\mathrm{ln}d$; 2. $g(\pi )\le 2d$. \StateReceive a set of arms, and numbers to pull each of them \StatePull the arms according to instruction \StateReport the average reward for each arm \StateReceive $\widehat{\theta}$ from server \StateEliminate low rewarding arms on local copy:
$${A}_{l+1}=\{a\in {A}_{l}:\underset{b\in {A}_{l}}{\mathrm{max}}\u27e8\widehat{\theta},ba\u27e9\le {2}^{l+1}\}.$$ 
As above, if the protocol reaches timestep $T+1$, it is automatically terminated. The specific meaning of “schedule ${m}_{l}(a)$ pulls for arm $a$” will be explained in the proof of Theorem 4.
The task in line $3$ of the protocol is closely related to finding a $G$optimal experiment design [23]. In the $G$optimal design task, we require $g(\pi )$ to be actually minimized, and the minimum value is $d$ [16]. For the exact $G$optimal design task, it is known that there is a solution ${\pi}^{*}$ such that the support of ${\pi}^{*}$ (also known as core set in literature) has size at most $\xi =d(d+1)/2$ [27]. Here, we only need a $2$approximation to $g(\pi )$, but require the solution to have a small support size $S$.
This task can be shown to be feasible. In particular, when $\mathcal{D}$ is finite, the FrankWolfe algorithm under appropriate initialization can find such an approximate solution (see Proposition 3.19 [27]). When $\mathcal{D}$ is infinite, we may replace $\mathcal{D}$ with an epsilonnet of $\mathcal{D}$ (and only take actions in the $\u03f5$net). If $$, then our regret bound when the action set is the $\u03f5$net implies the regret bound when the action set is $\mathcal{D}$. This approach, albeit feasible, may not be efficient.
4.2.2 Analysis
First, we consider some properties of the above protocol.
Fact 4.
The length of the $l$th phase is
$${C}_{1}{4}^{l}{d}^{2}\mathrm{log}MT\le {T}_{l}\le 32d\mathrm{log}\mathrm{log}d+{C}_{1}{4}^{l}{d}^{2}\mathrm{log}MT$$ 
Lemma 4.3.
At phase $l$, with probability $\mathrm{1}\mathrm{}\mathrm{1}\mathrm{/}T\mathit{}M$, for any $a\mathrm{\in}\mathrm{D}$,
$$\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right\le {2}^{l}.$$ 
Proof.
First, construct an $\u03f5$covering of $\mathcal{D}$ with ${\u03f5}_{l}={2}^{l2}$. Denote the center of the covering as $X=\{{x}_{1},\mathrm{\dots},{x}_{Q}\}$. Here $Q\le {({C}_{2})}^{d(l+2)}$.
For fixed $a\in \mathcal{D}$, we know that with probability $12\delta $,
$$\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right\le \sqrt{2{\parallel a\parallel}_{{V}_{l}^{1}}^{2}\mathrm{log}\frac{1}{\delta}}.$$ 
In our case,
${\parallel a\parallel}_{{V}_{l}^{1}}^{2}\le {\displaystyle \frac{2}{{4}^{l}{C}_{1}d\mathrm{log}MT}}.$ 
Therefore with probability $12\delta $,
$$\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right\le {2}^{l+1}\sqrt{\frac{1}{{C}_{1}d\mathrm{log}MT}\mathrm{log}\frac{1}{\delta}}.$$ 
Choose $\delta =1/(2TMQ)$. We can see that that there exists a ${C}_{1}$ such that with probability $11/(TM)$, for all $a\in X$
$$\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right\le {2}^{l1}.$$ 
Now, consider an arbitrary $a\in \mathcal{D}$. There exists $x\in X$ such that $\parallel ax\parallel \le {2}^{l1}$. Therefore with probability $11/TM$, for any $a\in \mathcal{D}$,
$\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right$  $\le \left\u27e8\widehat{\theta}{\theta}^{*},x\u27e9\right+\left\u27e8\widehat{\theta}{\theta}^{*},ax\u27e9\right$  
$\le {2}^{l1}+\parallel \widehat{\theta}\widehat{{\theta}^{*}}\parallel \cdot \parallel ax\parallel $  
$\le {2}^{l}.$ 
∎
Lemma 4.4.
Let ${a}^{\mathrm{*}}\mathrm{=}\mathrm{arg}\mathit{}{\mathrm{max}}_{a\mathrm{\in}D}\mathit{}\mathrm{\u27e8}{\theta}^{\mathrm{*}}\mathrm{,}a\mathrm{\u27e9}$ be the optimal arm. Then with probability $\mathrm{1}\mathrm{}\mathrm{log}\mathit{}\mathrm{(}M\mathit{}T\mathrm{)}\mathrm{/}\mathrm{(}T\mathit{}M\mathrm{)}$, ${a}^{\mathrm{*}}$ will not be eliminated until the protocol terminates.
Proof.
If ${a}^{*}$ is eliminated at the end of round $l$, one of the following must happen: either (1) $\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right>{2}^{l}$; or (2) there exists $a\ne {a}^{*}$, $\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right>{2}^{l}$. Therefore the probability for ${a}^{*}$ to be eliminated at a particular round is less than $11/(TM)$. The total number of phases is at most $\mathrm{log}MT$. Hence a union bound proves the proposition. ∎
Lemma 4.5.
For suboptimal $a\mathrm{\in}\mathrm{D}$, define ${l}_{a}\mathrm{=}\mathrm{inf}\mathit{}\mathrm{\{}l\mathrm{:}\mathrm{4}\mathrm{\cdot}{\mathrm{2}}^{\mathrm{}l}\mathrm{\le}{\mathrm{\Delta}}_{a}\mathrm{\}}$. Then with probability $\mathrm{1}\mathrm{}\delta $, for any suboptimal $a$, $a\mathrm{\ne}{A}_{{l}_{a}}$. $\delta \mathrm{=}\mathrm{2}\mathit{}\mathrm{log}\mathit{}\mathrm{(}T\mathit{}M\mathrm{)}\mathrm{/}T\mathit{}M$.
Proof.
First, let us only consider the case where ${a}^{*}$ is not eliminated. That is,
$\mathrm{Pr}\left[\exists a\in \mathcal{D}:a\in {A}_{{l}_{a}}\right]$  $\le \mathrm{Pr}\left[{a}^{*}\text{eliminated}\right]+\mathrm{Pr}\left[\exists a:a\in {A}_{{l}_{a}1},a\in {A}_{{l}_{a}}{a}^{*}\in {A}_{{l}_{a}}\right].$ 
Note that conditioned on ${a}^{*}\in {A}_{{l}_{a}}$, $\left\{a\in {A}_{{l}_{a}1}\wedge a\in {A}_{{l}_{a}}\right\}$ implies that at phase ${l}_{a}$, either $\left\u27e8\widehat{\theta}{\theta}^{*},a\u27e9\right>{2}^{{l}_{a}}$ or $\left\u27e8\widehat{\theta}{\theta}^{*},{a}^{*}\u27e9\right>{2}^{{l}_{a}}$. Therefore the probability that there exists such $a$ is less than $\mathrm{log}(TM)/TM$. Hence, with probability $12\mathrm{log}(TM)/TM$, $a$ will be eliminated before phase ${l}_{a}$. ∎
We are now ready to state and prove our main result for Protocol 4.
Theorem 4.
Protocol 4 has expected regret $O\mathit{}\mathrm{\left(}d\mathit{}\sqrt{T\mathit{}M\mathit{}\mathrm{log}\mathit{}T}\mathrm{\right)}$, and has communication complexity $O\mathit{}\mathrm{\left(}\mathrm{(}M\mathit{}d\mathrm{+}d\mathit{}\mathrm{log}\mathit{}\mathrm{log}\mathit{}d\mathrm{)}\mathit{}\mathrm{log}\mathit{}T\mathrm{\right)}$.
Proof.
Regret: We note that at the start of round $l$, the remaining arms have suboptimality gap at most $4\cdot {2}^{l}$. Suppose that the last finished phase is $L$. Therefore total regret is
$REG(T)$  $\le {\displaystyle \sum _{l=1}^{L}}{C}_{1}{4}^{l}{d}^{2}\mathrm{log}MT\cdot 4\cdot {2}^{l}+\delta \cdot 2MT$  
$\le {C}_{3}{2}^{L}{d}^{2}\mathrm{log}TM.$ 
Apparently ${C}_{1}{4}^{L}{d}^{2}\mathrm{log}TM\le TM$. Therefore
$REG(T)\le \sqrt{{C}_{3}^{2}{4}^{L}{d}^{4}{\mathrm{log}}^{2}TM}\le {C}_{4}d\sqrt{TM\mathrm{log}TM}.$ 
Under our usual assumption that $T>M$, this can be simplified to $O(d\sqrt{TM\mathrm{log}T})$.
Communication Complexity: Observe that at the start of each phase, each agent has the same ${A}_{l}$ as the server. Therefore, at line $3$ they obtain the same ${\pi}_{l}$. In that case, when allocating the arms, the server only needs to send a index (${\mathrm{log}}_{2}\xi $ bits), instead of a vector ($\mathrm{\Omega}(d)$ bits), to identify an arm.
Now we specifically state how to schedule ${m}_{l}(a)$ pulls for each arm. Let us use ${p}_{l}={\sum}_{a}{m}_{l}(a)/M$ to denote the average pulls each agent needs to perform. Starting from the arm with the largest ${m}_{l}(a)$, we schedule pulls in full blocks whenever possible. For instance if ${m}_{l}(a)=4.1{p}_{l}$, we give assign four agents with arm $a$ only; and for the fifth agent, we assign him with $0.1p$ of load. Suppose that this scheduling is done one arm by another with descending load. Continuing the above example, if for the next arm ${m}_{l}({a}^{\prime})=3.5{p}_{l}$, we assign $0.9{p}_{l}$ to the fifth agent, and the rest to $3$ agents. Observe that for each arm, the number of agents that it is assigned to is at most $1+\lceil {m}_{l}(a)/{p}_{l}\rceil $ agents. Therefore, total communication for scheduling is at most
$$\sum _{a}\left(\lceil {m}_{l}(a)/{p}_{l}\rceil +1\right)\le 2\xi +M=O(M+d\mathrm{log}\mathrm{log}d).$$ 
Similarly, total communication for reporting averages is the same. The cost for sending $\widehat{\theta}$ is $Md$. Hence, communication cost per phase is $O(Md+d\mathrm{log}\mathrm{log}d)$. On the other hand, total number of phases is apparently $O(\mathrm{log}TM)$. Hence total communication is
$$O\left((Md+d\mathrm{log}\mathrm{log}d)\mathrm{log}TM\right)=O\left(Md\mathrm{log}\mathrm{log}d\mathrm{log}TM\right).$$ 
Under the assumption that $T>M$, this can be simplified to $O\left((Md+d\mathrm{log}\mathrm{log}d)\mathrm{log}T\right)$. ∎
Note that the replicated calculation of ${\pi}_{l}(\cdot )$ on the agent’s side is a trick purposed to reduce the number of bits/packets required to designate an arm. If this trick is not used, the server may need $d$ numbers, instead of one, to identify an arm; therefore the total communication complexity will become $O\left((Md+{d}^{2}\mathrm{log}\mathrm{log}d)\mathrm{log}T\right)$.
5 Discussion
5.1 Lower Bound for Multiarmed Bandits
Intuitively, in order to avoid a $\mathrm{\Omega}\left(M\sqrt{KT}\right)$ scaling of regret, $O(M)$ amount of communication is necessary: otherwise most of the agents can hardly do better than a singleagent algorithm. We prove this intuition in the following lower bound(which applies to the expected communication complexity of randomized protocols).
Theorem 5.
For protocols with expected communication complexity less than $M\mathrm{/}c$, there exists multiarmed bandits instances such that total regret scales as $\mathrm{\Omega}\mathit{}\mathrm{(}M\mathit{}\sqrt{K\mathit{}T}\mathrm{)}$. Here $c\mathrm{=}\mathrm{3000}$.
The proof of this lower bound is a simple reduction from singleagent bandits to multiagent bandits, mapping protocols to singleagent algorithms (see Appendix. C). Since one can simulate MAB with linear bandits by setting an action set of $K$ orthogonal vectors, this directly induces a $\mathrm{\Omega}(M\sqrt{dT})$ lower bound for linear bandits when communication complexity is less than $M/c$.
We note that our results in distributed multiarmed bandits match this lower bound except for a $\mathrm{log}T$ factor. In this sense our protocol for multiarmed bandits is indeed communicationefficient. Moreover, since $\mathrm{\Omega}(M\sqrt{KT})$ is achievable with no communication at all, it is essentially the worst regret for a communication protocol. Therefore, this lower bound suggests that the communicationregret tradeoff for distributed MAB is a steep one: with $O\left(M\mathrm{log}T\right)$ communication, regret can be nearoptimal; with slightly less communication ($M/c$), regret necessarily deteriorates to the worst case.
5.2 Comparison of Optimismbased and Eliminationbased Protocols
In the distributed multiarmed bandit task and linear bandit task, optimismbased protocols (Protocol 1 and 2) require $O\left(MK\mathrm{log}M\right)$ and $O\left({M}^{1.5}{d}^{3}\right)$ communication respectively. Eliminationbased protocols, however, require only $\left(M\mathrm{log}T\right)$ and $O\left(Md\mathrm{log}\mathrm{log}d\mathrm{log}T\right)$. In the regime where $T$ is not exponentially large, eliminationbased protocols seem to be more communicationefficient than optimismbased protocols. In particular, the dependence on $K$ or $d$ is significantly lower; this is mostly because eliminationbased protocols allocate arms to agents.
On the other hand, we also note that optimismbased protocols (Protocol 1 and 2), compared to eliminationbased ones (Protocol 3 and 4), seem somewhat more natural and practical. For instance, the optimization involved in Protocol 4 is potentially much harder than the optimization used in Protocol 2.
5.3 Future Work
In this work, we propose communicationefficient protocols with nearoptimal worstcase regret bounds. However, this may not guarantee closetocentralized performance for a particular instance. Adopting our results to the case where instancedependent regret is the performance measure may be an interesting problem. Another interesting problem is providing a better communication lower bound for distributed linear bandits (in order to achieve nearoptimal worstcase regret). We conjecture that closeto$Md$ amount of communication may be necessary, which is the case in offline distributed linear regression [29, 8] in the context of achieving an optimal risk rate.
It will also be great if the two design principles, namely optimism and elimination, can be combined to get the best of both worlds. Communication complexity of such protocols potentially may enjoy small dependencies on timescale $T$ and size of action set ($K$ or $d$) simultaneously.
We also realize that realworld distributed systems typically suffer communication latency, and computation of the agents may be asynchronous. Therefore it would be interesting to consider protocols that tolerates latency and delays. However, we note that to some extent, the synchronous case is the hard case for regret minimization, as there are $M$ agents working in concurrent throughout the execution. If we can keep most agents and only actively run $1$ agent at a time, optimal regret can be easily achieved by simulating a singleagent algorithm, using communication less than $M$ times the space complexity of that singleagent algorithm.
6 Acknowledgments
The authors thank Chi Jin, Chicheng Zhang and Nan Jiang for helpful discussions.
References
 [1] Yasin AbbasiYadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 [2] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
 [3] Deepak Agarwal, BeeChung Chen, Pradheep Elango, Nitin Motgi, SeungTaek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. Online models for content optimization. In Advances in Neural Information Processing Systems, pages 17–24, 2009.
 [4] JeanYves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
 [5] JeanYves Audibert and Sébastien Bubeck. Best arm identification in multiarmed bandits. In COLT23th Conference on learning theory2010, pages 13–p, 2010.
 [6] Peter Auer. Using confidence bounds for exploitationexploration tradeoffs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 [7] Peter Auer and Ronald Ortner. Ucb revisited: Improved regret bounds for the stochastic multiarmed bandit problem. Periodica Mathematica Hungarica, 61(12):55–65, 2010.
 [8] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the fortyeighth annual ACM symposium on Theory of Computing, pages 1011–1020. ACM, 2016.
 [9] Sébastien Bubeck, Nicolo CesaBianchi, et al. Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 [10] Swapna Buccapatnam, Jian Tan, and Li Zhang. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 2605–2613. IEEE, 2015.
 [11] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. Coordinated versus decentralized exploration in multiagent multiarmed bandits. In IJCAI, pages 164–170. ijcai.org, 2017.
 [12] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 [13] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.
 [14] Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multiarmed bandits. In Advances in Neural Information Processing Systems, pages 854–862, 2013.
 [15] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multiarmed bandits. In Conference on Learning Theory, pages 423–439, 2014.
 [16] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
 [17] Nathan Korda, Balázs Szörényi, and Li Shuai. Distributed clustering of linear bandits in peer to peer networks. In Journal of machine learning research workshop and conference proceedings, volume 48, pages 1301–1309. International Machine Learning Societ, 2016.
 [18] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2019.
 [19] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
 [20] Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimaxoptimal regret for linearly parameterized bandits. arXiv preprint arXiv:1904.00242, 2019.
 [21] Oded Maron and Andrew W Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in neural information processing systems, pages 59–66, 1994.
 [22] Ilan Newman. Private vs. common random bits in communication complexity. Information processing letters, 39(2):67–71, 1991.
 [23] Friedrich Pukelsheim. Optimal design of experiments. SIAM, 2006.
 [24] Alistair Sinclair. Lecture 11: February 20. https://people.eecs.berkeley.edu/~sinclair/cs271/n11.pdf.
 [25] Balázs Szörényi, Róbert BusaFekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossipbased distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, pages 1056–1064. International Machine Learning Societ, 2013.
 [26] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multiarmed bandits. arXiv preprint:1904.03293, 2019.
 [27] Michael J Todd. Minimumvolume ellipsoids: Theory and algorithms, volume 23. SIAM, 2016.
 [28] YouGan Wang. Sequential allocation in clinical trials. Communications in StatisticsTheory and Methods, 20(3):791–805, 1991.
 [29] Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Informationtheoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.
Appendix A A A Reduction from Regret Minimization Protocols to Best Arm Identification Protocols
We show how to produce a communication protocol identifying the best arm with high probability using our regret minimization protocols.
Theorem 6.
For any multiarmed bandits instance $\nu $, suppose arm 1 is the best arm. Let ${\mathrm{\Delta}}_{i}\mathrm{(}i\mathrm{>}\mathrm{1}\mathrm{)}\mathrm{:=}\mu \mathrm{(}\mathrm{1}\mathrm{)}\mathrm{}\mu \mathrm{(}i\mathrm{)}\mathrm{,}{\mathrm{\Delta}}_{\mathrm{min}}\mathrm{:=}{\mathrm{min}}_{k\mathrm{>}\mathrm{1}}{\mathrm{\Delta}}_{k}\mathrm{,}{\mathrm{\Delta}}_{\mathrm{max}}\mathrm{:=}{\mathrm{max}}_{k\mathrm{>}\mathrm{1}}{\mathrm{\Delta}}_{k}$. We assume that $$ for some constant ${c}_{d}\mathrm{>}\mathrm{0}$.
Assume that we have a a regret minimization protocol $\mathrm{A}\mathit{}\mathrm{(}M\mathrm{,}K\mathrm{,}T\mathrm{)}$, in which the server is able to calculate the number of pulls among all agents for each arm at the end of the protocol (our protocol 1 and protocol 3 both satisfy this property). Moreover, $\mathrm{A}\mathit{}\mathrm{(}M\mathrm{,}K\mathrm{,}T\mathrm{)}$ achieves nearoptimal minimax regret bound $O\mathit{}\mathrm{(}\sqrt{M\mathit{}K\mathit{}T\mathit{}\mathrm{log}\mathit{}T}\mathrm{)}$.
Given $\mathrm{A}\mathit{}\mathrm{(}M\mathrm{,}K\mathrm{,}T\mathrm{)}$ running on $\nu $, there exists a new protocol ${\mathrm{A}}^{\mathrm{\prime}}\mathit{}\mathrm{(}M\mathrm{,}K\mathrm{,}\u03f5\mathrm{)}$ that identifies the best arm as long as $$ with probability at least $\mathrm{1}\mathrm{}\delta $ for any $\delta \mathrm{>}\mathrm{0}$. Furthermore, ${\mathrm{A}}^{\mathrm{\prime}}\mathit{}\mathrm{(}M\mathrm{,}K\mathrm{,}\u03f5\mathrm{)}$ has $\stackrel{\mathrm{~}}{\mathrm{\Omega}}\mathit{}\mathrm{(}M\mathrm{)}$ speedup, with communication cost at most ${C}_{\mathrm{A}}\mathit{}\mathrm{(}M\mathrm{,}K\mathrm{,}\stackrel{\mathrm{~}}{O}\mathit{}\mathrm{(}K\mathrm{/}\mathrm{(}M\mathit{}{\u03f5}^{\mathrm{2}}\mathrm{)}\mathrm{)}\mathrm{)}$.
Proof.
${\mathcal{A}}^{\prime}(M,K,\u03f5)$ runs on $\nu $ as follows:

1.
Let ${T}_{1}$ be some integer that ${T}_{1}=\stackrel{~}{O}(K/(M{\u03f5}^{2}))$. Run $\mathcal{A}(M,K,{T}_{1})$.

2.
Suppose the global pulls for arm $k$ among all agents at the end of $\mathcal{A}$ is ${N}_{k}$ (maintained by the server), the server randomly choose an arm ${k}^{*}$ with probability ${N}_{k}/{\sum}_{k=1}^{K}{N}_{k}$ for arm $k$.

3.
returns arm ${k}^{*}$.
Let’s analyze protocol ${\mathcal{A}}^{\prime}$ now.
Recall that
$$REG({T}_{1})=\sum _{t=1}^{{T}_{1}}\sum _{i=1}^{M}(\mu (1)\mu ({a}_{t,i}))=MT\mu (1)\sum _{k=1}^{K}{N}_{k}\mu (k)=O(\sqrt{MK{T}_{1}\mathrm{log}{T}_{1}})$$ 
so we have
$$\frac{REG({T}_{1})}{M{T}_{1}}\le O(\sqrt{\frac{K\mathrm{log}{T}_{1}}{M{T}_{1}}})$$ 
By choosing proper ${T}_{1}=\stackrel{~}{O}(K/(M{\u03f5}^{2}))$, we have
$$\frac{REG({T}_{1})}{M{T}_{1}}\le \frac{\u03f5}{3}$$ 
Therefore, by Markov’s inequality,
$$ 
Note that by median trick [24], in order to bound the error probability by $\delta $, the server only needs to sample ${k}^{*}$ independently for $O(\mathrm{log}(1/\delta ))$ times, and return the majority vote of them.
It’s easy to observe the communication cost of ${\mathcal{A}}^{\prime}$ by ${C}_{\mathcal{A}}(M,K,{T}_{1})={C}_{\mathcal{A}}(M,K,\stackrel{~}{O}(K/(M{\u03f5}^{2})))$.
According to [5], at least $\stackrel{~}{\mathrm{\Omega}}(\mathrm{log}(1/\delta {\sum}_{i=1}^{K}1/{\mathrm{\Delta}}_{i}^{2})$ samples are required by any singleagent algorithm to identify the best arm with probability $1\delta $. By our assumption we know that $K/{\u03f5}^{2}=O({\sum}_{i=1}^{K}1/{\mathrm{\Delta}}_{i}^{2})$, therefore the speedup is $\stackrel{~}{\mathrm{\Omega}}(M)$.
∎
Note that if $\mathcal{A}$ is Protocol 3, we are able to identify the best arm using $\stackrel{~}{O}(\mathrm{log}(K/M{\mathrm{\Delta}}_{\mathrm{min}}))$ communication rounds. Further more, if $\mathcal{A}$ is our Protocol 1, we can identify the best arm with communication complexity independent of ${\mathrm{\Delta}}_{\mathrm{min}}$; hence it is a very efficient protocol when ${\mathrm{\Delta}}_{\mathrm{min}}$ is small.
Appendix B B Removing Public Randomness
Protocol 3 exploits a certain amount of shared randomness. We now discuss how to remove the usage of public randomness with little increase in communication complexity.
The role of shared randomness in communication complexity has already been investigated; it is known that we can efficiently “privatize” shared randomness, as stated by the Newman’s Theorem [22]. In our case, the arugment is slightly different: we are considering regret instead of evaluating a function. In particular, we show the following theorem.
Theorem 7.
There exists a protocol that does not use public randomness with expected regret $O\mathit{}\mathrm{(}\sqrt{M\mathit{}K\mathit{}T\mathit{}\mathrm{log}\mathit{}T}\mathrm{)}$. It has communication complexity bounded by $O\mathit{}\mathrm{(}M\mathit{}\mathrm{log}\mathit{}T\mathrm{+}K\mathrm{)}$ (worst case), and expected communication complexity $O\mathit{}\mathrm{(}M\mathit{}\mathrm{log}\mathit{}T\mathrm{)}$.
Proof.
We make the following modifications to Protocol 3. Instead of using a public random bit string $r$, we predetermine $B$ strings ${r}_{1}$,…,${r}_{B}$, and randomly choose from them. That is, the server will generate a random number in $[B]$, and broadcast it to everyone. The communication cost of doing so will be $M\mathrm{log}B$. We now analyze how the choice of ${r}_{1},\mathrm{\dots},{r}_{B}$ affects our protocols performance.
Now define $f(X,r)$ to be the expected regret for our protocol uses the public random string $r$ and interacts with the multiarmed bandit $X$. Our analysis above tells us that $\forall X$,
${\mathbb{E}}_{r}\left[f(X,r)\right]\le {C}_{1}\sqrt{MKT\mathrm{log}T}.$ 
Therefore, if we draw i.i.d. ${r}_{1},\mathrm{\dots},{r}_{B}$,
${\mathbb{E}}_{{r}_{1},\mathrm{\dots},{r}_{B}}\left[{\displaystyle \frac{1}{B}}{\displaystyle \sum _{i=1}^{B}}f(X,{r}_{i})\right]\le {C}_{1}\sqrt{MKT\mathrm{log}T}.$ 
We say that a set of random strings is bad for a bandit $X$ if
$$\frac{1}{B}\sum _{i=1}^{B}f(X,{r}_{i})>2{C}_{1}\sqrt{MKT\mathrm{log}T}.$$ 
We know that $0\le f(X,{r}_{i})\le MT$. Therefore, by Hoeffding’s inequality,
$\underset{{r}_{1},\mathrm{\dots},{r}_{B}}{\mathrm{Pr}}\left[{\displaystyle \frac{1}{B}}{\displaystyle \sum _{i=1}^{B}}f(X,{r}_{i})>2{C}_{1}\sqrt{MKT\mathrm{log}T}\right]\le \mathrm{exp}\left\{{\displaystyle \frac{2B{C}_{1}^{2}K\mathrm{ln}T}{MT}}\right\}.$ 
In other words, for fixed $X$, the probability that we will draw a bad set $\{{r}_{1},\mathrm{\dots},{r}_{B}\}$ is exponentially small. Therefore, for a family of bandits with size $Q$, the probability for drawing a set of ${r}_{1},\mathrm{\dots},{r}_{B}$ that is bad for some bandit is at most $Q\cdot \mathrm{exp}\left\{\frac{2B{C}_{1}^{2}K\mathrm{ln}T}{MT}\right\}$. If we can show that this is smaller than $1$, it would follow that there exists ${r}_{1}$,…,${r}_{B}$ such that it is good for every bandit in the set.
Now, we consider the following family $\mathcal{X}$ of bandits. For each arm, the expected reward could be $a/\mathrm{\Delta}$, where $0\le a\le {\mathrm{\Delta}}^{1}$. The size of this family is $Q={\mathrm{\Delta}}^{K}$. Now, consider any other bandit $X$. Apparently we can find a bandit ${X}^{\prime}\in \mathcal{X}$ such that their expected rewards are $\mathrm{\Delta}$close in $\parallel \cdot {\parallel}_{\mathrm{\infty}}$. Then, by applying informationtheoretic arguments used in the proof for regret lower bounds [18], we can show that for any policy, their expected rewards differ by at most $O\left(MT\sqrt{MT{\mathrm{\Delta}}^{2}}\right)$. Note that a communication protocol is just a special form a singleagent policy on a $MT$length trial; so this bound applies just as well. Therefore, if $\mathrm{\Delta}={(MT)}^{1.5}$, it suffices to consider bandits in $\mathcal{X}$. In this case, $Q={(MT)}^{1.5K}$.
Therefore, we only need guarantee that $\frac{BK\mathrm{log}T}{MT}>{C}^{\prime}K\mathrm{log}MT$, where ${C}^{\prime}$ is a universal constant. This can be met by setting $B={C}^{\prime}MT\left(\mathrm{log}(M)+1\right)$. In this case, the additional communication overhead is
$$O\left(M\mathrm{log}B\right)=O\left(M\mathrm{log}(MT)\right).$$ 
Therefore, under our usual assumption of $T>M+K$, total communication is bounded by $O\left(M\mathrm{log}T\right)$. ∎
Appendix C C Proof for Theorem 5
Proof.
First of all, we list two lemmas that will be used in our proof.
Lemma C.1.
(Theorem 9.1 [18]) For $K$armed bandits, there is an algorithm with expected regret
$$REG(T)\le 38\sqrt{KT}.$$ 
Lemma C.2.
(Theorem 15.2 [18]) For $K$armed bandits, we can prove a minimax regret lower bound of
$$REG(T)\ge \frac{1}{75}\sqrt{(K1)T}.$$ 
The original lower bound is proved for Gaussian bandits, which doesn’t fit exactly in our setting. we modified the proof to work for Bernoulli bandits, which resulted in a different constant.
We now prove the theorem’s statement via a reduction from single agent armed bandit to multiagent armed bandit. That is, we map communication protocols to singleagent algorithms in the following way. Consider a communication protocol with communication complexity $B(M)$. We denote ${X}_{i}$ ($i\in [M]$) to be the indicator function for agent $i$’s sending or receiving a data packet throughout a run. ${X}_{i}$ is a random variable. Since expected communication complexity is less than $M/c$,
$$\sum _{i}\mathbb{E}{X}_{i}\le M/c.$$ 
Now consider the $M/2$ agents with smallest $\mathbb{E}{X}_{i}$. Apparently all of them have $\mathbb{E}{X}_{i}\le 2/c$. That is, for any of these agents, the probability of either speaking or hearing from someone is less than $2/c$. Suppose that agent $j$ is such a agent. Then, we can map the communication protocol to a singleagent algorithm by simulating agent $j$.
The simulation is as follows. Facing single agent bandit with time $T$, we run the code for agent $i$ in the protocol. When no communication is needed, we are fine (continue to the next line of code). When the code mentions anything related with communication (e.g. send a message, wait for a message, etc.), we terminate the code. For the rest of the timesteps, we run a singleagent optimal algorithm (the one used by lemma C.1).
Then, if agent $j$’s code has $\delta $ probability for involving in communication, and agent $j$’s regret $RE{G}_{j}(T)\le A$, via this reduction, we can obtain an algorithm with expected regret
$$REG(T)\le A+\delta \cdot 38\sqrt{KT}.$$ 
By lemma $2$, $REG(T)$ cannot have a regret upperbound better than $\sqrt{T(K1)}/75$. Therefore
$$A+\delta \cdot 38\sqrt{KT}\ge \sqrt{(K1)T}/75.$$ 
If $$, we can show that $A=\mathrm{\Omega}\left(\sqrt{KT}\right)$. In our case, let $c=3000$ will suffice. Then, since we can show this for $M/2$ agents, we can show that total regret is $\mathrm{\Omega}\left(M\sqrt{KT}\right)$. ∎