Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret

Abstract

We study the communication complexity of distributed multi-armed bandits(MAB) and distributed linear bandits for regret minimization. We proposecommunication protocols that achieve near-optimal regret bounds and result inoptimal speed-up under mild conditions. We measure the communication cost ofprotocols by the total number of communicated numbers. For multi-armed bandits,we give two protocols that require little communication cost, one isindependent of the time horizon $ T $ and the other is independent of thenumber of arms $ K $. In particular, for a distributed $K$-armed bandit with$M$ agents, our protocols achieve near-optimal regret $O(\sqrt{MKT\log T})$with $O\left(M\log T\right)$ and $O\left(MK\log M\right)$ communication costrespectively. We also propose two protocols for $d$-dimensional distributedlinear bandits that achieve near-optimal regret with $O(M^{1.5}d^3)$ and$O\left((Md+d\log\log d)\log T\right)$ communication cost respectively. Thecommunication cost can be independent of $T$, or almost linear in $d$.

Quick Read (beta)

loading the full paper ...