Collaboratively Learning the Best Option, Using Bounded Memory

Abstract

We consider multi-armed bandit problems in social groups wherein eachindividual has bounded memory and shares the common goal of learning the bestarm/option. We say an individual learns the best option if eventually (as $t\to \infty$) it pulls only the arm with the highest average reward. While thisgoal is provably impossible for an isolated individual, we show that, in socialgroups, this goal can be achieved easily with the aid of social persuasion,i.e., communication. Specifically, we study the learning dynamics wherein anindividual sequentially decides on which arm to pull next based on not only itsprivate reward feedback but also the suggestions provided by randomly chosenpeers. Our learning dynamics are hard to analyze via explicit probabilisticcalculations due to the stochastic dependency induced by social interaction.Instead, we employ the mean-field approximation method from statistical physicsand we show: (1) With probability $\to 1$ as the social group size $N \to \infty $, everyindividual in the social group learns the best option. (2) Over an arbitrary finite time horizon $[0, T]$, with high probability (in$N$), the fraction of individuals that prefer the best option grows to 1exponentially fast as $t$ increases ($t\in [0, T]$). A major innovation of our mean-filed analysis is a simple yet powerfultechnique to deal with absorbing states in the interchange of limits $N \to\infty$ and $t \to \infty $. The mean-field approximation method allows us toapproximate the probabilistic sample paths of our learning dynamics by adeterministic and smooth trajectory that corresponds to the unique solution ofa well-behaved system of ordinary differential equations (ODEs). Such anapproximation is desired because the analysis of a system of ODEs is relativelyeasier than that of the original stochastic system.

Quick Read (beta)

loading the full paper ...