New Insights into Bootstrapping for Bandits

Abstract

We investigate the use of bootstrapping in the bandit setting. We first showthat the commonly used non-parametric bootstrapping (NPB) procedure can beprovably inefficient and establish a near-linear lower bound on the regretincurred by it under the bandit model with Bernoulli rewards. We show that NPBwith an appropriate amount of forced exploration can result in sub-linearalbeit sub-optimal regret. As an alternative to NPB, we propose a weightedbootstrapping (WB) procedure. For Bernoulli rewards, WB with multiplicativeexponential weights is mathematically equivalent to Thompson sampling (TS) andresults in near-optimal regret bounds. Similarly, in the bandit setting withGaussian rewards, we show that WB with additive Gaussian weights achievesnear-optimal regret. Beyond these special cases, we show that WB leads tobetter empirical performance than TS for several reward distributions boundedon $[0,1]$. For the contextual bandit setting, we give practical guidelinesthat make bootstrapping simple and efficient to implement and result in goodempirical performance on real-world datasets.

Quick Read (beta)

loading the full paper ...