What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

Abstract

An online reinforcement learning algorithm is anytime if it does not need toknow in advance the horizon T of the experiment. A well-known technique toobtain an anytime algorithm from any non-anytime algorithm is the "DoublingTrick". In the context of adversarial or stochastic multi-armed bandits, theperformance of an algorithm is measured by its regret, and we study twofamilies of sequences of growing horizons (geometric and exponential) togeneralize previously known results that certain doubling tricks can be used toconserve certain regret bounds. In a broad setting, we prove that a geometricdoubling trick can be used to conserve (minimax) bounds in $R\_T = O(\sqrt{T})$but cannot conserve (distribution-dependent) bounds in $R\_T = O(\log T)$. Wegive insights as to why exponential doubling tricks may be better, as theyconserve bounds in $R\_T = O(\log T)$, and are close to conserving bounds in$R\_T = O(\sqrt{T})$.

Quick Read (beta)

loading the full paper ...