Abstract
We consider a finite-armed structured bandit problem in which mean rewards ofdifferent arms are functions of a common hidden parameter $\theta^*$. Thisproblem setting subsumes several previously studied frameworks that assumelinear or invertible reward functions. Our approach exploits the structure inthe problem gradually and pulls only some of the sub-optimal arms O(log T)times, while other sub-optimal arms, termed as non-competitive, are pulled onlyO(1) times. Put differently, the set of non-competitive arms, which depend onthe hidden parameter $\theta^*$, are stopped being pulled after some finitetime. We show how this approach can be transformed into a general algorithmthat can be coupled with any classic bandit strategy (UCB, Thompson Sampling,KL-UCB etc.), allowing them to be used in the structured bandit setting withsubstantial reductions in regret. In particular, we get bounded regret inseveral cases of practical interest where all sub-optimal arms arenon-competitive. We also demonstrate the superiority of our algorithms overexisting methods (including UCB-S) via experiments on the Movielens dataset.