Randomized Exploration for Non-Stationary Stochastic Linear Bandits

Abstract

We investigate two perturbation approaches to overcome conservatism thatoptimism based algorithms chronically suffer from in practice. The firstapproach replaces optimism with a simple randomization when using confidencesets. The second one adds random perturbations to its current estimate beforemaximizing the expected reward. For non-stationary linear bandits, where eachaction is associated with a $d$-dimensional feature and the unknown parameteris time-varying with total variation $B_T$, we propose two randomizedalgorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted LinearThompson Sampling (D-LinTS) via the two perturbation approaches. We highlightthe statistical optimality versus computational efficiency trade-off betweenthem in that the former asymptotically achieves the optimal dynamic regret$\tilde{\mathcal{O}}( d ^{2/3}B_T^{1/3} T^{2/3})$, but the latter isoracle-efficient with an extra logarithmic gap in number of arms compared tominimax-optimal dynamic regret. In a simulation study, both empirically showthe outstanding performance in tackling conservatism issue that DiscountedLinUCB (D-LinUCB) struggles with.

Quick Read (beta)

loading the full paper ...