Information Directed Sampling and Bandits with Heteroscedastic Noise

Abstract

In the stochastic bandit problem, the goal is to maximize an unknown functionvia a sequence of noisy evaluations. Typically, the observation noise isassumed to be independent of the evaluation point and to satisfy a tail bounduniformly on the domain; a restrictive assumption for many applications. Inthis work, we consider bandits with heteroscedastic noise, where we explicitlyallow the noise distribution to depend on the evaluation point. We show thatthis leads to new trade-offs for information and regret, which are not takeninto account by existing approaches like upper confidence bound algorithms(UCB) or Thompson Sampling. To address these shortcomings, we introduce afrequentist regret analysis framework, that is similar to the Bayesianframework of Russo and Van Roy (2014), and we prove a new high-probabilityregret bound for general, possibly randomized policies, which depends on aquantity we refer to as regret-information ratio. From this bound, we define afrequentist version of Information Directed Sampling (IDS) to minimize theregret-information ratio over all possible action sampling distributions. Thisfurther relies on concentration inequalities for online least squaresregression in separable Hilbert spaces, which we generalize to the case ofheteroscedastic noise. We then formulate several variants of IDS for linear andreproducing kernel Hilbert space response functions, yielding novel algorithmsfor Bayesian optimization. We also prove frequentist regret bounds, which inthe homoscedastic case recover known bounds for UCB, but can be much betterwhen the noise is heteroscedastic. Empirically, we demonstrate in a linearsetting with heteroscedastic noise, that some of our methods can outperform UCBand Thompson Sampling, while staying competitive when the noise ishomoscedastic.

Quick Read (beta)

loading the full paper ...