Sparse Nonparametric Contextual Bandits

Abstract

This paper studies the problem of simultaneously learning relevant featuresand minimising regret in contextual bandit problems. We introduce and analyse anew class of contextual bandit problems, called sparse nonparametric contextualbandits, in which the expected reward function lies in the linear span of asmall unknown set of features that belongs to a known infinite set of candidatefeatures. We consider two notions of sparsity, for which the set of candidatefeatures is either countable or uncountable. Our contribution is two-fold.First, we provide lower bounds on the minimax regret, which show thatpolynomial dependence on the number of actions is generally unavoidable in thissetting. Second, we show that a variant of the Feel-Good Thompson Samplingalgorithm enjoys regret bounds that match our lower bounds up to logarithmicfactors of the horizon, and have logarithmic dependence on the effective numberof candidate features. When we apply our results to kernelised and neuralcontextual bandits, we find that sparsity always enables better regret bounds,as long as the horizon is large enough relative to the sparsity and the numberof actions.

Quick Read (beta)

loading the full paper ...