Second Order Bounds for Contextual Bandits with Function Approximation

Abstract

Many works have developed algorithms no-regret algorithms for contextualbandits with function approximation, where the mean rewards over context-actionpairs belongs to a function class. Although there are many approaches to thisproblem, one that has gained in importance is the use of algorithms based onthe optimism principle such as optimistic least squares. It can be shown theregret of this algorithm scales as square root of the product of the eluderdimension (a statistical measure of the complexity of the function class), thelogarithm of the function class size and the time horizon. Unfortunately, evenif the variance of the measurement noise of the rewards at each time ischanging and is very small, the regret of the optimistic least squaresalgorithm scales with square root of the time horizon. In this work we are thefirst to develop algorithms that satisfy regret bounds of scaling not with thesquare root of the time horizon, but the square root of the sum of themeasurement variances in the setting of contextual bandits with functionapproximation when the variances are unknown. These bounds generalize existingtechniques for deriving second order bounds in contextual linear problems.

Quick Read (beta)

loading the full paper ...