Imprecise Multi-Armed Bandits

Abstract

We introduce a novel multi-armed bandit framework, where each arm isassociated with a fixed unknown credal set over the space of outcomes (whichcan be richer than just the reward). The arm-to-credal-set correspondence comesfrom a known class of hypotheses. We then define a notion of regretcorresponding to the lower prevision defined by these credal sets.Equivalently, the setting can be regarded as a two-player zero-sum game, where,on each round, the agent chooses an arm and the adversary chooses thedistribution over outcomes from a set of options associated with this arm. Theregret is defined with respect to the value of game. For certain naturalhypothesis classes, loosely analgous to stochastic linear bandits (which are aspecial case of the resulting setting), we propose an algorithm and prove acorresponding upper bound on regret. We also prove lower bounds on regret forparticular special cases.

Quick Read (beta)

loading the full paper ...