Abstract
We study the task of bandit learning, also known as best-arm identification,under the assumption that the true reward function f belongs to a known, butarbitrary, function class F. We seek a general theory of bandit learnability,akin to the PAC framework for classification. Our investigation is guided bythe following two questions: (1) which classes F are learnable, and (2) howthey are learnable. For example, in the case of binary PAC classification,learnability is fully determined by a combinatorial dimension - the VCdimension- and can be attained via a simple algorithmic principle, namely,empirical risk minimization (ERM). In contrast to classical learning-theoreticresults, our findings reveal limitations of learning in structured bandits,offering insights into the boundaries of bandit learnability. First, for thequestion of "which", we show that the paradigm of identifying the learnableclasses via a dimension-like quantity fails for bandit learning. We give asimple proof demonstrating that no combinatorial dimension can characterizebandit learnability, even in finite classes, following a standard definition ofdimension introduced by Ben-David et al. (2019). For the question of "how", weprove a computational hardness result: we construct a reward function class forwhich at most two queries are needed to find the optimal action, yet noalgorithm can do so in polynomial time unless RP=NP. We also prove that thisclass admits efficient algorithms for standard algorithmic operations oftenconsidered in learning theory, such as an ERM. This implies that computationalhardness is in this case inherent to the task of bandit learning. Beyond theseresults, we investigate additional themes such as learning under noise,trade-offs between noise models, and the relationship between query complexityand regret minimization.