Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective

Abstract

In the classical multi-armed bandit problem, instance-dependent algorithmsattain improved performance on "easy" problems with a gap between the best andsecond-best arm. Are similar guarantees possible for contextual bandits? Whilepositive results are known for certain special cases, there is no generaltheory characterizing when and how instance-dependent regret bounds forcontextual bandits can be achieved for rich, general classes of policies. Weintroduce a family of complexity measures that are both sufficient andnecessary to obtain instance-dependent regret bounds. We then introduce neworacle-efficient algorithms which adapt to the gap whenever possible, whilealso attaining the minimax rate in the worst case. Finally, we providestructural results that tie together a number of complexity measures previouslyproposed throughout contextual bandits, reinforcement learning, and activelearning and elucidate their role in determining the optimal instance-dependentregret. In a large-scale empirical evaluation, we find that our approach oftengives superior results for challenging exploration problems. Turning our focus to reinforcement learning with function approximation, wedevelop new oracle-efficient algorithms for reinforcement learning with richobservations that obtain optimal gap-dependent sample complexity.

Quick Read (beta)

loading the full paper ...