Abstract
Reducing reinforcement learning to supervised learning is a well-studied andeffective approach that leverages the benefits of compact functionapproximation to deal with large-scale Markov decision processes.Independently, the boosting methodology (e.g. AdaBoost) has proven to beindispensable in designing efficient and accurate classification algorithms bycombining inaccurate rules-of-thumb. In this paper, we take a further step: we reduce reinforcement learning to asequence of weak learning problems. Since weak learners perform only marginallybetter than random guesses, such subroutines constitute a weaker assumptionthan the availability of an accurate supervised learning oracle. We prove thatthe sample complexity and running time bounds of the proposed method do notexplicitly depend on the number of states. While existing results on boosting operate on convex losses, the valuefunction over policies is non-convex. We show how to use a non-convex variantof the Frank-Wolfe method for boosting, that additionally improves upon theknown sample complexity and running time even for reductions to supervisedlearning.