Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

Abstract

We introduce SCAL, an algorithm designed to perform efficientexploration-exploitation in any unknown weakly-communicating Markov DecisionProcess (MDP) for which an upper bound c on the span of the optimal biasfunction is known. For an MDP with S states, A actions and Gamma <= S possiblenext states, we prove a regret bound of O(c\sqrt{Gamma SAT}), whichsignificantly improves over existing algorithms (e.g., UCRL and PSRL), whoseregret scales linearly with the MDP diameter D. In fact, the optimal bias spanis finite and often much smaller than D (e.g., D=infinity in non-communicatingMDPs). A similar result was originally derived by Bartlett and Tewari (2009)for REGAL.C, for which no tractable algorithm is available. In this paper, werelax the optimization problem at the core of REGAL.C, we carefully analyze itsproperties, and we provide the first computationally efficient algorithm tosolve it. Finally, we report numerical simulations supporting our theoreticalfindings and showing how SCAL significantly outperforms UCRL in MDPs with largediameter and small span.

Quick Read (beta)

loading the full paper ...