Abstract
We propose novel classical and quantum online algorithms for learningfinite-horizon and infinite-horizon average-reward Markov Decision Processes(MDPs). Our algorithms are based on a hybrid exploration-generativereinforcement learning (RL) model wherein the agent can, from time to time,freely interact with the environment in a generative sampling fashion, i.e., byhaving access to a "simulator". By employing known classical and new quantumalgorithms for approximating optimal policies under a generative model withinour learning algorithms, we show that it is possible to avoid several paradigmsfrom RL like "optimism in the face of uncertainty" and "posterior sampling" andinstead compute and use optimal policies directly, which yields better regretbounds compared to previous works. For finite-horizon MDPs, our quantumalgorithms obtain regret bounds which only depend logarithmically on the numberof time steps $T$, thus breaking the $O(\sqrt{T})$ classical barrier. Thismatches the time dependence of the prior quantum works of Ganguly et al.(arXiv'23) and Zhong et al. (ICML'24), but with improved dependence on otherparameters like state space size $S$ and action space size $A$. Forinfinite-horizon MDPs, our classical and quantum bounds still maintain the$O(\sqrt{T})$ dependence but with better $S$ and $A$ factors. Nonetheless, wepropose a novel measure of regret for infinite-horizon MDPs with respect towhich our quantum algorithms have $\operatorname{poly}\log{T}$ regret,exponentially better compared to classical algorithms. Finally, we generaliseall of our results to compact state spaces.