Efficient, Low-Regret, Online Reinforcement Learning for Linear MDPs

  • 2024-11-16 22:51:52
  • Philips George John, Arnab Bhattacharyya, Silviu Maniu, Dimitrios Myrisiotis, Zhenan Wu
  • 0

Abstract

Reinforcement learning algorithms are usually stated without theoreticalguarantees regarding their performance. Recently, Jin, Yang, Wang, and Jordan(COLT 2020) showed a polynomial-time reinforcement learning algorithm (namely,LSVI-UCB) for the setting of linear Markov decision processes, and providedtheoretical guarantees regarding its running time and regret. In real-worldscenarios, however, the space usage of this algorithm can be prohibitive due toa utilized linear regression step. We propose and analyze two modifications ofLSVI-UCB, which alternate periods of learning and not-learning, to reduce spaceand time usage while maintaining sublinear regret. We show experimentally, onsynthetic data and real-world benchmarks, that our algorithms achieve low spaceusage and running time, while not significantly sacrificing regret.

 

Quick Read (beta)

loading the full paper ...