Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

  • 2021-01-07 18:57:11
  • Dongruo Zhou, Quanquan Gu, Csaba Szepesvari
  • 0

Abstract

We study reinforcement learning (RL) with linear function approximation wherethe underlying transition probability kernel of the Markov decision process(MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou etal., 2020) and the learning agent has access to either an integration or asampling oracle of the individual basis kernels. We propose a newBernstein-type concentration inequality for self-normalized martingales forlinear bandit problems with bounded noise. Based on the new inequality, wepropose a new, computationally efficient algorithm with linear functionapproximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixtureMDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of featuremapping, $H$ is the length of the episode and $T$ is the number of interactionswith the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ forthis setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up tologarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithmfor the same family of MDPs under discounting and show that it attains an$\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is thediscount factor. Our upper bound matches the lower bound$\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up tologarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimaxoptimal. To the best of our knowledge, these are the first computationallyefficient, nearly minimax optimal algorithms for RL with linear functionapproximation.

 

Quick Read (beta)

loading the full paper ...