Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Abstract

We study reinforcement learning (RL) with linear function approximation wherethe underlying transition probability kernel of the Markov decision process(MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou etal., 2020) and the learning agent has access to either an integration or asampling oracle of the individual basis kernels. We propose a newBernstein-type concentration inequality for self-normalized martingales forlinear bandit problems with bounded noise. Based on the new inequality, wepropose a new, computationally efficient algorithm with linear functionapproximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixtureMDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of featuremapping, $H$ is the length of the episode and $T$ is the number of interactionswith the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ forthis setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up tologarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithmfor the same family of MDPs under discounting and show that it attains an$\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is thediscount factor. Our upper bound matches the lower bound$\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up tologarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimaxoptimal. To the best of our knowledge, these are the first computationallyefficient, nearly minimax optimal algorithms for RL with linear functionapproximation.

Quick Read (beta)

loading the full paper ...