Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints

Abstract

We study reinforcement learning (RL) with linear function approximation underthe adaptivity constraint. We consider two popular limited adaptivity models:the batch learning model and the rare policy switch model, and propose twoefficient online RL algorithms for episodic linear Markov decision processes,where the transition probability and the reward function can be represented asa linear function of some known feature mapping. In specific, for the batchlearning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tildeO(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the featuremapping, $H$ is the episode length, $T$ is the number of interactions and $B$is the number of batches. Our result suggests that it suffices to use only$\sqrt{T/dH}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rarepolicy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an$\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\logT$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Ouralgorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al.,2019), yet with a substantially smaller amount of adaptivity. We also establisha lower bound for the batch learning model, which suggests that the dependencyon $B$ in our regret bound is tight.

Quick Read (beta)

loading the full paper ...