Nonstationary Reinforcement Learning with Linear Function Approximation

Abstract

We consider reinforcement learning (RL) in episodic Markov decision processes(MDPs) with linear function approximation under drifting environment.Specifically, both the reward and state transition functions can evolve overtime, as long as their respective total variations, quantified by suitablemetrics, do not exceed certain \textit{variation budgets}. We first develop$\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification ofleast-squares value iteration combined with periodic restart, and establish itsdynamic regret bound when variation budgets are known. We then propose aparameter-free algorithm, \texttt{Ada-LSVI-UCB-Restart}, that works withoutknowing the variation budgets, but with a slightly worse dynamic regret bound.We also derive the first minimax dynamic regret lower bound for nonstationaryMDPs to show that our proposed algorithms are near-optimal. As a byproduct, weestablish a minimax regret lower bound for linear MDPs, which is unsolved by\cite{jin2020provably}. As far as we know, this is the first dynamic regretanalysis in nonstationary reinforcement learning with function approximation.

Quick Read (beta)

loading the full paper ...