Reinforcement Learning under Drift

Abstract

We propose algorithms with state-of-the-art \emph{dynamic regret} bounds for un-discounted reinforcement learning under drifting non-stationarity, where both the reward functions and state transition distributions are allowed to evolve over time. Our main contributions are: 1) A tuned Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence-Widening (\texttt{SWUCRL2-CW}) algorithm, which attains low dynamic regret bounds against the optimal non-stationary policy in various cases. 2) The Bandit-over-Reinforcement Learning (\texttt{BORL}) framework that further permits us to enjoy these dynamic regret bounds in a parameter-free manner.
