Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism

Abstract

We consider un-discounted reinforcement learning (RL) in Markov decisionprocesses (MDPs) under temporal drifts, ie, both the reward and statetransition distributions are allowed to evolve over time, as long as theirrespective total variations, quantified by suitable metrics, do not exceedcertain variation budgets. This setting captures the endogeneity, exogeneity,uncertainty, and partial feedback in sequential decision-making scenarios, andfinds applications in vehicle remarketing and real-time bidding. We firstdevelop the Sliding Window Upper-Confidence bound for Reinforcement Learningwith Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamicregret bound when the variation budgets are known. In addition, we propose theBandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune theSWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in aparameter-free manner, ie, without knowing the variation budgets. Finally, weconduct numerical experiments to show that our proposed algorithms achievesuperior empirical performance compared to existing algorithms. Notably, the interplay between endogeneity and exogeneity presents a uniquechallenge, absent in existing (stationary and non-stationary) stochastic onlinelearning settings, when we apply the conventional Optimism in Face ofUncertainty principle to design algorithms with provably low dynamic regret forRL in drifting MDPs. We overcome the challenge by a novel confidence wideningtechnique that incorporates additional optimism into our learning algorithms toensure low dynamic regret bounds. To extend our theoretical findings, we applyour framework to inventory control problems, and demonstrate how one canalternatively leverage special structures on the state transition distributionsto bypass the difficulty in exploring time-varying environments.

Quick Read (beta)

loading the full paper ...