Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism

  • 2020-06-24 15:40:21
  • Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
  • 0


We consider un-discounted reinforcement learning (RL) in Markov decisionprocesses (MDPs) under drifting non-stationarity, i.e., both the reward andstate transition distributions are allowed to evolve over time, as long astheir respective total variations, quantified by suitable metrics, do notexceed certain variation budgets. We first develop the Sliding WindowUpper-Confidence bound for Reinforcement Learning with Confidence Widening(SWUCRL2-CW) algorithm, and establish its dynamic regret bound when thevariation budgets are known. In addition, we propose theBandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune theSWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in aparameter-free manner, i.e., without knowing the variation budgets. Notably,learning non-stationary MDPs via the conventional optimistic explorationtechnique presents a unique challenge absent in existing (non-stationary)bandit learning settings. We overcome the challenge by a novel confidencewidening technique that incorporates additional optimism.


Quick Read (beta)

loading the full paper ...