Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version

Abstract

This work tackles the problem of robust zero-shot planning in non-stationary stochastic environments. We study Markov Decision Processes (MDPs) evolving over time and consider Model-Based Reinforcement Learning algorithms in this setting. We make two hypotheses: 1) the environment evolves continuously with a bounded evolution rate; 2) a current model is known at each decision epoch but not its evolution. Our contribution can be presented in four points. 1) We define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We introduce the notion of regular evolution by making a hypothesis of Lipschitz continuity on the transition and reward functions w.r.t. time; 2) we consider a planning agent using the current model of the environment but unaware of its future evolution. This leads us to consider a worst-case method where the environment is seen as an adversarial agent; 3) following this approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot Model-Based method similar to Minimax search; 4) we illustrate the benefits brought by RATS empirically and compare its performance with reference Model-Based algorithms.
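
To make the worst-case idea concrete, the following is a minimal one-step sketch and not the RATS algorithm itself. It assumes, purely for illustration, that a bounded evolution rate lets an adversarial environment move at most `budget = lipschitz_p * dt` of probability mass between outcomes of an action; the abstract does not state the distance used to measure model drift, so this budget, the function names, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

Outcome = Tuple[float, float]  # (probability under the current model, value)


def expected_value(outcomes: List[Outcome]) -> float:
    """Expected value under the current (known) model."""
    return sum(p * v for p, v in outcomes)


def worst_case_value(outcomes: List[Outcome], budget: float) -> float:
    """Lowest expected value when an adversary may move up to `budget`
    probability mass between outcomes (illustrative drift model)."""
    lowest = min(v for _, v in outcomes)
    value = expected_value(outcomes)
    remaining = budget
    # The adversary takes mass from the best outcomes first
    # and reassigns it to the worst outcome.
    for p, v in sorted(outcomes, key=lambda o: o[1], reverse=True):
        moved = min(p, remaining)
        value -= moved * (v - lowest)
        remaining -= moved
        if remaining <= 0:
            break
    return value


def choose(actions: Dict[str, List[Outcome]], budget: float) -> str:
    """Pick the action with the best worst-case value (max over actions, min over drift)."""
    return max(actions, key=lambda a: worst_case_value(actions[a], budget))


# Hypothetical toy problem: a deterministic action and a gamble.
ACTIONS = {
    "safe": [(1.0, 3.0)],
    "risky": [(0.9, 10.0), (0.1, -50.0)],
}

lipschitz_p, dt = 0.05, 1.0          # assumed drift rate and look-ahead time
budget = lipschitz_p * dt

print(choose(ACTIONS, 0.0))          # planning as if the model were stationary
print(choose(ACTIONS, budget))       # planning against the worst admissible drift
```

With no drift allowed, the gamble has the higher expected value; once the adversary may shift a small amount of mass toward the bad outcome, the deterministic action becomes the better worst-case choice. A tree search in the spirit described above would apply the same max-min reasoning recursively over several decision epochs.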
