Abstract
This work tackles the problem of robust zeroshot planning in nonstationarystochastic environments. We study Markov Decision Processes (MDPs) evolvingover time and consider ModelBased Reinforcement Learning algorithms in thissetting. We make two hypotheses: 1) the environment evolves continuously with abounded evolution rate; 2) a current model is known at each decision epoch butnot its evolution. Our contribution can be presented in four points. 1) wedefine a specific class of MDPs that we call NonStationary MDPs (NSMDPs). Weintroduce the notion of regular evolution by making an hypothesis ofLipschitzContinuity on the transition and reward functions w.r.t. time; 2) weconsider a planning agent using the current model of the environment butunaware of its future evolution. This leads us to consider a worstcase methodwhere the environment is seen as an adversarial agent; 3) following thisapproach, we propose the RiskAverse TreeSearch (RATS) algorithm, a zeroshotModelBased method similar to Minimax search; 4) we illustrate the benefitsbrought by RATS empirically and compare its performance with referenceModelBased algorithms.
Quick Read (beta)
NonStationary Markov Decision Processes
a WorstCase Approach using ModelBased Reinforcement Learning
Abstract
This work tackles the problem of robust planning in nonstationary stochastic environments. We study Markov Decision Processes (MDPs) evolving over time and consider ModelBased Reinforcement Learning algorithms in this setting. We make two hypotheses: 1) the environment evolves continuously with a bounded evolution rate; 2) a current model is known at each decision epoch but not its evolution. Our contribution can be presented in four points. 1) we define a specific class of MDPs that we call NonStationary MDPs (NSMDPs). We introduce the notion of regular evolution by making an hypothesis of LipschitzContinuity on the transition and reward functions w.r.t. time; 2) we consider a planning agent using the current model of the environment but unaware of its future evolution. This leads us to consider a worstcase method where the environment is seen as an adversarial agent; 3) following this approach, we propose the RiskAverse TreeSearch (RATS) algorithm, a ModelBased method similar to minimax search; 4) we illustrate the benefits brought by RATS empirically and compare its performance with reference ModelBased algorithms.
NonStationary Markov Decision Processes
a WorstCase Approach using ModelBased Reinforcement Learning
Erwan Lecarpentier Université de Toulouse ONERA  The French Aerospace Lab [email protected] Emmanuel Rachelson Université de Toulouse ISAESUPAERO [email protected]
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
One of the hot topics of modern Artificial Intelligence (AI) is the ability for an agent to adapt its behavior to changing tasks. In the literature, this problem is often linked to the setting of Lifelong Reinforcement Learning (LRL) (silver2013lifelong; abel2018state; abel2018policy) and learning in nonstationary environments (choi1999hidden; jaulmes2005learning; hadoux2015markovian). In LRL, the tasks presented to the agent change sequentially at discrete transition epochs (silver2013lifelong). Similarly, the nonstationary environments considered in the literature often evolve abruptly (hadoux2015markovian; hadoux2014sequential; doya2002multiple; da2006dealing; choi1999hidden; choi2000hidden; choi2001solving; campo1991state; wiering2001reinforcement). In this paper, we investigate environments continuously changing over time that we call NonStationary Markov Decision Processes (NSMDPs). In this setting, it is realistic to bound the evolution rate of the environment using a Lipschitz Continuity (LC) assumption.
Modelbased Reinforcement Learning approaches (sutton1998reinforcement) benefit from the knowledge of a model allowing them to reach impressive performances, as demonstrated by the Monte Carlo Tree Search (MCTS) algorithm (silver2016mastering). In this matter, the necessity to have access to a model is a great concern of AI (asadi2018lipschitz; jaulmes2005learning; doya2002multiple; da2006dealing). In the context of NSMDPs, we assume that an agent is provided with a snapshot model when its action is computed. By this, we mean that it only has access to the current model of the environment but not its future evolution, as if it took a photograph but would be unable to predict how it is going to evolve. This hypothesis is realistic, because many environments have a tractable state while their future evolution is hard to predict (da2006dealing; wiering2001reinforcement). In order to solve LCNSMDPs, we propose a method that considers the worstcase possible evolution of the model and performs planning w.r.t. this model. This is equivalent to considering Nature as an adversarial agent. The paper is organized as follows: first we describe the NSMDP setting and the regularity assumption (Section 2); then we outline related works (Section 3); follows the explanation of the worstcase approach proposed in this paper (Section 4); then we describe an algorithm reflecting this approach (Section 5); finally we illustrate its behavior empirically (Section 6).
2 NonStationary Markov Decision Processes
To define a NonStationary Markov Decision Process (NSMDP), we revert to the initial MDP model introduced by puterman2014markov, where the transition and reward functions depend on time.
Definition 1.
NSMDP. An NSMDP is an MDP whose transition and reward functions depend on the decision epoch. It is defined by a 5tuple $\mathrm{\{}\mathrm{S}\mathrm{,}\mathrm{T}\mathrm{,}\mathrm{A}\mathrm{,}{\mathrm{\left(}{\mathrm{p}}_{\mathrm{t}}\mathrm{\right)}}_{\mathrm{t}\mathrm{\in}\mathrm{T}}\mathrm{,}{\mathrm{\left(}{\mathrm{r}}_{\mathrm{t}}\mathrm{\right)}}_{\mathrm{t}\mathrm{\in}\mathrm{T}}\mathrm{\}}$ where $\mathrm{S}$ is a state space; $\mathrm{T}\mathrm{\equiv}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{,}\mathrm{\dots}\mathrm{,}\mathrm{N}\mathrm{\}}$ is the set of decision epochs with $\mathrm{N}\mathrm{\le}\mathrm{+}\mathrm{\infty}$; $\mathrm{A}$ is an action space; ${\mathrm{p}}_{\mathrm{t}}\mathrm{}\mathrm{(}{\mathrm{s}}^{\mathrm{\prime}}\mathrm{\mid}\mathrm{s}\mathrm{,}\mathrm{a}\mathrm{)}$ is the probability of reaching state ${\mathrm{s}}^{\mathrm{\prime}}$ while performing action $\mathrm{a}$ at decision epoch $\mathrm{t}$ in state $\mathrm{s}$; ${\mathrm{r}}_{\mathrm{t}}\mathrm{}\mathrm{(}\mathrm{s}\mathrm{,}\mathrm{a}\mathrm{,}{\mathrm{s}}^{\mathrm{\prime}}\mathrm{)}$ is the scalar reward associated to the transition from $\mathrm{s}$ to ${\mathrm{s}}^{\mathrm{\prime}}$ with action $\mathrm{a}$ at decision epoch $\mathrm{t}$.
This definition can be viewed as that of a stationary MDP whose state space has been enhanced with time. While this addition is trivial in episodic tasks where an agent is given the opportunity to interact several times with the same MDP, it is different when the experience is unique. Indeed, no exploration is allowed along the temporal axis. Within a stationary, infinitehorizon MDP with a discounted criterion, it is proven that there exists a Markovian deterministic stationary policy (puterman2014markov). It is not the case within NSMDPs where the optimal policy is nonstationary in the most general case. Additionally, we define the expected reward received when taking action $a$ at state $s$ and decision epoch $t$ as ${R}_{t}(s,a)={\mathbb{E}}_{{s}^{\prime}\sim {p}_{t}(\cdot \mid s,a)}\left[{r}_{t}(s,a,{s}^{\prime})\right]$. Without loss of generality, we assume the reward function to be bounded between $1$ and $1$. In this paper, we consider discrete time decision processes with constant transition durations, which imply deterministic decision times in Definition 1. This assumption is mild since many discrete time sequential decision problems follow that assumption. A nonstationary policy $\pi $ is a sequence of decision rules ${\pi}_{t}$ which map states to actions (or distributions over actions). For a stochastic nonstationary policy ${\pi}_{t}(a\mid s)$, the value of a state $s$ at decision epoch $t$ within an infinite horizon NSMDP is defined, with $\gamma \in [0,1)$ a discount factor, by:
$${V}_{t}^{\pi}(s)=\mathbb{E}\left[\sum _{i=t}^{\mathrm{\infty}}{\gamma}^{it}{R}_{i}({s}_{i},{a}_{i})\begin{array}{cc}& {s}_{t}=s,{a}_{i}\sim {\pi}_{i}(\cdot \mid {s}_{i}),{s}_{i+1}\sim {p}_{i}(\cdot \mid {s}_{i},{a}_{i})\hfill \end{array}\right],$$ 
The definition of the stateaction value function ${Q}_{t}^{\pi}$ for $\pi $ at decision epoch $t$ is straightforward:
$${Q}_{t}^{\pi}(s,a)={R}_{t}(s,a)+\gamma \underset{{s}^{\prime}\sim {p}_{t}(\cdot \mid s,a)}{\mathbb{E}}\left[{V}_{t+1}^{\pi}({s}^{\prime})\right].$$ 
Overall, we defined an NSMDP as an MDP where we stress out the distinction between state, time, and decision epoch due to the inability for an agent to explore the temporal axis at will. This distinction is particularly relevant for nonepisodic tasks, i.e. when there is no possibility to reexperience the same MDP starting from a prior date.
The regularity hypothesis. Many realworld problems can be modeled as an NSMDP. For instance, the problem of path planning for a glider immersed in a nonstationary atmosphere (chung2015learning; lecarpentier2017empirical), or that of vehicle routing in dynamic traffic congestion. Realistically, we consider that the expected reward and transition functions do not evolve arbitrarily fast over time. Conversely, if such an assumption was not made, a chaotic evolution of the NSMDP would be allowed which is both unrealistic and hard to solve. Hence, we assume that changes occur slowly over time. Mathematically, we formalize this hypothesis by bounding the evolution rate of the transition and expected reward functions, using the notion of Lipschitz Continuity (LC).
Definition 2.
Lipschitz Continuity. Let $\mathrm{(}\mathrm{X}\mathrm{,}{\mathrm{d}}_{\mathrm{X}}\mathrm{)}$ and $\mathrm{(}\mathrm{Y}\mathrm{,}{\mathrm{d}}_{\mathrm{Y}}\mathrm{)}$ be two metric spaces and $\mathrm{f}\mathrm{:}\mathrm{X}\mathrm{\to}\mathrm{Y}$, $\mathrm{f}$ is $\mathrm{L}$Lipschitz Continuous ($\mathrm{L}$LC) with $\mathrm{L}\mathrm{\in}{\mathrm{R}}^{\mathrm{+}}$ iff ${\mathrm{d}}_{\mathrm{Y}}\mathrm{}\mathrm{(}\mathrm{f}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{)}\mathrm{,}\mathrm{f}\mathrm{}\mathrm{(}\widehat{\mathrm{x}}\mathrm{)}\mathrm{)}\mathrm{\le}\mathrm{L}\mathrm{}{\mathrm{d}}_{\mathrm{X}}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{,}\widehat{\mathrm{x}}\mathrm{)}\mathrm{,}\mathrm{\forall}\mathrm{(}\mathrm{x}\mathrm{,}\widehat{\mathrm{x}}\mathrm{)}\mathrm{\in}{\mathrm{X}}^{\mathrm{2}}$. $\mathrm{L}$ is called a Lipschitz constant of the function $\mathrm{f}$.
We apply this hypothesis to the transition and reward functions of an NSMDP so that those functions are LC w.r.t. time. For the transition function, this leads to the consideration of a metric between probability density functions. For that purpose, we use the 1Wasserstein distance (villani2008optimal).
Definition 3.
1Wasserstein distance. Let $\mathrm{(}\mathrm{X}\mathrm{,}{\mathrm{d}}_{\mathrm{X}}\mathrm{)}$ be a Polish metric space, $\mathrm{\mu}\mathrm{,}\mathrm{\nu}$ any probability measures on $\mathrm{X}$, $\mathrm{\Pi}\mathrm{}\mathrm{(}\mathrm{\mu}\mathrm{,}\mathrm{\nu}\mathrm{)}$ the set of joint distributions on $\mathrm{X}\mathrm{\times}\mathrm{X}$ with marginals $\mathrm{\mu}$ and $\mathrm{\nu}$. The 1Wasserstein distance between $\mathrm{\mu}$ and $\mathrm{\nu}$ is ${\mathrm{W}}_{\mathrm{1}}\mathrm{}\mathrm{(}\mathrm{\mu}\mathrm{,}\mathrm{\nu}\mathrm{)}\mathrm{=}{\mathrm{inf}}_{\mathrm{\pi}\mathrm{\in}\mathrm{\Pi}\mathrm{}\mathrm{(}\mathrm{\mu}\mathrm{,}\mathrm{\nu}\mathrm{)}}\mathrm{}{\mathrm{\int}}_{\mathrm{X}\mathrm{\times}\mathrm{X}}{\mathrm{d}}_{\mathrm{X}}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{y}\mathrm{)}\mathrm{}\mathrm{d}\mathrm{\pi}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{y}\mathrm{)}$.
The choice of the Wasserstein distance is motivated by the fact that it quantifies the distance between two distributions in a physical manner, respectful of the topology of the measured space (dabney2018distributional; asadi2018lipschitz). First, it is sensitive to the difference between the supports of the distributions. Comparatively, the KullbackLeibler divergence between distributions with disjoint supports is infinite. Secondly, if one consider two regions of the support where two distributions differ, the Wasserstein distance is sensitive to the distance between the elements of those regions. Comparatively, the totalvariation metric is the same regardless of this distance.
Definition 4.
$({L}_{p},{L}_{r})$LCNSMDP. An $\mathrm{(}{\mathrm{L}}_{\mathrm{p}}\mathrm{,}{\mathrm{L}}_{\mathrm{r}}\mathrm{)}$LCNSMDP is an NSMDP whose transition and reward functions are respectively ${\mathrm{L}}_{\mathrm{p}}$LC and ${\mathrm{L}}_{\mathrm{r}}$LC w.r.t. time, i.e., $\mathrm{\forall}\mathrm{(}\mathrm{t}\mathrm{,}\widehat{\mathrm{t}}\mathrm{,}\mathrm{s}\mathrm{,}{\mathrm{s}}^{\mathrm{\prime}}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{\in}{\mathrm{T}}^{\mathrm{2}}\mathrm{\times}{\mathrm{S}}^{\mathrm{2}}\mathrm{\times}\mathrm{A}$,
$${W}_{1}({p}_{t}(\cdot \mid s,a),{p}_{\widehat{t}}(\cdot \mid s,a))\le {L}_{p}t\widehat{t}\mathit{\hspace{1em}}\text{\mathit{a}\mathit{n}\mathit{d}}\mathit{\hspace{1em}}{r}_{t}(s,a,{s}^{\prime}){r}_{\widehat{t}}(s,a,{s}^{\prime})\le {L}_{r}t\widehat{t}.$$ 
One should remark that the LC property should be defined with respect to actual decision times and not decision epoch indexes for the sake of realism. In the present case, both have the same value, and we choose to keep this convention for clarity. Our results however extend easily to the case where indexes and times do not coincide. From now on, we consider $({L}_{p},{L}_{r})$LCNSMDPs, making Lipschitz Continuity our regularity property. Notice that $R$ is defined as a convex combination of $r$ by the probability measure $p$. As a result, the notion of Lipschitz Continuity of $R$ is strongly related to that of $r$ and $p$ as showed by Property 1. All the proofs of the paper can be found in the Appendix.
Property 1.
Given an $\mathrm{(}{L}_{p}\mathrm{,}{L}_{r}\mathrm{)}$LCNSMDP, the expected reward function ${R}_{t}\mathrm{:}s\mathrm{,}a\mathrm{\mapsto}{\mathrm{E}}_{{s}^{\mathrm{\prime}}\mathrm{\sim}{p}_{t}\mathrm{(}\mathrm{\cdot}\mathrm{\mid}s\mathrm{,}a\mathrm{)}}\mathit{}\mathrm{\left\{}{r}_{t}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{,}{s}^{\mathrm{\prime}}\mathrm{)}\mathrm{\right\}}$ is ${L}_{R}$LC with ${L}_{R}\mathrm{=}{L}_{r}\mathrm{+}{L}_{p}$.
This result shows $R$’s evolution rate is conditioned by the evolution rates of $r$ and $p$. It allows to work either with the reward function $r$ or its expectation $R$, benefiting from the same LC property.
3 Related work
iyengar2005robust introduced the framework of robust MDPs, where the transition function is allowed to evolve within a set of functions due to uncertainty. This differs from our work in two fundamental aspects: 1) we consider uncertainty in the reward model as well; 2) we use a stronger Lipschitz formulation on the set of possible transition and reward functions, this last point being motivated by its relevance to the nonstationary setting. szita2002varepsilon also consider the robust MDP setting and adopt a different constraint hypothesis on the set of possible functions than our LC assumption. They control the total variation distance of transition functions from subsequent decision epochs by a scalar value. Those slowly changing environments allow modelfree RL algorithms such as QLearning to find near optimal policies. lim2013reinforcement consider learning in robust MDPs where the model evolves in an adversarial manner for a subset of $\mathcal{S}\times \mathcal{A}$. In that setting, they propose to learn to what extent the adversary can modify the model and to deduce a behavior close to the minimax policy. even2009online studied the case of nonstationary reward functions with fixed transition models. No assumption is made on the set of possible functions and they propose an algorithm achieving sublinear regret w.r.t. the best stationary policy. dick2014online viewed a similar setting from the perspective of online linear optimization. csaji2008value studied the NSMDP setting with an assumption of reward and transition functions varying in a neighborhood of a reference rewardtransition function pair. Finally, abbasi2013online address the adversarial NSMDP setting with a mixing assumption constraint instead of the LC assumption we make.
Nonstationary environments also have been studied through the framework of Hidden Mode MDPs (HMMDP) introduced by choi1999hidden. This is a special class of Partially Observable MDPs (POMDPs) (kaelbling1998planning) where a hidden mode indexes a latent stationary MDP within which the agent evolves. Similarly to the context of LRL, the agent experiences a series of different MDPs over time. In this setting, choi1999hidden; choi2000hidden proposed methods to learn the different models of the latent stationary MDPs. doya2002multiple built a modular architecture switching between models and policies when a change is detected. Similarly, wiering2001reinforcement; da2006dealing; hadoux2014sequential proposed a method tracking the switching occurrence and replanning if needed. Overall, as in LRL, the HMMDP setting considers abrupt evolution of the transition and reward functions whereas we consider a continuous one. Other settings have been considered, as by jaulmes2005learning, who do not make particular hypothesis on the evolution of the NSMDP. They build a learning algorithm for POMDPs solving, weighting recently experienced transitions more than older ones to account for the time dependency.
To plan robustly within an NSMDP, our approach consists in exploiting the slow LC evolution of the environment. Utilizing Lipschitz continuity to infer bounds on a function is common in the RL, bandit and optimization communities (kleinberg2008multi; rachelson2010locality; pirotta2015policy; pazis2013pac; munos2014bandits). We implement this approach with a minimaxlike algorithm (fudenberg1991game), where the environment is seen as an adversarial agent.
4 Worstcase approach
We consider finding an optimal policy within an LCNSMDP under the nonepisodic task hypothesis. The latter prevents us from learning from previous experience data since they become outdated with time and no information samples have been collected yet for future time steps. An alternative is to use modelbased RL algorithms such as MCTS. For a current state ${s}_{0}$, such algorithms focus on finding the optimal action ${a}_{0}^{*}$ by using a generative model. This action is then undertaken and the operation repeated at the next state. However, using the true NSMDP model for this purpose is an unrealistic hypothesis, since this model is generally unknown. We assume the agent does not have access to the true NSMDP model; instead, we introduce the notion of snapshot model.Intuitively, the snapshot associated to time ${t}_{0}$ is a temporal slice of the NSMDP at ${t}_{0}$.
Definition 5.
Snapshot of an NSMDP. The snapshot of an NSMDP $\mathrm{\{}\mathrm{S}\mathrm{,}\mathrm{T}\mathrm{,}\mathrm{A}\mathrm{,}{\mathrm{\left(}{\mathrm{p}}_{\mathrm{t}}\mathrm{\right)}}_{\mathrm{t}\mathrm{\in}\mathrm{T}}\mathrm{,}{\mathrm{\left(}{\mathrm{r}}_{\mathrm{t}}\mathrm{\right)}}_{\mathrm{t}\mathrm{\in}\mathrm{T}}\mathrm{\}}$ at decision epoch ${\mathrm{t}}_{\mathrm{0}}$, denoted by MDP${}_{{\mathrm{t}}_{\mathrm{0}}}$, is the stationary MDP defined by the 4tuple $\mathrm{\{}\mathrm{S}\mathrm{,}\mathrm{A}\mathrm{,}{\mathrm{p}}_{{\mathrm{t}}_{\mathrm{0}}}\mathrm{,}{\mathrm{r}}_{{\mathrm{t}}_{\mathrm{0}}}\mathrm{\}}$ where ${\mathrm{p}}_{{\mathrm{t}}_{\mathrm{0}}}\mathrm{}\mathrm{(}{\mathrm{s}}^{\mathrm{\prime}}\mathrm{\mid}\mathrm{s}\mathrm{,}\mathrm{a}\mathrm{)}$ and ${\mathrm{r}}_{{\mathrm{t}}_{\mathrm{0}}}\mathrm{}\mathrm{(}\mathrm{s}\mathrm{,}\mathrm{a}\mathrm{,}{\mathrm{s}}^{\mathrm{\prime}}\mathrm{)}$ are the transition and reward functions of the NSMDP at ${\mathrm{t}}_{\mathrm{0}}$.
Similarly to the NSMDP, this definition induces the existence of the snapshot expected reward ${R}_{{t}_{0}}$ defined by ${R}_{{t}_{0}}:s,a\mapsto {\mathbb{E}}_{{s}^{\prime}\sim {p}_{{t}_{0}}(\cdot \mid s,a)}\left\{{r}_{{t}_{0}}(s,a,{s}^{\prime})\right\}$. Notice that the snapshot MDP${}_{{t}_{0}}$ is stationary and coincides with the NSMDP only at ${t}_{0}$. Particularly, one can generate a trajectory $\{{s}_{0},{r}_{0},\mathrm{\cdots},{s}_{k}\}$ within an NSMDP using the sequence of snapshots $\{{\text{MDP}}_{{t}_{0}},\mathrm{\cdots},{\text{MDP}}_{{t}_{0}+k1}\}$ as a model. Overall, the hypothesis of using snapshot models amounts to considering a planning agent only able to get the current stationary model of the environment. In realworld problems, predictions often are uncertain or hard to perform e.g. in the thermal soaring problem of a glider.
We consider a generic planning agent at ${s}_{0},{t}_{0}$, using MDP${}_{{t}_{0}}$ as a model of the NSMDP. By planning, we mean conducting a lookahead search within the possible trajectories starting from ${s}_{0},{t}_{0}$ given a model of the environment. The search allows in turn to identify an optimal action w.r.t. the model. This action is then undertaken and the agent jumps to the next state where the operation is repeated. The consequence of planning with MDP${}_{{t}_{0}}$ is that the estimated value of an $s,t$ pair is the value of the optimal policy of ${\text{MDP}}_{{t}_{0}}$, written ${V}_{{\text{MDP}}_{{t}_{0}}}^{*}(s)$. The true optimal value of $s$ at $t$ within the NSMDP does not match this estimate because of the nonstationarity. The intuition we develop is that, given the slow evolution rate of the environment, for a state $s$ seen at a future decision epoch during the search, we can predict a scope into which the transition and reward functions at $s$ lie.
Property 2.
Set of admissible snapshot models. Consider an $\mathrm{(}{\mathrm{L}}_{\mathrm{p}}\mathrm{,}{\mathrm{L}}_{\mathrm{r}}\mathrm{)}$LCNSMDP, $\mathrm{s}\mathrm{,}\mathrm{t}\mathrm{,}\mathrm{a}\mathrm{\in}\mathrm{S}\mathrm{\times}\mathrm{T}\mathrm{\times}\mathrm{A}$. The transition and expected reward functions $\mathrm{(}{\mathrm{p}}_{\mathrm{t}}\mathrm{,}{\mathrm{R}}_{\mathrm{t}}\mathrm{)}$ of the snapshot ${\mathrm{\text{MDP}}}_{\mathrm{t}}$ respect
$$({p}_{t},{R}_{t})\in {\mathrm{\Delta}}_{t}\u2254{\mathcal{B}}_{{W}_{1}}({p}_{t1}(\cdot \mid s,a),{L}_{p})\times {\mathcal{B}}_{\cdot }({R}_{t1}(s,a),{L}_{R})$$ 
where ${L}_{R}\mathrm{=}{L}_{p}\mathrm{+}{L}_{r}$ and ${\mathrm{B}}_{d}\mathit{}\mathrm{(}c\mathrm{,}r\mathrm{)}$ denotes the ball of centre $c$, defined with metric $d$ and radius $r$.
For a future prediction at $s,t$, we consider the question of using a better model than ${p}_{{t}_{0}},{R}_{{t}_{0}}$. The underlying evolution of the NSMDP being unknown, a desirable feature would be to use a model leading to a policy that is robust to every possible evolution. To that end, we propose to use the snapshots corresponding to the worst possible evolution scenario under the constraints of Property 2. We claim that such a practice is an efficient way to 1) ensure robust performance to all possible evolutions of the NSMDP and 2) avoid catastrophic terminal states. Practically, this boils down to using a different value estimate for $s$ at $t$ than ${V}_{{\text{MDP}}_{{t}_{0}}}^{*}(s)$ which provided no robustness guarantees.
Given a policy $\pi ={\left({\pi}_{t}\right)}_{t\in \mathcal{T}}$ and a decision epoch $t$, a worstcase NSMDP corresponds to a sequence of transition and reward models minimizing the expected value of applying $\pi $ in any pair $(s,t)$, while remaining within the bounds of Property 2. We write ${\overline{V}}_{t}^{\pi}(s)$ this value for $s$ at decision epoch $t$.
$${\overline{V}}_{t}^{\pi}(s)\u2254\underset{\begin{array}{c}({p}_{i},{R}_{i})\in {\mathrm{\Delta}}_{i},\forall i\in \mathcal{T}\end{array}}{\mathrm{min}}\mathbb{E}\left[\sum _{i=t}^{\mathrm{\infty}}{\gamma}^{it}{R}_{i}({s}_{i},{a}_{i})\begin{array}{cc}& {s}_{t}=s\hfill \\ & {a}_{i}\sim {\pi}_{i}(\cdot \mid {s}_{i}),{s}_{i+1}\sim {p}_{i}(\cdot \mid {s}_{i},{a}_{i})\hfill \end{array}\right]$$  (1) 
Intuitively, the worstcase NSMDP is a model of a nonstationary environment leading to the poorest possible performance for $\pi $, while being an admissible evolution of MDP${}_{t}$. Let us define ${\overline{Q}}_{t}^{\pi}(s,a)$ as the worstcase $Q$value for the pair $(s,a)$ at decision epoch $t$:
$${\overline{Q}}_{t}^{\pi}(s,a)\u2254\underset{\begin{array}{c}(p,R)\in {\mathrm{\Delta}}_{t}\end{array}}{\mathrm{min}}\underset{{s}^{\prime}\sim p}{\mathbb{E}}\left[R(s,a)+\gamma {\overline{V}}_{t+1}^{\pi}({s}^{\prime})\right].$$  (2) 
5 RiskAverse TreeSearch algorithm
The algorithm. Tree search algorithms within MDPs have been well studied and cover two classes of search trees, namely closed loop (keller2013trial; kocsis2006bandit; browne2012survey) and open loop (bubeck2010open; lecarpentier2018open). Following (keller2013trial), we consider closed loop search trees, composed of decision nodes alternating with chance nodes. We adapt their formulation to take time into account, resulting in the following definitions. A decision node at depth $t$, denoted by ${\nu}^{s,t}$, is labeled by a unique state / decision epoch pair $(s,t)$. The edges leading to its children chance nodes correspond to the available actions at $(s,t)$. A chance node, denoted by ${\nu}^{s,t,a}$, is labeled by a state / decision epoch / action triplet $(s,t,a)$. The edges leading to its children decision nodes correspond to the reachable state / decision epoch pairs $({s}^{\prime},{t}^{\prime})$ after performing $a$ in $(s,t)$ as illustrated by Figure 1.

We consider the problem of estimating the optimal action ${a}_{0}^{*}$ at ${s}_{0},{t}_{0}$ within a worstcase NSMDP, knowing MDP${}_{{t}_{0}}$. This problem is twofold. It requires 1) to estimate the worstcase NSMDP given MDP${}_{{t}_{0}}$ and 2) to explore the latter in order to identify ${a}_{0}^{*}$. We propose to tackle both problems with an algorithm inspired by the minimax algorithm (fudenberg1991game) where the max operator corresponds to the agent’s policy, seeking to maximize the return; and the min operator corresponds to the worstcase model, seeking to minimize the return. Estimating the worstcase NSMDP requires to estimate the sequence of subsequent snapshots minimizing Equation 2. The interdependence of those snapshots (Equation 1) makes the problem hard to solve (iyengar2005robust), particularly because of the combinatorial nature of the opponent’s action space. Instead, we propose to solve a relaxation of this problem, by considering snapshots only constrained by MDP${}_{{t}_{0}}$. Making this approximation leaves a possibility to violate property 2 but allows for an efficient search within the developed tree and (as will be shown experimentally) leads to robust policies. For that purpose, we define the set of admissible snapshot models w.r.t. MDP${}_{{t}_{0}}$ by ${\mathrm{\Delta}}_{{t}_{0}}^{t}\u2254{\mathcal{B}}_{{W}_{1}}({p}_{{t}_{0}}(\cdot \mid s,a),{L}_{p}t{t}_{0})\times {\mathcal{B}}_{\cdot }({R}_{{t}_{0}}(s,a),{L}_{R}t{t}_{0})$. The relaxed analogues of Equations 1 and 2 for $s,t,a\in \mathcal{S}\times \mathcal{T}\times \mathcal{A}$ are defined as follows:
$${\widehat{V}}_{{t}_{0},t}^{\pi}(s)\u2254\underset{({p}_{i},{R}_{i})\in {\mathrm{\Delta}}_{{t}_{0}}^{i},\forall i\in \mathcal{T}}{\mathrm{min}}\mathbb{E}\left[\sum _{i=t}^{\mathrm{\infty}}{\gamma}^{it}{R}_{i}({s}_{i},{a}_{i})\begin{array}{cc}& {s}_{t}=s\hfill \\ & {a}_{i}\sim {\pi}_{i}(\cdot \mid {s}_{i}),{s}_{i+1}\sim {p}_{i}(\cdot \mid {s}_{i},{a}_{i})\hfill \end{array}\right],$$  
$${\widehat{Q}}_{{t}_{0},t}^{\pi}(s,a)\u2254\underset{(p,R)\in {\mathrm{\Delta}}_{{t}_{0}}^{t}}{\mathrm{min}}\underset{{s}^{\prime}\sim p}{\mathbb{E}}\left[R(s,a)+\gamma {\widehat{V}}_{{t}_{0},t+1}^{\pi}({s}^{\prime})\right].$$ 
Their optimal counterparts, while seeking to find the optimal policy, verify the following equations:
${\widehat{V}}_{{t}_{0},t}^{*}(s)$  $=\underset{a\in \mathcal{A}}{\mathrm{max}}{\widehat{Q}}_{{t}_{0},t}^{*}(s,a),$  (3)  
${\widehat{Q}}_{{t}_{0},t}^{*}(s,a)$  $=\underset{\begin{array}{c}(p,R)\in {\mathrm{\Delta}}_{{t}_{0}}^{t}\end{array}}{\mathrm{min}}\underset{\begin{array}{c}{s}^{\prime}\sim p\end{array}}{\mathbb{E}}\left[R(s,a)+\gamma {\widehat{V}}_{{t}_{0},t+1}^{*}({s}^{\prime})\right].$  (4) 
We now provide a method to calculate those quantities within the nodes of the tree search algorithm.
Max nodes.
A decision node ${\nu}^{s,t}$ corresponds to a max node due to the greediness of the agent w.r.t. the subsequent values of the children.
We aim at maximizing the return while retaining a riskaverse behavior.
As a result, the value of ${\nu}^{s,t}$ follows Equation 3 and is defined as:
$$V({\nu}^{s,t})=\underset{a\in \mathcal{A}}{\mathrm{max}}V({\nu}^{s,t,a}).$$  (5) 
Min nodes. A chance node ${\nu}^{s,t,a}$ corresponds to a min node due to the use of a worstcase NSMDP as a model which minimizes the value of ${\nu}^{s,t,a}$ w.r.t. the reward and the subsequent values of its children. Writing the value of ${\nu}^{s,t,a}$ as the value of $s,t,a$, within the worstcase snapshot minimizing Equation 4, and using the children’s values as values for the next reachable states, leads to Equation 6.
$$V({\nu}^{s,t,a})=\underset{(p,R)\in {\mathrm{\Delta}}_{{t}_{0}}^{t}}{\mathrm{min}}R(s,a)+\gamma \underset{{s}^{\prime}\sim p}{\mathbb{E}}V({\nu}^{{s}^{\prime},t+1})$$  (6) 
Our approach considers the environment as an adversarial agent, as in an asymmetric twoplayer game, in order to search for a robust plan. The resulting algorithm, RATS for RiskAverse TreeSearch, is described in Algorithm 6. Given an initial state / decision epoch pair, a minimax tree is built using the snapshot MDP${}_{{t}_{0}}$ and the operators corresponding to Equations 5 and 6 in order to estimate the worstcase snapshots at each depth. The tree is built, the action leading to the best possible value from the root node is selected and a real transition is performed. The next state is then reached, the new snapshot model MDP${}_{{t}_{0}+1}$ is acquired and the process restarts. Notice the use of $R(\nu )$ and $p({\nu}^{\prime}\mid \nu )$ in the pseudocode: they are light notations respectively standing for ${R}_{t}(s,a)$ corresponding to a chance node $\nu \equiv {\nu}^{s,t,a}$ and the probability ${p}_{t}({s}^{\prime}s,a)$ to jump to a decision node ${\nu}^{\prime}\equiv {\nu}^{{s}^{\prime},t+1}$ given a chance node $\nu \equiv {\nu}^{s,t,a}$. {algorithm2e}[t] \DontPrintSemicolonRATS (${s}_{0}$, ${t}_{0}$, maxDepth) ${\nu}_{0}$ = rootNode(${s}_{0},{t}_{0}$) Minimax(${\nu}_{0}$) ${\nu}^{*}={\mathrm{arg}\mathrm{max}}_{{\nu}^{\prime}\mathbf{\text{in}}{\nu}_{0}\text{.children}}{\nu}^{\prime}\text{.value}$ return ${\nu}^{\mathrm{*}}\mathrm{.}\mathrm{\text{action}}$ Minimax ($\nu $, maxDepth) \If$\nu $ is DecisionNode \If$\nu $.state is terminal or $\nu $.depth = maxDepth return $\nu $.value = heuristicValue($\nu $.state) \Else return $\nu $.value = ${\mathrm{max}}_{{\nu}^{\prime}\in \nu \text{.children}}$Minimax(${\nu}^{\prime}$, maxDepth) \Else return $\nu \mathbf{}\mathrm{\text{.value}}\mathrm{=}{\mathrm{min}}_{\mathrm{(}p\mathrm{,}R\mathrm{)}\mathrm{\in}{\mathrm{\Delta}}_{{t}_{\mathrm{0}}}^{t}}\mathbf{}R\mathbf{}\mathrm{(}\nu \mathrm{)}\mathrm{+}\gamma \mathbf{}{\mathrm{\sum}}_{{\nu}^{\mathrm{\prime}}\mathrm{\in}\nu \mathbf{}\mathrm{\text{.children}}}p\mathbf{}\mathrm{(}{\nu}^{\mathrm{\prime}}\mathrm{\mid}\nu \mathrm{)}\mathbf{}\mathrm{\text{Minimax}}\mathbf{}\mathrm{(}{\nu}^{\mathrm{\prime}}\mathrm{,}\mathrm{\text{maxDepth}}\mathrm{)}$ The tree built by RATS is entirely developed until the maximum depth ${d}_{\mathrm{max}}$. A heuristic function is used to evaluate the leaf nodes of the tree.
Analysis of RATS. We are interested in characterizing Algorithm 6 without function approximation and therefore will consider finite, countable, $\mathcal{S}\times \mathcal{A}$ sets. We now detail the computation of the min operator (Property 3), the computational complexity of RATS (Property 4) and the heuristic function.
Property 3.
Closedform expression of the worst case snapshot of a chance node. Following Algorithm 6, a solution to Equation 6 is given by:
$$\widehat{R}(s,a)={R}_{{t}_{0}}(s,a){L}_{R}t{t}_{0}\mathit{\hspace{1em}}\text{\mathit{a}\mathit{n}\mathit{d}}\mathit{\hspace{1em}}\widehat{p}(\cdot \mid s,a)=(1\lambda ){p}_{{t}_{0}}(\cdot \mid s,a)+\lambda {p}_{\text{\mathit{s}\mathit{a}\mathit{t}}}(\cdot \mid s,a)$$ 
with ${p}_{\text{\mathit{s}\mathit{a}\mathit{t}}}\mathrm{(}\mathrm{\cdot}\mathrm{\mid}s\mathrm{,}a\mathrm{)}\mathrm{=}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{\cdots}\mathrm{,}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{,}\mathrm{0}\mathrm{,}\mathrm{\cdots}\mathrm{,}\mathrm{0}\mathrm{)}$ with $\mathrm{1}$ at position ${\mathrm{arg}\mathit{}\mathrm{min}}_{{s}^{\mathrm{\prime}}}\mathit{}V\mathit{}\mathrm{(}{\nu}^{{s}^{\mathrm{\prime}}\mathrm{,}t\mathrm{+}\mathrm{1}}\mathrm{)}$, $\lambda \mathrm{=}\mathrm{1}\mathit{}\mathit{\text{if}}\mathit{}{W}_{\mathrm{1}}\mathit{}\mathrm{(}{p}_{\text{\mathit{s}\mathit{a}\mathit{t}}}\mathrm{,}{p}_{\mathrm{0}}\mathrm{)}\mathrm{\le}{L}_{p}\mathit{}\mathrm{}t\mathrm{}{t}_{\mathrm{0}}\mathrm{}$ and $\lambda \mathrm{=}{L}_{p}\mathit{}\mathrm{}t\mathrm{}{t}_{\mathrm{0}}\mathrm{}\mathrm{/}{W}_{\mathrm{1}}\mathit{}\mathrm{(}{p}_{\text{\mathit{s}\mathit{a}\mathit{t}}}\mathrm{,}{p}_{\mathrm{0}}\mathrm{)}$ otherwise.
Property 4.
Computational complexity. The total computation complexity of Algorithm 6 is $\mathrm{O}\mathrm{}\mathrm{(}\mathrm{B}\mathrm{}{\mathrm{}\mathrm{S}\mathrm{}}^{\mathrm{1.5}}\mathrm{}\mathrm{}\mathrm{A}\mathrm{}\mathrm{}{\mathrm{\left(}\mathrm{}\mathrm{S}\mathrm{}\mathrm{}\mathrm{}\mathrm{A}\mathrm{}\mathrm{\right)}}^{{\mathrm{d}}_{\mathrm{max}}}\mathrm{)}$ with $\mathrm{B}$ the number of time steps and ${\mathrm{d}}_{\mathrm{max}}$ the maximum depth.
Heuristic function. As in vanilla minimax algorithms, Algorithm 6 bootstraps the values of the leaf nodes with a heuristic function if these leaves do not correspond to terminal states. Given such a leaf node ${\nu}^{s,t}$, a heuristic aims at estimating the value of the optimal policy at $(s,t)$ within the worstcase NSMDP, i.e. ${\widehat{V}}_{{t}_{0},t}^{*}(s)$. Let $H(s,t)$ be such a heuristic function, we call heuristic error in $(s,t)$ the difference between $H(s,t)$ and ${\widehat{V}}_{{t}_{0},t}^{*}(s)$. Assuming that the heuristic error is uniformly bounded, the following property provides an upper bound on the propagated error due to the choice of $H$.
Property 5.
Upper bound on the propagated heuristic error within RATS. Consider an agent executing Algorithm 6 at ${\mathrm{s}}_{\mathrm{0}}\mathrm{,}{\mathrm{t}}_{\mathrm{0}}$ with a heuristic function $\mathrm{H}$. We note $\mathrm{L}$ the set of all leaf nodes. Suppose that the heuristic error is uniformly bounded, i.e. $\mathrm{\exists}\mathrm{\delta}\mathrm{>}\mathrm{0}\mathrm{,}\mathrm{\forall}{\mathrm{\nu}}^{\mathrm{s}\mathrm{,}\mathrm{t}}\mathrm{\in}\mathrm{L}\mathrm{,}\mathrm{}\mathrm{H}\mathrm{}\mathrm{(}\mathrm{s}\mathrm{)}\mathrm{}{\widehat{\mathrm{V}}}_{{\mathrm{t}}_{\mathrm{0}}\mathrm{,}\mathrm{t}}^{\mathrm{*}}\mathrm{}\mathrm{(}\mathrm{s}\mathrm{)}\mathrm{}\mathrm{\le}\mathrm{\delta}$. Then we have for every decision and chance nodes ${\mathrm{\nu}}^{\mathrm{s}\mathrm{,}\mathrm{t}}$ and ${\mathrm{\nu}}^{\mathrm{s}\mathrm{,}\mathrm{t}\mathrm{,}\mathrm{a}}$, at any depth $\mathrm{d}\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{\mathrm{d}}_{\mathrm{max}}\mathrm{]}$:
$V({\nu}^{s,t}){\widehat{V}}_{{t}_{0},t}^{*}(s)\le {\gamma}^{({d}_{\mathrm{max}}d)}\delta \mathit{\hspace{1em}}\mathit{\text{and}}\mathit{\hspace{1em}}V({\nu}^{s,t,a}){\widehat{Q}}_{{t}_{0},t}^{*}(s,a)\le {\gamma}^{({d}_{\mathrm{max}}d)}\delta .$ 
This last result implies that with any heuristic function $H$ inducing a uniform heuristic error, the propagated error at the root of the tree is guaranteed to be upper bounded by ${\gamma}^{{d}_{\mathrm{max}}}\delta $. In particular, since the reward function is bounded by hypothesis, we have ${\widehat{V}}_{{t}_{0},t}^{*}(s)\le 1/(1\gamma )$. Thus, selecting for instance the zero function ensures a root node heuristic error of at most ${\gamma}^{{d}_{\mathrm{max}}}/(1\gamma )$. In order to improve the precision of the algorithm, we propose to guide the heuristic by using a function reflecting better the value of state $s$ at leaf node ${\nu}^{s,t}$. The ideal function would of course be $H(s)={\widehat{V}}_{{t}_{0},t}^{*}(s)$, reducing the heuristic error to zero, but this is intractable. Instead, we suggest to use the value of $s$ within the snapshot MDP${}_{t}$ using an evaluation policy $\pi $, i.e. $H(s)={V}_{{\text{MDP}}_{t}}^{\pi}(s)$. This snapshot is also not available, but Property 6 provides a range wherein this value lies.
Property 6.
Bounds on the snapshots values. Let $\mathrm{s}\mathrm{\in}\mathrm{S}$, $\mathrm{\pi}$ a stationary policy, MDP${}_{{\mathrm{t}}_{\mathrm{0}}}$ and MDP${}_{\mathrm{t}}$ two snapshot MDPs, $\mathrm{t}\mathrm{,}{\mathrm{t}}_{\mathrm{0}}\mathrm{\in}{\mathrm{T}}^{\mathrm{2}}$ be. We note ${\mathrm{V}}_{{\mathrm{\text{MDP}}}_{\mathrm{i}}}^{\mathrm{\pi}}\mathrm{}\mathrm{(}\mathrm{s}\mathrm{)}$ the value of $\mathrm{s}$ within MDP${}_{\mathrm{i}}$ following $\mathrm{\pi}$. Then,
$${V}_{{\text{\mathit{M}\mathit{D}\mathit{P}}}_{{t}_{0}}}^{\pi}(s){V}_{{\text{\mathit{M}\mathit{D}\mathit{P}}}_{t}}^{\pi}(s)\le t{t}_{0}{L}_{R}/(1\gamma ).$$ 
Since ${\text{MDP}}_{{t}_{0}}$ is available, ${V}_{{\text{MDP}}_{{t}_{0}}}^{\pi}(s)$ can be estimated, e.g. via MonteCarlo rollouts. Let ${\widehat{V}}_{{\text{MDP}}_{{t}_{0}}}^{\pi}(s)$ denote such an estimate. Following Property 6, ${V}_{{\text{MDP}}_{{t}_{0}}}^{\pi}(s)t{t}_{0}{L}_{R}/(1\gamma )\le {V}_{{\text{MDP}}_{t}}^{\pi}(s)$. Hence, a worstcase heuristic on ${V}_{{\text{MDP}}_{t}}^{\pi}(s)$ is $H(s)={\widehat{V}}_{{\text{MDP}}_{{t}_{0}}}^{\pi}(s)t{t}_{0}{L}_{R}/(1\gamma )$. The bounds provided by Property 5 decrease quickly with ${d}_{\mathrm{max}}$, and given that ${d}_{\mathrm{max}}$ is large enough, RATS provides the optimal riskaverse maximizing the worstcase value for any evolution of the NSMDP.
6 Experiments
We compare the RATS algorithm with two policies ^{1}^{1} 1 Code: https://github.com/SuReLI/ratsexperiments – ML reproducibility checklist: Appendix Section LABEL:sec:reproducibilitychecklist. . The first one, named DPsnapshot, uses Dynamic Programming to compute the optimal actions w.r.t. the snapshot models at each decision epoch. The second one, named DPNSMDP, uses the real NSMDP as a model to provide its optimal action. The latter behaves as an omniscient agent and should be seen as an upper bound on the performance. We choose a particular gridworld domain coined “NonStationary bridge” illustrated in Appendix, Section LABEL:sec:nsbridgefigure. An agent starts at the state labeled S in the center and the goal is to reach one of the two terminal states labeled G where a reward of +1 is received. The gray cells represent holes that are terminal states where a reward of 1 is received. Reaching the goal on the right leads to the highest payoff since it is closest to the initial state and a discount factor $\gamma =0.9$ is applied. The actions are $\mathcal{A}=\{\text{Up, Right, Down, Left}\}$. The transition function is stochastic and nonstationary. At decision epoch $t=0$, any action deterministically yields the intuitive outcome. With time, when applying Left or Right, the probability to reach the positions usually stemming from Up and Down increases symmetrically until reaching 0.45. We set the Lipschitz constant ${L}_{p}=1$. Aside, we introduce a parameter $\u03f5\in [0,1]$ controlling the behavior of the environment. If $\u03f5=0$, only the lefthand side bridge becomes slippery with time. It reflects a close to worstcase evolution for a policy aiming to the lefthand side goal. If $\u03f5=1$, only the righthand side bridge becomes slippery with time. It reflects a close to worstcase evolution for a policy aiming to the righthand side goal. In between, the misstep probability is proportionally balanced between left and right. One should note that changing $\u03f5$ from 0 to 1 does not cover all the possible evolutions from MDP${}_{{t}_{0}}$ but provides a concrete, graphical illustration of RATS’s behavior for various possible evolutions of the NSMDP.
We tested RATS with ${d}_{\mathrm{max}}=6$ so that leaf nodes in the search tree are terminal states. Hence, the optimal riskaverse policy is applied and no heuristic approximation is made. Our goal is to demonstrate that planning in this worstcase NSMDP allows to minimize the loss given any possible evolution of the environment. To illustrate this, we report results reflecting different evolutions of the same NSMDP using the $\u03f5$ factor. It should be noted that, at $t=0$, RATS always moves to the left, even if the goal is further, since going to the right may be risky if the probabilities to go Up and Down increase. This corresponds to the careful, riskaverse, behavior. Conversely, DPsnapshot always moves to the right since ${\text{MDP}}_{0}$ does not capture this risk. As a result, the $\u03f5=0$ case reflects a favorable evolution for DPsnapshot and a bad one for RATS. The opposite occurs with $\u03f5=1$ where the cautious behavior dominates over the risky one, and the inbetween cases mitigate this effect.
In Figure 3(a), we display the achieved expected return for each algorithm as a function of $\u03f5$, i.e. as a function of the possible evolutions of the NSMDP. As expected, the performance of DPsnapshot strongly depends on this evolution. It achieves high return for $\u03f5=0$ and low return for $\u03f5=1$. Conversely, the performance of RATS varies less across the different values of $\u03f5$. The effect illustrated here is that RATS maximizes the minimal possible return given any evolution of the NSMDP. It provides the guarantee to achieve the best return in the worstcase. This behavior is highly desirable when one requires robust performance guarantees as, for instance, in critical certification processes.
Figure 3(b) displays the return distributions of the three algorithms for $\u03f5\in \{0,0.5,1\}$. The effect seen here is the tendency for RATS to diminish the left tail of the distribution corresponding to low returns for each evolution. It corresponds to the optimized criteria, i.e. robustly maximizing the worstcase value. A common risk measure is the Conditional Value at Risk (CVaR) defined as the expected return in the worst $q$% cases. We illustrate the CVaR at 5% achieved by each algorithm in Table 2. Notice that RATS always maximizes the CVaR compared to both DPsnapshot and DPNSMDP. Indeed, even if the latter uses the true model, the optimized criteria in DP is the expected return.
7 Conclusion
We proposed an approach for robust planning in nonstationary stochastic environments. We introduced the framework of Lipchitz Continuous NonStationary MDPs (NSMDPs) and derived the RiskAverse TreeSearch (RATS) algorithm, to predict the worstcase evolution and to plan optimally w.r.t. this worstcase NSMDP. We analyzed RATS theoretically and showed that it approximates a worstcase NSMDP with a control parameter that is the depth of the search tree. We showed empirically the benefit of the approach that searches for the highest lower bound on the worst achievable score. RATS is robust to every possible evolution of the environment, i.e. maximizing the expected worstcase outcome on the whole set of possible NSMDPs. Our method was applied to the uncertainty on the evolution of a model. Generally, it could be extended to any uncertainty on the model used for planning, given bounds on the set of the feasible models. The purpose of this contribution is to lay a basis of worstcase analysis for robust solutions to NSMDPs. As is, RATS is computationally intensive and scaling the algorithm to larger problems is an exciting future challenge.
Acknowledgments
This research was supported by the Occitanie region, France.
References
See pages  of ratsappendix/ratsappendix