Abstract
Reinforcement learning (RL) methods learn optimal decisions in the presenceof a stationary environment. However, the stationary assumption on theenvironment is very restrictive. In many real world problems like trafficsignal control, robotic applications, one often encounters situations withnonstationary environments and in these scenarios, RL methods yieldsuboptimal decisions. In this paper, we thus consider the problem ofdeveloping RL methods that obtain optimal decisions in a nonstationaryenvironment. The goal of this problem is to maximize the longterm discountedreward achieved when the underlying model of the environment changes over time.To achieve this, we first adapt a change point algorithm to detect change inthe statistics of the environment and then develop an RL algorithm thatmaximizes the longrun reward accrued. We illustrate that our change pointmethod detects change in the model of the environment effectively and thusfacilitates the RL algorithm in maximizing the longrun reward. We furthervalidate the effectiveness of the proposed solution on nonstationary randomMarkov decision processes, a sensor energy management problem and a trafficsignal control problem.
Quick Read (beta)
Reinforcement Learning in NonStationary Environments
Abstract
Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationary assumption on the environment is very restrictive. In many real world problems like traffic signal control, robotic applications, etc., one often encounters situations with nonstationary environments and in these scenarios, RL methods yield suboptimal decisions. In this paper, we thus consider the problem of developing RL methods that obtain optimal decisions in a nonstationary environment. The goal of this problem is to maximize the longterm discounted reward accrued when the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect change in the statistics of the environment and then develop an RL algorithm that maximizes the longrun reward accrued. We illustrate that our change point method detects change in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the longrun reward. We further validate the effectiveness of the proposed solution on nonstationary random Markov decision processes, a sensor energy management problem and a traffic signal control problem.
Keywords:
Markov decision processes, Reinforcement Learning, NonStationary Environments, Change Detection.compat=newest \usetikzlibraryintersections \usetikzlibrarypositioning,angles,quotes,patterns \tikzstyleblock = [rectangle, draw,text width=3em, text centered, rounded corners] \tikzstyleblock2 = [rectangle, draw, text centered, rounded corners] \tikzstyleline = [draw, latex]
1 Introduction
Autonomous agents are increasingly being designed for sequential decisionmaking tasks under uncertainty in various domains. For e.g., in traffic signal control Prashanth and Bhatnagar (2011), an autonomous agent decides on the green signal duration for all lanes at a traffic junction, while in robotic applications, humanlike robotic agents are built to dexterously manipulate physical objects Andrychowicz et al. (2019); Niroui et al. (2019). The common aspect in these applications is the evolution of the state of the system based on decisions by the agent. In traffic signal control for instance, the state is the vector of current congestion levels at the various lanes of a junction and the agent decides on the green signal duration for all lanes at the junction, while in a robotic application, the state can be motor angles of the joints etc., and the robot decides on the torque for all motors. The key aspect is that the decision by the agent affects the immediate next state of the system, the reward (or cost) obtained as well as the future states. Further, the sequence of decisions by the agent is ranked based on a fixed performance criterion, which is a function of the rewards obtained for all decisions made. The central problem in sequential decisionmaking is that the agent must find a sequence of decisions for every state such that this performance criterion is optimized. Markov decision processes, dynamic programming (DP) and reinforcement learning (RL) Bertsekas (2013); Puterman (2005); Sutton and Barto (2018) provide a rich mathematical framework and algorithms which aid an agent in sequential decision making under uncertainty.
In this paper, we consider an important facet of reallife applications where the agent has to deal with nonstationary rewards and nonstationary transition probabilities between system states. For example, in vehicular traffic signal control, the traffic inflow rate in some (or all) lanes is quite different during peak and offpeak hours. The varying traffic inflow rates makes some lane queue length configurations more probable compared to other configurations, depending on the peak and offpeak traffic patterns. It is paramount that under such conditions, the agent selects appropriate green signal duration taking into account the different traffic patterns. Also, in robotic navigation Niroui et al. (2019), the controller might have to vary robotic arm/limb joint angles depending on the terrain or weather conditions to ensure proper locomotion, because the same joint angles may give rise to different movement trajectories in varying terrains and weather conditions. When environment dynamics or rewards change with time, the agent must quickly adapt its policy to maximize the longterm cumulative rewards collected and ensure proper and efficient system operation as well. We view this scenario as illustrated in Fig. 1, where the environment changes between models $1,2,\mathrm{\dots},n$ dynamically. The epochs at which these changes take place are unknown to (or hidden from) the agent controlling the system. The implication of the nonstationary environment is this: when the agent exercises a control ${a}_{t}$ at time $t$, the next state ${s}_{t+1}$ as well as the reward ${r}_{t}$ are functions of the active environment model dynamics.
Motivated by the realworld applications where changing environment dynamics (and/or rewards, costs) is frequently observed, we focus on developing a modelfree RL method that learns optimal policies for nonstationary environments.
1.1 Our Contributions

•
The primary contribution of this paper is to propose a modelfree RL algorithm for handling nonstationary environments. In this work, we adapt Qlearning (QL) Watkins and Dayan (1992) to learn optimal policies for different environment models.
 •

•
Context Qlearning utilizes data samples collected during learning to detect changes in the model. To achieve this, it leverages a novel change detection algorithm Prabuchandran et al. (2019) for these samples. Using results of change detection, the method estimates policy for the new model or improves the policy learnt, if the model had been previously experienced. In this manner, our method avoids catastrophic forgetting Kemker et al. (2018).

•
Context Qlearning is an online method which can learn and store the policies for the different environment contexts. We assume that modelchange patterns are known. Our method is novel in the sense that we utilize state and reward samples to detect these changes, unlike previous works which need model information. Thus, our method is modelfree and can be used in realworld applications of RL.

•
The experimental results (see Section 6) show the performance of Context QL on standard problems as well as on interesting problems in traffic signal control and sensor networks. The performance is evaluated on metrics like mean detection delay as well as precision and recall. This is in addition to the main performance metric, which is the reward collected by the algorithm in presence of dynamic environments. The results are compared with recent relevant works and the advantages of Context Qlearning are highlighted.
1.2 Organization of the Paper
The rest of the paper is organised as follows. In the following Section, we discuss prior approaches to this problem. In Section 3, we give a brief background of Markov decision process (MDP) framework. This section describes the basic definitions and assumptions made by the DP algorithms for solving MDPs. Section 4 describes the problem in mathematical terms. Additionally, it also introduces the notation that will be used in the rest of the paper. We propose an RL method for nonstationary environments in Section 5. Section 6 shows numerical results on different application domains and analyzes the results. Finally, Section 7 provides the concluding remarks.
2 Related Work
Very few prior works have considered the problem of developing RL algorithms for nonstationary environment models. Choi et al. (2000b, a) proposed modeling changing environments in terms of hiddenmode MDPs (HMMDPs), wherein each mode (or context) captures a stationary MDP setting and mode transitions are hidden. All modes share the state and action spaces, but differ either in transition probability function of system states and/or reward function. The methods described in Choi et al. (2000a) require information about these functions for each of the modes. Additionally, algorithms which find optimal policies for systems modeled as HMMDP are computationally intensive and are not practically implementable.
A context detection based RL algorithm (called RLCD) is proposed in da Silva et al. (2006). The RLCD algorithm estimates transition probability and reward functions from simulation samples, while predictors are used to assess whether these underlying MDP functions have changed. The active context which could give rise to the current statereward samples is selected based on an error score. The error score of all contexts is estimated till the current epoch. The context which minimizes this error score is designated as the current active model. If all the contexts have a high error score, a new context is estimated. RLCD does not require apriori knowledge about the number of environment contexts, but is highly memory intensive, since it stores and updates estimates of transition probabilities and rewards corresponding to all detected contexts. Moreover, the predictor which allows detection of new contexts is heuristic and is not easy to interpret.
Theoretical framework for RL in fast changing environments based on $(\u03f5,\delta )$MDP is developed in Csáji and Monostori (2008). In this framework, if the accumulated changes in transition probability or reward function remain bounded over time and are insignificant, then Csáji and Monostori (2008) shows that changes in the optimal value function are also negligible. However, this work does not provide a control algorithm which adapts to the changes in environment.
Regret minimization algorithms proposed in Hallak et al. (2015); Jaksch et al. (2010); Ortner et al. (2019) study MDPs with varying transition probability and reward functions. These works consider a finite horizon $T$ and minimize the regret over this horizon, when the environment changes utmost $K$ times (in Hallak et al. (2015); Jaksch et al. (2010)) or arbitrarily (in Ortner et al. (2019)). In both cases, the objective of algorithm proposed in Hallak et al. (2015), the UCRL2 algorithm Jaksch et al. (2010) and variationaware UCRL2 Ortner et al. (2019) is to reduce the sum of missed rewards compared to the rewards yielded by optimal policies in the periods during which the environment parameters remain constant. However, the optimal policies defined in these works differ with respect to the performance criterion. In Jaksch et al. (2010), the optimal policies are stationary and averagereward optimal, while Hallak et al. (2015); Ortner et al. (2019) considers an optimal (possibly nonstationary) optimal policy to be totalreward optimal. Hallak et al. (2015) considers a nonstationary optimal policy, which is timeordered using as components totalreward optimal policies of all contexts.
Hallak et al. (2015) considers the setting wherein the time horizon $T$ is divided into $H$ episodes, with an MDP context picked at the start of each episode. After the context is picked (probably by an adversary), a start state for the episode is also selected. The context is selected from a finite set $\mathcal{C}$, ($\mathcal{C}=K$) but is hidden from the RL controller. The algorithm clusters the observed episodes into one of the $K\le T1$ models and classifies an episode as belonging to one of these clusters. Depending on the cluster chosen, the context is explored and rewards are obtained.
The UCRL2 and variationaware UCRL2 algorithms estimate the transition probability function as well as the reward function for an environment, and when the environment changes, the estimation restarts leading to a loss in the information collected. The objective of these algorithms is to minimize the regret during learning and not to find the appropriate policies for the different environment settings. Hence if the environment pattern alternates between two different settings (say A and B), i.e., if the change pattern is ABAB, then these algorithms restart estimation of transition probability and reward functions from scratch for both environments A and B when the second time these environments are encountered. In contrast, the objective of our method is to learn appropriate policies for each of the environments without discarding the previously gathered information.
A modelbased method for detecting changes in environment models was proposed in Banerjee et al. (2017), while Hadoux et al. (2014) proposes an extension to the RLCD method. Both these works employ quickest change detection Shiryaev (1963) methods to detect changes in the transition probability function and /or reward function. The approach in Hadoux et al. (2014) executes the optimal policy for each MDP, while parallely using cumulative sum (CUSUM) Page (1954) technique to find changes. Banerjee et al. (2017) shows that such an approach leads to loss in performance with delayed detection. It designs a twothreshold switching policy based on KullbackLeibler (KL) divergence that detects changes faster, although with a slight loss in rewards accrued. However, Banerjee et al. (2017) is limited in scope, since it assumes that complete model information of all the contexts is known. Hence, the work is not applicable in modelfree RL settings. Moreover, Banerjee et al. (2017) does not specify any technique for selecting the threshold values used in the switching strategy, even though the method proposed is completely reliant on the threshold values chosen.
A variant of Qlearning (QL), called the Repeated Update QL (RUQL) was proposed in Abdallah and Kaisers (2016). It essentially repeats the updates to the Q values of a stateaction pair and is shown to have learning dynamics which is better suited to nonstationary environment tasks, on simulation experiments. However, the RUQL faces the same issues as QL  it can learn optimal policies for only one environment model at a time. QL and RUQL update the same set of Q values, even if environment model changes. Additionally, unlike da Silva et al. (2006); Banerjee et al. (2017); Hadoux et al. (2014), QL and RUQL do not incorporate mechanisms for monitoring changes in environment. So, the RL agent utilizing QL or RUQL to learn policies cannot infer whether the model has changed. This leads to the scenario that when the environment parameters change, the RL agent starts learning policy for the new environment by completely discarding the policy learnt using samples obtained from the environment before the change.
A related problem is concept drift detection Harel et al. (2014); Iwashita and Papa (2019); Liebman et al. (2018); Roveri (2019), which has been wellstudied in supervised learning Harel et al. (2014), semisupervised learning and data mining Iwashita and Papa (2019). This area too addresses the problem of learning in dynamic environments. In supervised and semisupervised learning, all instances of the training data are usually assumed to be generated from the same “source” or “concept” (though samples may be noisy). However, in domains like recommendation systems, biosignals, concept drift occurs, wherein the same labelled dataset has different sources. This is very similar to the problem we address in this work. However, unlike the supervised and semisupervised learning approaches, concept drift in RL environment has not been explored in the literature. A recent work Liebman et al. (2018) formulates agent model retraining in presence of concept drift as a RL problem. Roveri (2019) is a recent work on concept drift in Markov chains. The authors propose concept drift detection mechanisms which is limited to finding the change in transition probabilities of Markov chains. However, this work can be used for concept drift in modelbased RL, as highlighted in Section 6.
While all the prior works have provided significant insights into the problem, there are still issues with computational efficiency. Moreover, a modelfree RL technique is needed which can retain past information and utilize it to learn better policies for all observed contexts. In the next section, we describe the underlying mathematical formulation for RL and the assumptions which are the basis of all classical RL algorithms.
3 Preliminaries
A Markov decision process (MDP) Bertsekas (2013) is formally defined as a tuple $M=\u27e8S,A,P,R\u27e9$, where $S$ is the set of states of the system, $A$ is the set of actions (or decisions) and $P:S\times A\times S\to [0,1]$ is the transition probability function. The transition function $P$ models the uncertainty in the evolution of states of the system based on the action exercised by the agent. The evolution is uncertain in the sense that given the current state $s$ and the action $a$, the system evolves into the next state according to the probability distribution $P(s,a,\cdot )$ over the set $S$. Actions are selected at certain decision epochs by the agent based on their feasibility in the given state. A decision epoch is the time instant at which the agent selects an action and the number of such epochs determines the decision horizon of the agent. When the number of decision epochs is infinite, we refer to $M$ as an infinitehorizon MDP. Depending on the application, each action yields a numerical reward (or cost), which is modeled by the function $R:S\times A\to \mathbb{R}$. Transition function $P$ and reward function $R$ define the environment model in which the system operates and the agent interacts with this environment. The interaction comprises of the action selection by the agent for the state and the environment presenting it with the future state and reward (or cost) for the action selected.
A deterministic decision rule $d:S\to A$ maps a state to its feasible actions and it models the agent’s action choice for every state. The agent picks a decision rule for every decision epoch. A stationary deterministic Markovian policy $\pi =(d,d,\mathrm{\dots})$ for an infinitehorizon MDP is a sequence of decision rules, where the deterministic decision rule does not change with the decision epochs. The value function ${V}^{\pi}:S\to \mathbb{R}$ associated with a policy $\pi $ is the expected total discounted reward obtained by following the policy $\pi $ and is defined as
$${V}^{\pi}(s)=\mathbb{E}[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t},d({s}_{t})){s}_{0}=s],$$  (1) 
for all $s\in S$. Here, $$ is the discount factor and it measures the current value of a unit reward that is received one epoch in the future. The value function is the performance criterion to be optimized in the sequential decisionmaking problem modeled as MDP. Thus, the objective is to find a stationary policy ${\pi}^{*}=({d}^{*},{d}^{*},\mathrm{\dots})$ such that
$${V}^{{\pi}^{*}}(s)=\underset{\pi \in {\mathrm{\Pi}}^{SD}}{\mathrm{max}}{V}^{\pi}(s),\forall s\in S$$  (2) 
where ${\mathrm{\Pi}}^{SD}$ is the set of all stationary deterministic Markovian policies. An optimal stationary deterministic Markovian policy satisfying (2) is known to exist under the following assumptions:
Assumption 1
$$, $\mathrm{\forall}a\mathrm{\in}A\mathit{}\mathrm{\forall}s\mathrm{\in}S$.
Assumption 2
Stationary $P$ and $R$, i.e., the functions $P$ and $R$ do not vary over time.
Dynamic programming (DP) Bertsekas (2013) techniques iteratively solve (2) and provide an optimal policy and the optimal value function for the given MDP based on the above assumptions, when model information in terms of $P$, $R$ is known. Modelfree reinforcement learning (RL) Sutton and Barto (2018) algorithms on the other hand obtain the optimal policy when Assumptions 1 and 2 hold, but model information is not available (and not estimated). In the nonstationary environment scenario, as we consider, Assumption 2 is invalid. Clearly, classical RL algorithms cannot help in learning optimal policies when Assumption 2 does not hold true. In the next section, we formally describe the problem of nonstationary environments and develop an RL algorithm which can tackle nonstationary environments.
4 Problem Formulation
In this section, we formulate the problem of learning optimal policies in MDP environments with model changes and introduce the notation that will be used in the rest of the paper. We define a family of MDPs $\{{M}_{\theta}\}$, where $\theta $ takes values from a finite index set $\mathrm{\Theta}$. For each $\theta \in \mathrm{\Theta}$, we define ${M}_{\theta}=\u27e8S,A,{P}_{\theta},{R}_{\theta}\u27e9$, where $S$ and $A$ are the state and action spaces, while ${P}_{\theta}$ is the transition probability kernel and ${R}_{\theta}$ is the reward function as defined before. The agent observes a sequence of states ${\{{s}_{t}\}}_{t\ge 0}$, where ${s}_{t}\in S$. For each state, an action ${a}_{t}$ is chosen based on a policy. For each pair $({s}_{t},{a}_{t})$, the next state ${s}_{t+1}$ is chosen according to the active environment model. Changepoints refer to the decision epochs at which the environment model changes. We denote the changepoints using the set of times ${\{{T}_{i}\}}_{i\ge 1}$. Here ${\{{T}_{i}\}}_{i\ge 1}$ form an increasing sequence of random integers. Thus, for example, at time ${T}_{1}$, the environment model will change from say ${M}_{{\theta}_{0}}$ to ${M}_{{\theta}_{1}}$, at ${T}_{2}$ it will change from ${M}_{{\theta}_{1}}$ to say ${M}_{{\theta}_{2}}$ and so on, where ${\theta}_{0},{\theta}_{1},\mathrm{\dots}\in \mathrm{\Theta}$. With respect to these model changes, the nonstationary dynamics for $t\ge 0$ will be
$$  (3) 
and the reward for $({s}_{t},{a}_{t})=(s,a)$ will be
$$  (4) 
We define the randomized historydependent decision rule at time $t$ as ${u}_{t}:{H}_{t}\to \mathcal{P}(A)$, where ${H}_{t}$ is the set of all possible histories at time $t$ and $\mathcal{P}(A)$ is the set of all probability distributions on $A$. An element of ${H}_{t}$ is of the form ${h}_{t}=({s}_{0},{a}_{0},{s}_{1},{a}_{1},\mathrm{\dots},{s}_{t})$. ${u}_{t}$ is history dependent since the distribution ${u}_{t}({h}_{t})\in \mathcal{P}(A)$ picked is dependent on the sequence of states and actions observed upto time $t$. Given this rule, the next action at current state ${s}_{t}$ is picked by sampling an action from ${u}_{t}(\cdot )$. If the decision rule is dependent on the current state only, irrespective of the history upto time $t1$, then ${s}_{t}$ suffices in place of ${h}_{t}$ and the decision rule is Markovian. Deterministic Markovian decision rule ${d}_{t}:S\to A$ defined earlier in the previous section, is then equivalent to ${u}_{t}$ when ${H}_{t}$ is replaced with ${S}_{t}$ and $\mathcal{P}(A)$ is replaced with the class of degenerate probability distributions over $A$.
Given the family of MDPs $\{{M}_{\theta}\}$, the objective is to learn a policy $\pi =({u}_{1},{u}_{2},\mathrm{\dots})$ such that the longrun expected sum of discounted rewards, i.e., $\mathbb{E}[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t},{u}_{t}({H}_{t})){H}_{0}={h}_{0}]$ is maximized for all initial histories ${h}_{0}\in {H}_{0}$. However, multiple issues arise when Assumption 2 is not satisfied which are compounded by the lack of model information:

1.
With changes in environment parameters, a stationary Markovian deterministic policy may not be optimal with respect to the above objective. Thus, any algorithm operating in this scenario needs to search over the space of randomized historydependent policies which is an intractable problem.

2.
When model information is not available, only samples of state and reward from simulation are available. A major problem is to use the state and reward samples to design an approximately optimal policy for nonstationary environments. Moreover, it is not clear as to which policy we must follow during the learning phase.
In the next section, we explore these issues and provide solutions to address them. We mainly provide solutions for the RL setting when model information is unavailable, suggesting what policies to be used while learning and how to detect changes. Hence, our method can also be utilized in the setting when model information is available. In this case, as we will reason out later, since the agent can compute the optimal policies individually for all MDP models/contexts, it needs to only learn when the context change occurs.
5 Our Approach
In this section, we describe the RL method to deal with changes in environment models. The method adapts a change detection algorithm Prabuchandran et al. (2019) to find changes in the pattern of statereward tuples observed while learning a policy.
5.1 Experience Tuples
The change in $P$ and $R$ ultimately manifests itself in a change in sample paths. Taking a cue from this, we consider change detection on statereward sequences. By tracking variations in the statereward sequence, our method captures variations in the occurrences of states and rewards, unlike previous works Banerjee et al. (2017); Hadoux et al. (2014). We call each sample of state and reward in this sequence as an experience tuple. Formally, an experience tuple ${e}_{t}$ at epoch $t$ is the triplet consisting of the current state ${s}_{t}$, current immediate reward (or cost) obtained ${r}_{t}$ and the next state ${s}_{t+1}$. So, ${e}_{t}=\u27e8{s}_{t},{r}_{t},{s}_{t+1}\u27e9$. The set of experience tuples $\{{e}_{t}:1\le t\le B\}$, where $B$ is a batch size, is input to the changepoint detection algorithm Prabuchandran et al. (2019).
For the proposed solution, we assume some structure in how the context changes occur. These assumptions are broadly mentioned below:
Assumption 3
The environment model changes at least once. Additionally, the pattern of model change(s) is also known and the number of such changes is finite.
Assumption 4
The environment context changes are not too frequent, i.e., we get sufficient statereward samples for every context before the environment switches to some other context.
5.2 Change Detection using Experience Tuples
We adapt the changepoint detection algorithm proposed in Prabuchandran et al. (2019) for data consisting of experience tuples. Prabuchandran et al. (2019) describes an online parametric Dirichlet changepoint (ODCP) algorithm for unconstrained multivariate data. The ODCP algorithm transforms any discrete or continuous data into compositional data and utilizes Dirichlet parameter likelihood testing to detect change points. Multiple changepoints are detected by performing a sequence of single changepoint detections.
ODCP uses an appropriate metric while detecting change points. One can also adapt other change detection algorithms like EDivisive Change Point detection (ECP) Matteson and James (2014). However, ECP uses Euclidean distance based metric to detect change points, which may not be suitable for discrete and compositional data that do not follow Euclidean geometry. Thus, ODCP reliably estimates change points compared to ECP.
ODCP requires the multivariate data to be i.i.d samples from a distribution. However, we utilize it in the Markovian setting, where the data obtained does not consist of independent samples. But the following justification helps in understanding why adapting ODCP for experience tuples might still be a good idea: Let us suppose we choose the actions according to a stationary Markovian randomized policy $\pi =\{u,u,\mathrm{\dots},\}$, where $u(s)\in \mathcal{P}(A)$. With an abuse of notation, we denote each decision rule as $\pi $, with $\pi (s)\in \mathcal{P}(A)$. Let $\psi (as)$ be the probability that action $a$ is selected in state $s$ according to the decision rule $\pi (s)$. Let ${\varphi}^{\pi}(\cdot )$ denote the steady state distribution under policy $\pi $. If we assume that the Markov reward process $\{{P}^{\pi},{R}^{\pi}\}$ is fast mixing Levin et al. (2006), then we get experience tuples based on the steadystate distribution ${\varphi}^{\pi}$. Under this condition, the tuple $(s,r,{s}^{\prime})$ namely the current state, reward and the next state will be distributed as follows:
$$(s,r,{s}^{\prime})\sim {\varphi}^{\pi}(s)\psi (\cdot s)P(s,\psi (\cdot s),{s}^{\prime}).$$ 
The tuples $({s}_{t},{r}_{t},{s}_{t+1})$ are identically distributed under data and utilize the ODCP algorithm.
We now provide some insight into the number of samples required for detecting changes in environment parameters using experience tuples. This is related to Assumption 4. The number of samples required is dictated by the size of the state and action space of the MDP. Let $m=S$, the size of the state space and $n=A$, the size of action space. Additionally, suppose ${R}^{\pi}(s,a)\in \mathcal{R},\forall (s,a)\in S\times A$, where $\mathcal{R}$ is a finite set. The number of possible staterewardstate tuples will be $m\times m\times \mathcal{R}$. For efficient detection and reduced false alarm probability, we need to get enough number of samples such that the state occupation probabilities are close to the actual steadystate probabilities.
The ODCP algorithm Prabuchandran et al. (2019) computes candidate changepoints by randomly permuting the given data samples. These candidate changepoints are ranked based on their statistical significance (see Prabuchandran et al. (2019) for more details). The number of permutations to be tested with a candidate changepoint is prefixed and is chosen based on the number of data samples. Thus, more the number of samples, higher is the number of permutations tested. Hence the number of permutations to be fixed is based on the number of experience tuples the agent obtains for every context.
5.3 Sampling Mechanism for Collecting Experience Tuples
In Section 4, we identified issues which arise when Assumption 2 is violated. To address the issue of sampling experience tuples, we design a policy that the agent can follow to collect the experience tuples. The RL method must detect a change in environment model (if it occurs) when the agent is in the process of learning a policy for controlling the MDP model. Thus, the behaviour policy which the agent utilizes to explore the MDP model (like in QL Watkins and Dayan (1992), SARSA Sutton and Barto (2018)) during learning, should also help the agent to get information about context (i.e., environment) changes. With this idea, we describe three mechanisms for exploration through behaviour policy:

1.
$\u03f5$policy: Suppose the model information pertaining to all environment contexts is known, i.e., the agent knows ${P}_{\theta}$, ${R}_{\theta}$, $\forall \theta \in \mathrm{\Theta}$. Hence, the optimal policies corresponding to all contexts is also known to the agent. However, in order to detect changes, there is a need for the agent to follow a policy which is approximately optimal. The reason to adopt an approximately optimal policy will be clear with the following example. Suppose that context changes from MDP ${M}_{0}$ to ${M}_{1}$ such that the probability and reward functions are same under the optimal policy of ${M}_{0}$, i.e., ${P}_{0}(s,{\pi}_{0}^{*}(s),{s}^{\prime})={P}_{1}(s,{\pi}_{0}^{*}(s),{s}^{\prime})$ and ${R}_{0}(s,{\pi}_{0}^{*}(s),{s}^{\prime})={R}_{1}(s,{\pi}_{0}^{*}(s),{s}^{\prime})$, but ${\pi}_{1}^{*}\ne {\pi}_{0}^{*}$ and ${\pi}_{1}^{*}$ is optimal for ${M}_{1}$. Then by following the optimal policy ${\pi}_{0}^{*}$ for ${M}_{0}$, the agent will not be able to detect changes in the environment. This is because, the distribution of statereward samples does not change under this policy ${\pi}_{0}^{*}$. So there is reason to explore other actions even though the optimal policy of both contexts is known. So, a sampling mechanism needs to explore actions other than the optimal action in order to detect changes. However, such exploration should be appropriately controlled, because there is the risk of following nonoptimal actions quite often while controlling the MDP system. As part of our solution we prescribe that experience tuples be collected using the following specific randomized policy when model information is known: at each state $s$, the agent should follow optimal action prescribed by ${\pi}_{i}^{*}$ with probability $(1\u03f5)$ and a random action with probability $\u03f5$, where $\u03f5>0$. Thus, the policy used is $\pi =(u,u,\mathrm{\dots})$, where $u:S\to \mathcal{P}(A)$, ${q}_{u}({\pi}_{i}^{*}(s))=1\u03f5$ and ${q}_{u}(a)=\frac{\u03f5}{A1}$, $a\in A\setminus \{{\pi}_{i}^{*}(s)\}$. We call this an $\u03f5$policy. Here, based on Assumption 3, the agent knows the active context $i$.

2.
Modelfree RL policies: The model information of contexts is not known. In this case, the agent can collect experience tuples while simultaneously following a modelfree learning algorithm to learn an approximately optimal policy. We propose the use of Qlearning (QL) Watkins and Dayan (1992), a modelfree iterative RL algorithm to obtain the experience tuples. QL estimates the optimal Qvalues for all stateaction pairs of an MDP. The Qvalue of a stateaction pair w.r.t policy $\pi $ is defined as the expected discounted return starting from state $s$, taking action $a$ and following policy $\pi $ thereafter. The QL iteration Watkins and Dayan (1992) requires that all stateaction pairs be explored for an infinite number of times, so that the optimal Qvalue of each pair can be accurately estimated, based on the reward obtained at each step of the algorithm. To ensure this, an exploration strategy is used. As part of our solution, we prescribe that experience tuples be collected using either of the following strategies:

•
$\u03f5$greedy: In state $s$, with probability $(1\u03f5)$, the action maximizing the Qvalue estimate of state $s$ at iteration $k$, i.e., ${\mathrm{arg}\mathrm{max}}_{b}{Q}_{k}(s,b)$ is selected, while with probability $\u03f5$, a random action in $A$ is selected.

•
UCB Tijsma et al. (2016): In state $s$, an action $a$ is selected as follows:
$$a=\underset{b}{\mathrm{arg}\mathrm{max}}\left({Q}_{k}(s,b)+C\sqrt{\frac{\mathrm{log}{N}_{k}(s)}{{N}_{k}(s,b)}}\right),$$ where ${Q}_{k}(s,\cdot )$ is the estimate of the Qvalue at iteration $k$, ${N}_{k}(s)$ tracks the number of times state $s$ is visited until $k$ and ${N}_{k}(s,b)$ is the number of times action $b$ has been picked when state $s$ is visited until $k$. Also, $C$ is a positive constant.

•
Using the samples collected, the RL agent can detect changes, which can be carried out in an online fashion. In the following subsection, we describe how $\u03f5$policy combined with ODCP can efficiently control an MDP system when model information is known.
5.4 Leveraging Knowledge of Context Information
When model information is known, the agent can compute a policy which will be optimal with respect to the objective defined in Section 4. However, as noted earlier, when Assumption 2 is violated, a stationary Markovian policy need not be optimal when model parameters $P$ and $R$ change. Computing a nonstationary or nonMarkovian policy is computationally infeasible, since this will involve search over an infinite set of policies. Moreover, the agent cannot employ standard dynamic programming techniques like value iteration and policy iteration Puterman (2005) to compute the optimal policy, since these iterative methods have been designed based on the fact that optimal policies for stationary MDPs are stationary and Markovian.
Our method is geared towards an alternate possibility. The autonomous agent can detect changepoints from the data comprising of experience tuples using the $\u03f5$policy as described above. It can further use this information to compute the optimal policies for nonstationary MDP environments. The exact form of this nonstationary policy can also be described. The agent (based on Assumption 3), knows exactly which MDP context is active at the starting decision epoch. It begins to collect experience tuples using $\u03f5$policy and simultaneously analyzes these samples for changes using ODCP algorithm. When the agent can sense that the context has changed, it switches to the optimal policy of the MDP context which is next in the known pattern of changes and which it presumes to be the current active context (based on the changepoint computed). After the policy switch, the agent continues to analyze the samples for changes. Then similar technique of switching continues for other contexts in the pattern as well. It is clear that this switching of policies gives rise to a nonstationary policy when all the individual policies are ordered together. Our method does not search over the large space of nonstationary, nonMarkovian policies and instead gives a piecewise stationary policy that the agent can easily compute.
5.5 Context Qlearning
For the scenario where model information of all contexts is known, the method just described is useful to obtain a piecewise stationary policy. However, in the case when context information is not known, we need to design a method for finding the optimal policy. Such a method should be sensitive to environment changes as well. We design Context Qlearning (Context QL), which is a method that can handle the learning task when model/context information is not known. The concept of Context QL is in one respect similar to RLCD da Silva et al. (2006). Both methods instantiate new models whenever a change is detected. However, unlike RLCD which utilizes a heuristic quantity for tracking and declaring changepoints, our method works in tandem with a changepoint detection algorithm, ODCP, to get information about the changes in contexts. Furthermore, Context QL updates Q values of the relevant model whenever a change is detected and does not attempt to estimate the transition and reward functions for the new model. Additionally, if the method finds that samples are obtained from a previously observed model, it updates the Q values corresponding to that model. Thus, in this manner, the information which was learnt and stored earlier (in the form of Q values) is not lost. The Context QL pseudocode is given in Algorithm 1.
The Context QL algorithm takes as input the pattern of changes in the environment models ${M}_{1},\mathrm{\dots},{M}_{k}$, so that the Q values of the right model are updated. It is to be noted that only the change pattern is known and not the context information. However, when these model changes occur, the changepoints ${T}_{1},\mathrm{\dots},{T}_{n}$ are not known to Context QL. For example, suppose the agent knows that model changes from say ${M}_{0}$ to ${M}_{1}$ and then to ${M}_{2}$. This implies that the agent knows that dynamics change from ${M}_{0}$ to ${M}_{1}$, but it does not know ${P}_{0}$, ${R}_{0}$, ${P}_{1}$, ${R}_{1}$ etc. Based on the pattern information, Context QL updates Q values pertaining to model ${M}_{0}$ initially. Later when first change is detected, it updates Q values of ${M}_{1}$, followed by updates to Q values of ${M}_{2}$ when another change is detected. The algorithm initializes a context counter $c$, which keeps track of the current active context, according to the changes detected. It maintains Q values for all known contexts $1,\mathrm{\dots},k$ and initializes the values to zero. The learning begins by obtaining experience tuples ${e}_{t}$ according to the dynamics and reward function of context ${M}_{{\theta}_{1}}$. The state and reward obtained are stored as experience tuples, since model/context information is not known. The samples can be analyzed for context changes in batch mode or online mode, which is denoted as a function call to ODCP in the algorithm pseudocode. If ODCP detects a change, then the counter $c$ is incremented, signalling that the agent believes that context has changed. The lines 616 represent this learning phase when context ${M}_{{\theta}_{1}}$ is active. Similar learning takes place for other contexts as well.
$$Q({\theta}_{c},{s}_{i},{a}_{i}):=(1\alpha )Q({\theta}_{c},{s}_{i},{a}_{i})+\alpha ({r}_{i}+\gamma \underset{b}{\mathrm{max}}Q({\theta}_{c},{s}_{i+1},b))$$ 
In line 1, the input uses the notation ${M}_{{\theta}_{i}}$ to indicate that the system can initially evolve according to any of the $k$ models and context ${M}_{0}$ need not be the initial active context. Thus, $1\le {\theta}_{i}\le k$. The information ${T}_{1},{T}_{2},\mathrm{\dots},{T}_{n}$ is known only to the environment. In Algorithm 1 it is shown as though the RL agent is aware of it. But this is not so, it has been included only to indicate that at decision epochs independent of the agent, the context changes.
Remark 1
In Section 4 and here, we have mentioned that given a family of contexts, $\{{M}_{\theta}\}$, the objective is to learn a policy $\pi =({u}_{1},{u}_{2},\mathrm{\dots})$ such that the expected sum of discounted rewards accumulated over the infinite horizon i.e., $\mathbb{E}[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t},{u}_{t}({H}_{t})){H}_{0}={h}_{0}]$ is maximized for all initial histories ${h}_{0}\in {H}_{0}$. In this Section, we described our method, Context QL, which finds optimal policies for each environment context that maximizes (or minimizes) the longrun discounted reward (or cost) criterion. It should be noted that, if, instead we have the average reward criterion for the policy wherein the objective is to maximize ${g}^{\pi}(s)=\underset{N\to \mathrm{\infty}}{lim}\frac{1}{N}\mathbb{E}[\sum _{t=1}^{N}R({s}_{t},d({s}_{t})){s}_{0}=s]$, $\forall s\in S,$ over all policies $\pi \in {\mathrm{\Pi}}^{SD}$, then our method can still be utilized. This is possible if we adapt the relative Qvalue iteration algorithm Abounadi et al. (2001) to nonstationary settings. The Context QL method can be easily extended to the average reward per step setting in this manner.
5.6 Performance Evaluation and Policies
To evaluate our method, we need a performance metric. In the classical case of stationary MDPs, the RL algorithms learn a policy (maybe deterministic or randomized) and the algorithms are evaluated based on the rewards these policies yield when they are used to control the MDP model. So, for the nonstationary algorithms proposed, in a similar fashion, we compute the rewards garnered by the algorithms. Sections 6.16.3 show the performance of Context QL w.r.t the cumulative reward gathered when the model information is not available. This is in contrast with prior works like UCRL2 Jaksch et al. (2010), variationaware UCRL Ortner et al. (2019), where regret is the performance metric (see Jaksch et al. (2010); Ortner et al. (2019) for a formal definition). Regret is a measure of the rewards which are missed by the policies yielded by these algorithms when compared to the optimal policy. In order to compute this performance metric, we ought to know the optimal policy and its performance. Clearly, the idea of computing regret is not applicable to modelfree RL, where model information is not known. Thus, in our view, measuring regret is not apt for the problem we consider. However, when model information is known, regret can be computed. We show some results for this case in Section 6.1.
With nonstationary environments, Assumption 2 does not hold true. Hence, there is no reason to believe that a stationary policy is optimal when this assumption is not satisfied in a system. There is a need to search the large space of nonstationary policies for ensuring optimal behaviour of the MDP system model in dynamic environments. An exhaustive search over this large space is computationally expensive and we require an algorithm which can find an approximately optimal policy over a smaller, restricted set. The Context QL algorithm discussed in Section 5.5 does the same. The smaller, restricted set of nonstationary policies we are interested in is the set of piecewisestationary policies, as described in Section 5.4. We evaluate the experience tuples for changes in the underlying distribution and find the changepoints. Before the change occurs, the RL agent follows a stationary optimal policy learnt for the respective context. After the change, the RL agent switches and learns another stationary policy. Also, if the environment was previously observed, then the RL agent updates the policy learnt for the same environment, by utilizing the additional samples obtained.
5.7 Scalability of Context QL
Context QL (Algorithm 1) stores Q values for every context. From the learnt Q values, a policy is derived Watkins and Dayan (1992) and followed. However, this memory requirement is independent of the number of changes in the environment. For e.g., if ${\theta}_{i}\in \mathrm{\Theta}=\{1,2\}$ and the model change pattern is ${M}_{{\theta}_{1}}\to {M}_{{\theta}_{2}}\to {M}_{{\theta}_{1}}\to {M}_{{\theta}_{2}}\to {M}_{{\theta}_{1}}\to {M}_{{\theta}_{2}}$, the number of Q tables stored will be two, one for each of the models ${M}_{{\theta}_{1}}$ and ${M}_{{\theta}_{2}}$. In this example the number of changes is $5$, which does not affect learning of either of the two models. However, the learning accuracy and the policy learnt is directly dependent on how many samples are obtained for each of the contexts/models.
6 Experimental Results
In this section, we evaluate our method for accuracy in the changepoints detected and the reward accrued. The experience tuples are collected from randomlygenerated MDPs (in Section 6.1), from a sensor application (see Section 6.2) and a traffic application (see Section 6.3). All numerical experiments are carried out using R statistical package and Python programming language.
We also compare the accuracy of changepoints detected by ODCP Prabuchandran et al. (2019) and EDivisive (ECP) Matteson and James (2014) for the data consisting of experience tuples. ECP is a changepoint detection algorithm. It is seen to perform well on many synthetic multivariate datasets. Numerical simulations described in Matteson and James (2014) show that it can detect changes in mean, variance and covariance of dataset. Additionally, it is seen to perform well on many real datasets Matteson and James (2014).
6.1 Random MDP
We test our method on different Random MDP models generated using MDPtoolbox ^{1}^{1} 1 https://cran.rproject.org/web/packages/MDPtoolbox. First, the methods are tested for single changepoint detection followed by multiple changepoints detection. The results below are grouped according to this. In all experiments, the discount factor $\gamma $ is set to $0.9$.
6.1.1 Single Changepoint Detection
We consider model change from ${M}_{0}$ to ${M}_{1}$ and collect $2000$ samples (i.e., the threedimensional experience tuples). The actual model change occurs at ${T}_{1}=1000$. The first $1000$ statereward samples are from ${M}_{0}$ and the rest $1000$ samples are from ${M}_{1}$. Let ${\tau}^{*}$ be the changepoint detected by the various methods. Table 1 summarizes the changepoints detected when we assume that model information is known and the agent needs to determine when the environment switches from ${M}_{0}$ to ${M}_{1}$. Note that, as remarked in Section 5, the best possible alternative to using a nonstationary policy in this scenario is to switch to an appropriate stationary policy. Since model information is known, optimal policies for both environments are also known and the agent can switch between these optimal policies if it can reliably determine the changepoint. In order to detect changes, the samples are collected using a randomized policy with $\u03f5=0.1$, as described in Section 5.3.
Table 2 summarizes the changepoints detected when we assume that model information is not known. Thus, in addition to determining the changepoints, the agent also needs to estimate the optimal policies for the environments. In order to detect changes, the samples are collected using $\u03f5$greedy and UCB policies.
Change Detection Algorithm  Mean of ${\tau}^{*}$  SD of ${\tau}^{*}$  Median of ${\tau}^{*}$ 

ODCP with $\u03f5$policy  $999.3$  $25.88659$  $1001.5$ 
ECP with $\u03f5$policy  $1008.65$  $32.21029$  $1002$ 
Method  Mean of ${\tau}^{*}$  SD of ${\tau}^{*}$  Median of ${\tau}^{*}$ 
ODCP, QL with UCB  $1001.11$  $31.86$  $997$ 
ECP, QL with UCB  $1006.22$  $36.65$  $1001$ 
ODCP, QL with $\u03f5$greedy  $986.9444$  $34.769$  $999$ 
ECP, QL with $\u03f5$greedy  $1027.833$  $59.69$  $1002$ 

In Tables 1 and 2, the mean, median and standard deviation of ${\tau}^{*}$ (over $20$ sample trajectories) is presented. It can be observed from Table 1 and Table 2 that ODCP and ECP detect change in model using just tuples (changes manifest in tuples), promising that experience tuples can be utilized to detect changes in environment contexts. As can be inferred from these results, the average and median of ${\tau}^{*}$ found by ECP and ODCP are very close, but mean ODCP performance is better compared to ECP. However, it is observed that when experience tuples are analyzed for changepoints, ECP has a higher standard deviation in the changepoints detected when compared to ODCP. In Table 2, ODCP with QL and ECP with QL utilize QL to learn policies, but these do not maintain separate Qvalues for the environments. This table also shows results with two exploration strategies. In Table 2, UCB exploration is much better than $\u03f5$greedy in both ODCP and ECP algorithms. This is because of the constant value of $\u03f5$. Even after optimal actions are learnt for the various states, exploratory actions would continue to be chosen. This is not so with UCBbased algorithms. Also, ODCP shows better results on the whole as compared to ECP.
For the case when model information is not known, we also implemented RLCD da Silva et al. (2006) and its extension Hadoux et al. (2014) to detect changes in the environment. These algorithms rely on tracking the quality of the model learnt. If the quality crosses some threshold, a new model is instantiated and estimated. The choice of this threshold is crucial. In our simulations, we observed that RLCD instantiates more than four environment models, even though there is a single change in the context. The same results were observed with Hadoux et al. (2014). Due to this reason, the results pertaining to RLCD and Hadoux et al. (2014) were not included in Table 2.
In Tables 3, 4 we analyze the precision and recall Tatbul et al. (2018) metric for algorithms proposed in Banerjee et al. (2017); Roveri (2019) and compare it with the performance of Context QL with respect to this metric. For this, the model information is known and all policies are computed based on the context information. The twothreshold switching strategy proposed in Banerjee et al. (2017) utilizes the known transition probabilities of all contexts to compute the ShiryaevRoberts statistic (SR) Shiryaev (1963) for every decision epoch. This value is initialized to $0$ and is updated as follows:
$$S{R}_{t+1}=(1+S{R}_{t})\frac{{P}_{{\theta}_{1}}(s,a,{s}^{\prime})}{{P}_{{\theta}_{0}}(s,a,{s}^{\prime})},\forall t>0.$$ 
If SR is lower than a prefixed threshold $B$, then optimal policy ${\pi}_{{\theta}_{0}}^{*}$ corresponding to the MDP model ${M}_{{\theta}_{0}}$ is followed. However, if SR is greater than $B$, but lower than another prefixed threshold $A>B$, then a policy ${\pi}_{KL}$ as designed below is followed:
$${\pi}_{KL}(s)=\underset{a\in A}{\mathrm{arg}\mathrm{max}}\sum _{{s}^{\prime}}{P}_{{\theta}_{1}}(s,a,{s}^{\prime})*\mathrm{log}\left(\frac{{P}_{{\theta}_{1}}(s,a,{s}^{\prime})}{{P}_{{\theta}_{0}}(s,a,{s}^{\prime})}\right),\forall s\in S.$$ 
The above policy ${\pi}_{KL}$ is designed for better detection of changes, though it may not be an optimal policy for model ${M}_{{\theta}_{0}}$. Once SR crosses the threshold $A$, the optimal policy ${\pi}_{{\theta}_{1}}^{*}$ is followed. The key feature of the twothreshold strategy is to prefix appropriate values of $A$ and $B$. Banerjee et al. (2017) does not provide any method for identifying these values and hence in the experiments, we have shown results of precision and recall with multiple sets of these values. The PCDM and NPCDM algorithms proposed in Roveri (2019) detect changes in the transition probability of discretetime Markov chains (DTMCs). We adapt these to find concept drift in MDPs when context information is known. MDP context information being known implies that the optimal policies of each MDP model can be computed. When an optimal policy is followed, we obtain a Markov chain for the corresponding MDP model. For this chain, we further adapt the algorithms PCDM and NPCDM to find the changepoints. PCDM and NPCDM algorithms compute the stationary distribution ${\xi}_{j}$ for context ${M}_{{\theta}_{j}},\forall j\in \{1,2,\mathrm{\dots},k\}$. Both algorithms consider the nonoverlapping sequences of states which evolve based on the known transition probabilities. A moving window ${w}_{i}$ of state samples at epoch $i$ comprises of the states $\{{s}_{({w}_{i1}+1)},\mathrm{\dots},{s}_{{w}_{i}}\}$. PCDM and NPCDM differ only in the manner in which this sequence is picked. For the sequence picked, PCDM and NPCDM compute the loglikelihood ratio ${l}_{i}=\mathrm{log}\left(\frac{{\mathbb{P}}_{{\theta}_{1}}({w}_{i})}{{\mathbb{P}}_{{\theta}_{0}}({w}_{i})}\right)$, where
$${\mathbb{P}}_{{\theta}_{j}}({w}_{i})={\xi}_{j}({s}_{({w}_{i1}+1)})\prod _{k={w}_{i1}+1}^{{w}_{i}1}{P}_{{\theta}_{j}}(s,{\pi}_{{\theta}_{j}}^{*},{s}^{\prime}).$$ 
These algorithms track the sign of ${l}_{i}$ and compute ${m}_{i}=\mathrm{max}(0,{m}_{i1}+sign({l}_{i}))$. If ${m}_{i}\ge K$ where $K$ is a prefixed threshold value, then a change is declared. In the numerical simulations, we vary $K$ and note the performance of PCDM and NPCDM w.r.t the precision and recall metric. For the numerical experiments in Tables 3, 4 a datapoint being labelled as a changepoint is said to be correct if it lies within a predefined window size $W=100$, $W=50$ respectively, of the actual, true changepoint. Based on this criterion, for each of the algorithms, the results in Tables 3, 4 are obtained by averaging over $20$ Monte Carlo simulations, i.e., we sample $20$ sample paths randomly for performing MonteCarlo averaging.
PCDM  NPCDM  Twothreshold Switching Strategy  Context QL  
K  Prec.  Recall  K  Prec.  Recall  A  B  Prec.  Recall  Prec.  Recall 
5  1  0.7  5  0.1  1  100  50  0.05  1  
10  1  0.05  10  0.6  0.642  1000  500  0.706  0.8  1  0.75 
15  0  0  15  0.8125  0.7647  10000  5000  0.75  0.333  
20  0  0  20  0.8235  0.8235  100000  50000  1  0.2 
From Table 3, we observe that PCDM shows very poor performance, when compared to the other algorithms. With a smaller window size, this performance is only expected to degrade and hence results of PCDM are not included in Table 4. It is seen that with increase in the value of $K$, the performance of NPCDM improves. However, similar to Banerjee et al. (2017), where values of $A$ , $B$ are fixed after trial and error, the values of $K$ are also prefixed in a similar manner  this implies there is no single value of $K$ which gives good results for all sizes of stateaction space. This can be inferred from Table 5, where we used $K=20$ for NPCDM. Though this value gives good precision and recall values with $S=A=5$, the same value, when used for larger stateaction space MDPs degrades the performance of NPCDM. This can be observed in Table 5. With a smaller window, the precision and recall values are expected to decrease, which is observed in Table 4. Note that, since Context QL has no threshold values to be tuned, we have a single row for Context QL in both Tables 3 and 4. The identical row for Context QL in both Tables 3 and 4 indicates that the changepoints detected by Context QL lies within the smaller window size of $50$.
NPCDM  Twothreshold Switching Strategy  Context QL  
K  Prec.  Recall  A  B  Prec.  Recall  Prec.  Recall 
5  0.1  1  100  50  0.05  1  
10  0.6  0.642  1000  500  0.533  0.6154  1  0.75 
15  0.8125  0.7647  10000  5000  0.6  0.1667  
20  0.4  0.1176  100000  50000  0  0 
Table 5 shows the variation in precision and recall metric of Context QL, twothreshold switching strategy and NPCDM with respect to the variation in the size of stateaction space of MDPs. We have set a window size of $W=100$ for these simulations. Based on Table 3, we set $K=20$ for NPCDM and $A=1000$, $B=500$ for twothreshold switching strategy. Though we have inferred that no single value of $K$, $A$, $B$ is apt for these algorithms, we have used a single value for the experiments in Table 5, since our focus is to show the changes in precision and recall metric with changes in stateaction space.
$S$  $A$  Context QL  Twothreshold Switching Strategy  NPCDM  
Prec.  Recall  Prec.  Recall  Prec.  Recall  
5 
5  1  0.75  0.706  0.8  0.8235  0.8235 
8  8  1  1  0  0  0.8  0.706 
12  12  1  1  0  0  0.723  0.632 
In Table 5, we see that performance of twothreshold switching strategy and NPCDM degrades with increase in size of stateaction space. This degradation is gradual in NPCDM, however with twothreshold strategy, there is a drastic decrease in the performance. This is due to the fact that the prefixed threshold values which are apt for $S=A=5$, are not apt for $S=A=8$ and $S=A=12$. However, for other sizes of stateaction space, we do not know how to predetermine and fix appropriate threshold values.
Next, in Table 6, we provide results for the rewards collected by QL, Context QL, UCRL2 Jaksch et al. (2010), RLCD da Silva et al. (2006) and RUQL Abdallah and Kaisers (2016) when the model information is not known. These results show the performance of these algorithms when there is a change in the environment from ${M}_{0}$ to ${M}_{1}$ at ${T}_{1}=1000$ with total number of experience tuples collected being $2000$. As a reference, we also provide the reward collected by ODCPbased and ECPbased methods when model information (and hence optimal policies) is known for both environments. The ODCPbased and ECPbased methods use $\u03f5$policy for detecting changes and control MDP using optimal policy of ${M}_{0}$ followed by that of ${M}_{1}$ after a change is detected. Suppose the changepoint detected by ODCP is ${\tau}^{*}$. Let ${\pi}_{0}^{*}$ and ${\pi}_{1}^{*}$ be the optimal policies of ${M}_{0}$ and ${M}_{1}$ respectively. The ODCP and ECP methods with $\u03f5$policy, control the MDP using policy ${\pi}_{0}^{*}$ from epoch $1$ to epoch ${\tau}^{*}1$ and from epoch ${\tau}^{*}$ these pick actions according to ${\pi}_{1}^{*}$.
QL does not detect changepoints and instead updates the same set of Q values, unlike Context QL (as described in Algorithm 1). RUQL is similar to QL, but its exploration strategy and stepsize schedule differ from QL. UCRL2 estimates the upper confidence bounds on the transition probability and reward functions as described in Jaksch et al. (2010). Also, we restart UCRL2 algorithm when the number of steps satisfies the criterion described in Theorem 6 of Jaksch et al. (2010).
Method  Mean $\pm $ SD  Median 

ODCP $\u03f5$policy  $538.67\pm 14.28$  $538.0404$ 
ECP $\u03f5$policy  $515.1421\pm 23.26$  $520.2686$ 
Context QL  $365.6\pm 64.14$  $362.06$ 
QL  $287.4\pm 69$  $284.4$ 
UCRL2  $176.83\pm 11.25$  $178.3511$ 
RUQL  $103.41\pm 26.6$  $99$ 
RLCD  $91.3341\pm 20.77$  $97.41$ 
As observed from Table 6, RLCD and RUQL perform poorly when compared to QL, UCRL2 and Context QL in terms of the total reward obtained. In UCRL2 though the reward obtained is less when compared to QL and Context QL, the standard deviation is much lower when compared with QL and Context QL. However, unlike QL and Context QL, it has high space complexity, since it maintains estimates of probability and reward function along with various stateaction counters required for the upper confidence bounds.
In the numerical experiments of Table 6, we used $\u03f5$greedy policy exploration for QL and Context QL. QL utilizes one set of Q values to learn optimal policies for contexts ${M}_{0}$ and ${M}_{1}$. It does not detect changes, but updates Q values using rewards from both the environments. This is where the issue arises. When the environment changes at ${T}_{1}$ from model ${M}_{0}$ to ${M}_{1}$, the Q values would have been updated using rewards from ${M}_{0}$. Once the environment starts providing samples from context ${M}_{1}$, the action choice of the agent is biased by the already updated Q values. This does not occur in the Context QL, since once a change is detected, Context QL starts updating another set of Q values for the new environment.
The performance of algorithms w.r.t the regret criterion is shown in Table 7, when the changepoints obtained are analyzed by ODCP and ECP. The regret is computed for each sample trajectory comprising of $2000$ experience tuples, with a changepoint at ${T}_{1}=1000$. The results in Table 7 are computed over $20$ such sample trajectories.
Method  Mean $\pm $ SD  Median 

Modelbased, ODCP $\u03f5$policy  $20\pm 15$  $16.6$ 
Modelbased, ECP $\u03f5$policy  $39\pm 26$  $52$ 
6.1.2 Multiple Changepoints Detection
We evaluate the accuracy of ODCP Prabuchandran et al. (2019) and ECP Matteson and James (2014) on MDP for multiple changepoints. In the experiments, the model alternates thrice between ${M}_{0}$ and ${M}_{1}$ starting with ${M}_{0}$. With $2000$ samples, changepoints are fixed at ${T}_{1}=500$, ${T}_{2}=1000$ and ${T}_{3}=1500$. Averaged over $20$ Monte Carlo simulations, mean of ${\tau}_{1}^{*}=520$, mean of ${\tau}_{2}^{*}=1059$ and mean of ${\tau}_{3}^{*}=1510$ for our method, while ECP identifies only the first changepoint with mean $855$. More importantly, ECP fails to detect the second and third changepoints. The reliability with which ODCP detects changepoints is better compared to ECP. In this setting of multiple changepoints, we also compared the performance of RLCD da Silva et al. (2006) with ODCP. The RLCD algorithm detects more than $3$ changepoints in this scenario.
In Table 8, we show the total reward gained by the policies learnt by algorithms QL, UCRL2, RUQL and Context QL. For this, the samples were obtained from a Random MDP system. The MDP environment alternates between ${M}_{0}$ and ${M}_{1}$ twice, i.e., the sequence of context changes is ${M}_{0}\to {M}_{1}\to {M}_{0}\to {M}_{1}$. QL and RUQL maintain a single set of Q values to learn the policy and UCRL2 is simulated with restarts as described in Theorem 6 of Jaksch et al. (2010). Context QL maintains two different estimates of the Q values  one for each MDP environment and resumes updating the appropriate Q values when a change is detected. For the numerical experiments, we get $T=4000$ $(s,{s}^{\prime},r)$ samples, with ${T}_{1}=1000$, ${T}_{2}=2000$ and ${T}_{3}=3000$. Thus, at ${T}_{1}$, the MDP model changes from ${M}_{0}$ to ${M}_{1}$. At ${T}_{2}$, it again flips to ${M}_{0}$ and so on.
Method  Mean $\pm $ SD  Median 

Context QL  $945.94\pm 146.54$  $959.92$ 
QL  $866.9\pm 89$  $882.2$ 
UCRL2  $439.78\pm 27.31$  $436.98$ 
RUQL  $253.18\pm 40$  $255.54$ 
6.2 Energy Management in a Single Sensor Node with Finite Buffer
We consider the model described in Prabuchandran et al. (2013) which proposes an energy management (EM) MDP model for a sensor node with energy harvesting capabilities. Sensor node has a finite energy buffer to store the energy harvested from the ambient environment and a finite data buffer to store the data generated. The authors assume that energy units are harvested at a mean rate of ${\lambda}_{E}$, while data bits are generated at a mean rate of ${\lambda}_{D}$. The state of the system comprises of the current levels of energy and data buffers and the RL agent needs to decide on the number of energy units to be used for transmission. The actual number of data bits transmitted is a nonlinear function of the number of energy units utilized. The RL agent needs to minimize the longterm discounted cost by finding a suitable policy. The immediate cost per step is the queue length of data buffer after successful transmission. In Prabuchandran et al. (2013), model information is unknown and hence QL is used to find optimal EM policies.
A sensor which is designed to harvest energy from the ambient environment, like for e.g., solar energy, has to appropriately modify its policy based on how ${\lambda}_{E}$ changes with day timings. We assume that the sensor monitors a physical system which generates data at a fixed rate that does not change over time. A change in ${\lambda}_{E}$ gives rise to nonstationary environments. We consider this scenario and show that our method is effective in handling changing mean rate of energy harvest, when compared to QL, repeated update QL (RUQL) Abdallah and Kaisers (2016). In our experiments, the exploration strategy used is $\u03f5$greedy with $\u03f5=0.1$. We analyze our method and QL, RUQL for regret when the environmental model changes once. The number of iterations for learning phase is set to $4000$ with a changepoint at $2000$.
Method  Mean $\pm $ SD  Median 

Context QL  $498.75\pm 48.78$  $500$ 
RUQL  $803.7\pm 121.3673$  $800.5$ 
QL  $675.5\pm 50.96$  $677$ 

The longrun discounted cost obtained by our method, QL and RUQL is shown in Table 9. We are unable to compute the regret of these algorithms since model is unknown.
From Table 9, it is clear that Context QL does better compared to other modelfree nonstationary algorithms like RUQL. The purpose of evaluating policy learnt by QL in nonstationary environments is that if policy learnt by QL has performance which is comparable to Context QL, then it is easier to use QL albeit with a slight loss in performance. However, from Table 9, we observe that by using QL, RUQL, there is considerable loss in performance when environment contexts change. This is an implication of the fact that Context QL remembers policies for the environment models and updates the right set of Q values corresponding to the model. QL, RUQL on the other hand update the same set of Q values for the two different contexts. This means the policy learnt by QL, RUQL is good only for the context which was the last to be observed in the sequence of contexts ${M}_{{\theta}_{1}},{M}_{{\theta}_{2}},\mathrm{\dots},{M}_{{\theta}_{n}}$, where $n=2$ in the above experiments. For the same sequence of models, ContextQL learns a good policy for each of these contexts and stores them in the form of Q values. While evaluating, we detect the change in context and follow the appropriate policy for the next model in the sequence.
6.3 Traffic Signal Control
As highlighted in Section 1, vehicular traffic signal control is a sequential decisionmaking problem. In the experiments, we show that our method is effective in finding changes in vehicular patterns and learn the optimal policies for the same. The experimental setup consists of a single junction with four incoming lanes. The traffic junction is illustrated in Fig. 3, 4 which are snapshots of the simulations carried out in VISSIM ^{1}^{1} 1 http://visiontraffic.ptvgroup.com/. The junction is controlled by a signal. We model the traffic signal duration control as a MDP following Prashanth and Bhatnagar (2011). The state of the junction is the information consisting of queue lengths of all incoming lanes and the current phase. The phase indicates which incoming lane should be given the green signal. In order to tackle the state space dimensionality, we aggregate the queue lengths of lanes as $\text{low}=0$, $\text{medium}=1$ and $\text{high}=2$. Hence, if a lane congestion level is onethird of its length, then we say that the aggregated state of that lane is $0$. If the lane congestion level is higher than onethird of the lane distance but lower than twothirds the distance, then aggregated state of that lane is $1$. For high congestion levels, the aggregated state is $2$. With this lane queue length aggregation scheme, the state space dimensionality is reduced to ${3}^{4}\times 4$. The actions for the signal controller correspond to the set of green signal durations $\{20,25,\mathrm{\dots},70\}$ in seconds. The immediate cost is the sum of the lane queue lengths and the RL agent must minimize the longterm discounted cost.
The traffic RL agent learns a policy using QL. We train the traffic RL agent for ${10}^{6}$ simulation seconds with a change in vehicular input volumes after the simulation has run for half the time. A change in the input vehicular volumes causes a change in the environment dynamics.
The longrun discounted costs obtained by QL and Context QL are shown in Table 10.
Method  Mean $\pm $ SD  Median 

Context QL  $1100.022\pm 34$  $1000$ 
QL  $1400.1\pm 67$  $1300$ 
6.4 Summary of Results
As observed in Tables 110, the results of our method are promising. Tables 1, 2 show the detection delay of the modelbased ODCP and ECP algorithms when adapted to the RL setting. Their performance is comparable for both settings  when model information is known and when it is not known. Both methods detect changes with least detection delay as can be judged from the mean, median and standard deviation values shown in these tables. The false positive, false negative errors and the true positive values can be inferred from Tables 3, 4 and 5 which indicate the precision and recall metric for NPCDM Roveri (2019), twothreshold switching strategy Banerjee et al. (2017) and Context QL. It can be seen that Context QL is quite robust to the size of stateaction space, has good performance and does not require tuning of any threshold values, unlike NPCDM or the twothreshold switching strategy. Tuning these values requires model information and lot of trial and error. With reallife applications, one may not have model information. Tuning the values by trial and error may not be even feasible in reallife applications. Thus, Context QL presents a great advantage when the algorithm is to be used in realworld applications.
Tables 6, 8 show the average total reward obtained by QL, RUQL, UCRL2, RLCD and Context QL. The gap in total reward obtained by QL and Context QL is explicit in the multiple changepoints case where the contexts change in the fashion ${M}_{0}\to {M}_{1}\to {M}_{0}\to {M}_{1}$. This is because, Context QL has separate Q value data structure for each environment. So, after the second and third changepoints, it faces the same environment that was observed earlier. The samples obtained after 2nd changepoint (i.e., after ${\tau}_{2}^{*}$) are used to update Q values of ${M}_{0}$ again. Similarly, after the 3rd changepoint (i.e., after ${\tau}_{3}^{*}$), the reward samples are used to update Q values of ${M}_{1}$ again. Hence even though the environment alternates between ${M}_{0}$ and ${M}_{1}$ twice, for each environment, we get almost equal number of reward samples.
In the experiments, we tested Context QL on two realistic applications  one in energy management in energy harvesting sensors and the other in traffic signal duration control. These are applications where the effect of changing environment parameters or system operating conditions is clearly visible. It is seen in Tables 9 and 10 that Context QL performs much better when compared to classical QL. Context QL captures the change in joint distribution of $(s,a,r)$ tuples. When this distribution changes, the ODCP indicates that a changepoint is detected. This detection is based on permutation tests in ODCP and is hence flexible. It should be noted that Assumptions 3, 4 are important for our method. These assumptions are reasonable, because it is often observed that in applications like traffic signal control and sensor energy management, although the operating conditions change, they do not change very frequently. For e.g., in traffic signal control, the conditions change when vehicular inflow rate changes. It is observed that the vehicular rate remains constant for at least three to four hours in a single day before changing, depending on rushhour traffic and early morning or late night traffic. Hence before every change, we get sufficient number of state and reward samples for our method. Additionally, in applications, the pattern of context changes is also usually known.
7 Conclusions
This work develops a modelfree RL method known as Context QL for learning optimal policies in changing environment models. A novel change detection algorithm for experience tuples is used to determine changes in environment models in conjunction with QL. The numerical experiments in realistic applications show that Context QL is promising, since, the policies learnt by the method are seen to perform well in varying operating conditions and give better return, when compared to classical QL and other nonstationary learning algorithms like RUQL and RLCD. We additionally showed the statistical performance of Context QL with respect to the precision and recall metric, which is an important metric in time series and changepoint literature.
Future enhancements to this work can focus on detecting changes in the context of large and continuous state space models. Such an extension will indeed be useful in robotics and intelligent transportation systems. Also, if the sequence of context changes is not known, then we require some changes to the proposed method. An interesting extension involves metalearning Nagabandi et al. (2018) which can be adapted to continuous and large stateaction space MDPs.
References
 Abdallah and Kaisers (2016) Abdallah S, Kaisers M (2016) Addressing Environment NonStationarity by Repeating Qlearning Updates. Journal of Machine Learning Research 17(46):1–31
 Abounadi et al. (2001) Abounadi J, Bertsekas D, Borkar V (2001) Learning Algorithms for Markov Decision Processes with Average Cost. SIAM Journal on Control and Optimization 40(3):681–698, DOI 10.1137/S0363012999361974
 Andrychowicz et al. (2019) Andrychowicz M, et al. (2019) Learning dexterous inhand manipulation. The International Journal of Robotics Research DOI 10.1177/0278364919887447
 Banerjee et al. (2017) Banerjee T, Miao Liu, How JP (2017) Quickest change detection approach to optimal control in markov decision processes with model changes. In: 2017 American Control Conference (ACC), pp 399–405, DOI 10.23919/ACC.2017.7962986
 Bertsekas (2013) Bertsekas D (2013) Dynamic Programming and Optimal Control, vol II, 4th edn. Athena Scientific, Belmont,MA
 Cano and Krawczyk (2019) Cano A, Krawczyk B (2019) Evolving rulebased classifiers with genetic programming on gpus for drifting data streams. Pattern Recognition 87:248 – 268, DOI https://doi.org/10.1016/j.patcog.2018.10.024
 Choi et al. (2000a) Choi SP, Yeung DY, Zhang NL (2000a) HiddenMode Markov Decision Processes for Nonstationary Sequential Decision Making. In: Sequence Learning, Springer, pp 264–287
 Choi et al. (2000b) Choi SPM, Yeung DY, Zhang NL (2000b) An Environment Model for Nonstationary Reinforcement Learning. In: Solla SA, Leen TK, Müller K (eds) Advances in Neural Information Processing Systems 12, MIT Press, pp 987–993
 Csáji and Monostori (2008) Csáji BC, Monostori L (2008) Value Function Based Reinforcement Learning in Changing Markovian Environments. J Mach Learn Res 9:1679–1709
 Dick et al. (2014) Dick T, György A, Szepesvári C (2014) Online Learning in Markov Decision Processes with Changing Cost Sequences. In: Proceedings of the 31st International Conference on International Conference on Machine Learning  Volume 32, JMLR.org, ICML’14, p I–512–I–520
 Ding et al. (2019) Ding S, Du W, Zhao X, Wang L, Jia W (2019) A new asynchronous reinforcement learning algorithm based on improved parallel PSO. Appl Intell 49(12):4211–4222, DOI 10.1007/s10489019014874, URL https://doi.org/10.1007/s10489019014874
 Everett and Roberts (2018) Everett R, Roberts S (2018) Learning against nonstationary agents with opponent modelling and deep reinforcement learning. In: 2018 AAAI Spring Symposium Series
 Hadoux et al. (2014) Hadoux E, Beynier A, Weng P (2014) Sequential DecisionMaking under Nonstationary Environments via Sequential Changepoint Detection. In: Learning over Multiple Contexts (LMCE), Nancy, France
 Hallak et al. (2015) Hallak A, Castro DD, Mannor S (2015) Contextual Markov Decision Processes. In: Proceedings of the 12th European Workshop on Reinforcement Learning (EWRL 2015)
 Harel et al. (2014) Harel M, Mannor S, ElYaniv R, Crammer K (2014) Concept Drift Detection Through Resampling. In: International Conference on Machine Learning, pp 1009–1017
 Iwashita and Papa (2019) Iwashita AS, Papa JP (2019) An Overview on Concept Drift Learning. IEEE Access 7:1532–1547, DOI 10.1109/ACCESS.2018.2886026
 Jaksch et al. (2010) Jaksch T, Ortner R, Auer P (2010) Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11:1563–1600
 Kaplanis et al. (2019) Kaplanis C, et al. (2019) Policy consolidation for continual reinforcement learning. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, vol 97, pp 3242–3251
 Kemker et al. (2018) Kemker R, et al. (2018) Measuring catastrophic forgetting in neural networks. In: Thirtysecond AAAI conference on artificial intelligence
 Kolomvatsos and Anagnostopoulos (2017) Kolomvatsos K, Anagnostopoulos C (2017) Reinforcement learning for predictive analytics in smart cities. In: Informatics, Multidisciplinary Digital Publishing Institute, vol 4, p 16
 Konda and Tsitsiklis (2003) Konda VR, Tsitsiklis JN (2003) On ActorCritic Algorithms. SIAM Journal on Control and Optimization 42(4):1143–1166
 Krawczyk and Cano (2018) Krawczyk B, Cano A (2018) Online ensemble learning with abstaining classifiers for drifting and noisy data streams. Applied Soft Computing 68:677 – 692, DOI https://doi.org/10.1016/j.asoc.2017.12.008
 Levin et al. (2006) Levin DA, Peres Y, Wilmer EL (2006) Markov Chains and Mixing Times. American Mathematical Society
 Liebman et al. (2018) Liebman E, Zavesky E, Stone P (2018) A Stitch in Time  Autonomous Model Management via Reinforcement Learning. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and Multiagent Systems, AAMAS ’18, p 990–998
 Matteson and James (2014) Matteson DS, James NA (2014) A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data. Journal of the American Statistical Association 109(505):334–345
 Minka (2000) Minka T (2000) Estimating a Dirichlet distribution
 Mohammadi and AlFuqaha (2018) Mohammadi M, AlFuqaha A (2018) Enabling cognitive smart cities using big data and machine learning: Approaches and challenges. IEEE Communications Magazine 56(2):94–101, DOI 10.1109/MCOM.2018.1700298
 Nagabandi et al. (2018) Nagabandi A, et al. (2018) Learning to Adapt: MetaLearning for ModelBased Control. CoRR abs/1803.11347, URL http://arxiv.org/abs/1803.11347
 Niroui et al. (2019) Niroui F, Zhang K, Kashino Z, Nejat G (2019) Deep Reinforcement Learning Robot for Search and Rescue Applications: Exploration in Unknown Cluttered Environments. IEEE Robotics and Automation Letters 4(2):610–617, DOI 10.1109/LRA.2019.2891991
 Ortner et al. (2019) Ortner R, Gajane P, Auer P (2019) Variational Regret Bounds for Reinforcement Learning. In: Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence
 Page (1954) Page ES (1954) Continuous Inspection Schemes. Biometrika 41(1/2):100–115
 Prabuchandran et al. (2013) Prabuchandran KJ, Meena SK, Bhatnagar S (2013) Qlearning based energy management policies for a single sensor node with finite buffer. Wireless Communications Letters, IEEE 2(1):82–85, DOI 10.1109/WCL.2012.112012.120754
 Prabuchandran et al. (2019) Prabuchandran KJ, Singh N, Dayama P, Pandit V (2019) Change Point Detection for Compositional Multivariate Data. arXiv 1901.04935
 Prashanth and Bhatnagar (2011) Prashanth LA, Bhatnagar S (2011) Reinforcement learning with average cost for adaptive control of traffic lights at intersections. In: 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pp 1640–1645, DOI 10.1109/ITSC.2011.6082823
 Puterman (2005) Puterman ML (2005) Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd edn. John Wiley & Sons, Inc., New York, NY, USA
 Roveri (2019) Roveri M (2019) Learning DiscreteTime Markov Chains Under Concept Drift. IEEE Transactions on Neural Networks and Learning Systems 30(9):2570–2582, DOI 10.1109/TNNLS.2018.2886956
 Salkham and Cahill (2010) Salkham A, Cahill V (2010) Soilse: A decentralized approach to optimization of fluctuating urban traffic using Reinforcement Learning. In: 13th International IEEE Conference on Intelligent Transportation Systems, pp 531–538, DOI 10.1109/ITSC.2010.5625145
 Shiryaev (1963) Shiryaev A (1963) On Optimum Methods in Quickest Detection Problems. Theory of Probability & Its Applications 8(1):22–46
 da Silva et al. (2006) da Silva BC, Basso EW, Bazzan ALC, Engel PM (2006) Dealing with NonStationary Environments Using Context Detection. In: Proceedings of the 23rd International Conference on Machine Learning, Association for Computing Machinery, ICML ’06, p 217–224, DOI 10.1145/1143844.1143872
 Sutton and Barto (2018) Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge, MA, USA
 Sutton et al. (1999) Sutton RS, McAllester D, Singh S, Mansour Y (1999) Policy Gradient Methods for Reinforcement Learning with Function Approximation. In: Proceedings of the 12th International Conference on Neural Information Processing Systems, pp 1057–1063
 Tatbul et al. (2018) Tatbul N, Lee TJ, Zdonik S, Alam M, Gottschlich J (2018) Precision and Recall for Time Series. In: Advances in Neural Information Processing Systems, pp 1920–1930
 Tijsma et al. (2016) Tijsma AD, Drugan MM, Wiering MA (2016) Comparing exploration strategies for qlearning in random stochastic mazes. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp 1–8, DOI 10.1109/SSCI.2016.7849366
 Watkins and Dayan (1992) Watkins CJ, Dayan P (1992) Qlearning. Machine learning 8(34):279–292
 Yu and Mannor (2009) Yu JY, Mannor S (2009) Online learning in markov decision processes with arbitrarily changing rewards and transitions. In: 2009 International Conference on Game Theory for Networks, pp 314–322, DOI 10.1109/GAMENETS.2009.5137416
 Zhao et al. (2019) Zhao X, et al. (2019) Applications of Asynchronous Deep Reinforcement Learning Based on Dynamic Updating Weights. Applied Intelligence 49(2):581–591, DOI 10.1007/s104890181296x, URL https://doi.org/10.1007/s104890181296x