Reinforcement Learning under Drift

  • 2019-06-07 06:32:25
  • Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
  • 6


We propose algorithms with state-of-the-art \emph{dynamic regret} bounds forun-discounted reinforcement learning under drifting non-stationarity, whereboth the reward functions and state transition distributions are allowed toevolve over time. Our main contributions are: 1) A tuned Sliding WindowUpper-Confidence bound for Reinforcement Learning with Confidence-Widening(\texttt{SWUCRL2-CW}) algorithm, which attains low dynamic regret boundsagainst the optimal non-stationary policy in various cases. 2) TheBandit-over-Reinforcement Learning (\texttt{BORL}) framework that furtherpermits us to enjoy these dynamic regret bounds in a parameter-free manner.


Quick Read (beta)

Reinforcement Learning under Drift

Wang Chi Cheung

Department of Industrial Systems Engineering and Management, National University of Singapore [email protected]

David Simchi-Levi

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected]

Ruihao Zhu

Statistics and Data Science Center, Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected]

We propose algorithms with state-of-the-art dynamic regret bounds for un-discounted reinforcement learning under drifting non-stationarity, where both the reward functions and state transition distributions are allowed to evolve over time. Our main contributions are: 1) A tuned Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence-Widening (SWUCRL2-CW) algorithm, which attains low dynamic regret bounds against the optimal non-stationary policy in various cases. 2) The Bandit-over-Reinforcement Learning (BORL) framework that further permits us to enjoy these dynamic regret bounds in a parameter-free manner.

Key words: non-stationary reinforcement learning, Markov decision process, parameter-free algorithm


Consider a discrete-time Markovian decision process (MDP) where a decision-maker (DM) interacts with a system iteratively: in each round, the DM first observes the current state of the system, and then picks an available action. Afterwards, it receives an instant random reward, and the system transits to the next state according to some state transition distribution. The reward distribution and the state transition distribution depend on the current state and the chosen action, but are independent of all the previous states and actions. The goal of the DM is to maximize its cumulative rewards under the following challenges:

  • Uncertainty: the reward and the state transition distributions are initially unknown to the DM.

  • Non-stationarity: the environment is non-stationary, and both of the reward distributions and the state transition distributions can evolve over time.

  • Partial feedback: the DM can only observe the reward and state transition resulted by the current state and the chosen action in each round.

In fact, many applications, such as inventory control (Bertsekas 2017) and transportation (Zhang and Wang 2018, Qin et al. 2019), can be modeled by this general framework.

Under stationarity, this problem can be solved by the classical Upper-Confidence bound for Reinforcement Learning (UCRL2) algorithm (Jaksch et al. 2010). Unfortunately, the strategies for the stationary setting can deteriorate in non-stationarity environments as historical data “expires”. To address this shortcoming, (Jaksch et al. 2010, Gajane et al. 2018) further consider a switching MDP setting where the MDP is piecewise-stationary, and propose solutions for it. Under the special case of multi-armed bandit (MAB), where the MDP has only one state, there is a recent stream of research initiated by (Besbes et al. 2014) that studies the so-called drifting environment (Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019, Chen et al. 2019), in which the reward of each action can change arbitrarily over time, but the total change (quantified by a suitable metric) is upper bounded by some variation budget (Besbes et al. 2014). The aim is to minimize the dynamic regret, the optimality gap compared to the cumulative rewards of the sequence of optimal actions.

In this paper, we generalize the concept of “drift” from MAB to RL, i.e., the reward and state transition distributions can shift gradually as long as their T rounds total changes are bounded by Br and Bp, respectively. We then design and analyze novel algorithms for RL in a drifting environment. Let S and A be the number of states and actions for the MDP, our main contributions are listed as follows.

  • When the variation budgets are known, we develop a Sliding Window UCRL2 with Confidence Widening (SWUCRL2-CW) algorithm with O~(Dmax(Br+Bp+1)1/3S2/3A1/3T2/3) dynamic regret bound under the time-invariant variation budget assumption that is similar to, but less restive than (Yu and Mannor 2009).

  • We identify a unique challenge in RL under drift: existing works for switching MDP settings (Jaksch et al. 2010, Gajane et al. 2018) estimate unknown parameters by averaging historical data in a “forgetting” fashion, and crucially exploit the piecewise stationary environment to achieve low regret. But in non-stationary settings in general, the diameter (a complexity measure to be defined in Section id1) of the MDP estimated in this manner can grow wildly, and may result in unfavorable dynamic regret bounds for drifting environments due to a lack of piecewise stationarity. This is further discussed in Section 3 and Section id1. We overcome this with our novel confidence widening technique when the variations are uniformly bounded over time. We also show that one can bypass this difficulty for many realistic cases stemmed from inventory control or queuing systems under mild assumption.

  • When the variation budgets are unknown, we propose the Bandit-over-Reinforcement Learning (BORL) framework that tunes the SWUCRL2-CW algorithm adaptively, and hence enjoys a parameter free O~(Dmax(Br+Bp+1)1/3S2/3A1/3T2/3+DmaxS1/2A1/4T3/4) dynamic regret bound.

Learning un-discounted MDPs under stationarity has been studied in (Bartlett and Tewari 2009, Jaksch et al. 2010, Agrawal and Jia 2017, Fruit et al. 2018a, b) among others. For non-stationary MDPs, the stream of works (Even-Dar et al. 2005, Nilim and Ghaoui 2005, Xu and Mannor 2006, Dick et al. 2014) consider settings with either changing reward functions or transition kernels, but not both. In contrast, (Yu and Mannor 2009) allows arbitrary changes in reward, but (globally) bounded changes in the transition kernels, and design algorithms under additional Markov chain mixing assumptions. (Jaksch et al. 2010, Gajane et al. 2018) proposes solutions for the switching setting. (Abbasi-Yadkori et al. 2013) consider learning MDPs in an adversarial environment with full information feedback.

For online learning and bandit problems where there is only one state, the works by (Auer et al. 2002, Garivier and Moulines 2011, Besbes et al. 2014, Keskin and Zeevi 2016) propose several “forgetting” strategies for different settings. More recently, the works by (Jadbabaie et al. 2015, Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019, Chen et al. 2019) design parameter-free algorithms for non-stationary online learning.

An instance of non-stationary oline MDP is specified by the tuple (𝒮,𝒜,T,r,p). The set 𝒮 is a finite set of states. The collection 𝒜={𝒜s}s𝒮 contains a finite action set 𝒜s for each state s𝒮. The quantity T is the total number of time steps. The quantity r={rt}t=1T is a sequence of mean rewards. For each t, we have rt={rt(s,a)}s𝒮,a𝒜s, where rt(s,a)[0,1] for each s,a. The quantity p={pt}t=1T is a sequence of transition kernels. For each t, we have pt={pt(|s,a)}s𝒮,a𝒜s, where pt(|s,a) is a probability distribution over 𝒮 for each s,a. The quantities pt,rt vary across different t’s in general. We elaborate on their temporal variations in Section 1 when we provide the performance guarantee of our algorithm according to the degree of non-stationarity.

Dynamics. We consider a DM who faces an online non-stationary MDP instance (𝒮,𝒜,T,p,r). The DM knows 𝒮,𝒜,T, but not p,r. It starts at an arbitrary state s1𝒮. At time t, three events happen. First, the DM observes its current state st. Second, it takes an action at𝒜st. Third, given st,at, it stochastically transits to another state st+1 which is distributed as pt(|st,at), and received a stochastic reward Rt(st,at), which is 1-sub-Gaussian with mean rt(st,at). In the second event, the choice of at is based on a non-anticipatory policy Π. That is, the choice only depends on the current state st and the previous observations Ht-1:={sq,aq,Rq(sq,aq)}q=1t-1.

Objective. The DM aims to maximize the cumulative expected reward 𝔼[t=1Trt(st,at)], despite the model uncertainty on pt,rt’s and the non-stationarity of the learning environment. To measure the convergence of an online algorithm to optimality, we consider an equivalent objective of dynamic regret minimization. For each time t, let ρt* be the optimum long term reward for the online MDP problem with stationary transition kernel pt and stationary mean reward function rt. To compute ρt*, we refer to the linear programming formulation (18) in Section id1. The DM aims to design policy Π to minimize the dynamic regret

Remark 1

We note that this regret notion recovers the dynamic regret definition in bandit setting when specified to bandits. Different from bandit, the expected total reward received could be much higher than the benchmark as the latter does not take into account the starting state.

We also review the relevant concepts of communicating MDPs and their diameters.

Definition 1 (Hitting Time, Communicating MDPs and Diameters)

Consider a set of states S, a collection A={As}sS of action sets, and a transition kernel p¯={p¯(|s,a)}sS,aAs. For any s,sS and any stationary policy π, the hitting time from s to s under π is the random variable

Λ(s|π,s):=min{t:st+1=s,s1=s,sτ+1p¯(|sτ,π(sτ)) τ},

which can be infinite. We say that (S,A,p¯) constitutes a communicating MDP if and only if

D:=maxs,s𝒮minstationary π𝔼[Λ(s|π,s)]

is finite. The quantity D is the diameter associated with (S,A,p¯).

To enable learning in a non-stationary environment, we make the following reachability assumption.

Assumption 1

For time t{1,,T}, the tuple (S,A,pt) constitutes a communicating MDP with diameter at most Dt. We denote D𝑚𝑎𝑥=maxt{1,,T}Dt.

In this section, we present the SWUCRL2-CW algorithm, which incorporates sliding window estimates (Garivier and Moulines 2011) and a novel confidence widening technique into the UCRL2 algorithm (Jaksch et al. 2010).

The SWUCRL2-CW algorithm first specifies a window length W, and runs in a sequence of episodes that partitions the T time steps. Episode m starts at round τ(m) (in particular τ(1)=1), and ends at the end of round τ(m+1)-1. Throughout an episode m, the DM follows a certain stationary policy π~m. It ceases the mth episode if at least one of the following two criteria is met:

  • The round index t is a multiple of W. This ensures that each episode is at most W time steps long, and prevents choosing actions based on out-of-date information

  • There exists some (state,action) pair (s,a) such that νm(s,a), the number of visits to them within episode m, is at least as many as the total number of visits to them within the W rounds prior to τ(m), i.e., rounds (τ(m)-W)1,,(τ(m)-1). This is similar to the doubling criterion in (Jaksch et al. 2010), which ensures that each episode is sufficiently long so that the DM can focus on learning.

The combined effect of these two criteria allows the DM to learn a near-optimal policy with historical data from an appropriately sized time window. One important piece of ingredient is the construction of the policy for each episode. To allow learning under non-stationarity, the SWUCRL2-CW algorithm computes the policy π~m for the mth episode based on the history in the previous W rounds before the current episode m, i.e., between round (τ(m)-W)1 and τ(m)-1. The construction of π~m involves the Extended Value Iteration (EVI) (Jaksch et al. 2010), which requires the confidence/uncertainty regions Hr,τ(m),Hp,τ(m)(η) for rewards and transition kernels as the inputs, in addition to an error parameter ϵ. We note that the parameter η0 is a certain confidence widening parameter, which contributes to ensure the output MDP of the EVI has a bounded diameter.

For ease of discussion, we first define for each state-action pair s,a and each t belongs to episode m,

Nt(s,a):=q=(τ(m)-W)1t-1𝟏((sq,aq)=(s,a)),Nt+(s,a):=max{1,Nt(s,a)}. (1)

1) Confidence Region for Rewards. For each state-action pair s,a and each t belongs to episode m, we consider the empirical mean estimator


which has mean


The confidence region is Hr,t={Hr,t(s,a)}s𝒮,a𝒜s with

Hr,t(s,a):={r˙[0,1]:|r˙-r^t(s,a)|𝗋𝖺𝖽-rt(s,a)}, (2)

where 𝗋𝖺𝖽-rt(s,a):=2log(4SAT/δ)/Nt+(s,a).

2) Confidence Region and Confidence Widening for Transition Kernels. For each s𝒮,a𝒜s,s𝒮 and each time t in episode m, we also consider the empirical mean estimator

p^t(s|s,a):=1Nt+(s,a)(q=(τ(m)-W)1t-1𝟏(sq=s,aq=a,sq+1=s)). (3)

which has mean

p¯t(s|s,a):=1Nt+(s,a)q=(τ(m)-W)1t-1pq(s|s,a)𝟏(sq=s,aq=a). (4)

The confidence region is Hp,t(ηp)={Hp,t(s,a;ηp)}s𝒮,a𝒜s with

Hp,t(s,a;ηp):={p˙Δ𝒮:p˙()-p^t(|s,a)1𝗋𝖺𝖽-pt(s,a)+η}, (5)

where 𝗋𝖺𝖽-pt(s,a)=22Slog(8ATW/δ)/Nt+(s,a).


[t] SWUCRL2-CW algorithm \[email protected]@algorithmic[1] \StateInput: Time horizon T, state space 𝒮, and action space 𝒜, window size W, confidence widening parameter η[0,1], initial state s1. \StateInitialize t1. \Forepisode m=1,2, \StateSet τ(m)t, νm(s,a)0, and Nτ(m)(s,a) according to Eqn (1), for all s,a.   \StateCompute the confidence regions Hr,τ(m), Hp,τ(m)(η) according to Eqns (2, 5).  \StateCompute a (1/τ(m))-optimal optimistic policy π~m:


t is not a multiple of W and νm(st,π~m(st))<Nτ(m)+(st,π~m(st)) \StateChoose action at=π~m(st), observe reward Rt(st,at) and the next state st+1.\StateUpdate νm(st,at)νm(st,at)+1,tt+1.  \Ift>T \StateThe algorithm is terminated. \EndIf\EndWhile\EndFor

3) Extended Value Iteration (EVI) (Jaksch et al. 2010). The SWUCRL2-CW algorithm relies on the EVI, which solves MDPs with optimistic exploration to near-optimality. We extract (and rephrase) a description of EVI in Appendix id1. EVI inputs the confidence regions Hr,Hp(η) for the rewards and the transition kernels. It then outputs an “optimistic MDP model”, which consists of reward vector and transition kernel under which the optimal average gain is the largest among all reward vectors and transition kernels in the supplied confidence regions. Specifically, it works as follows:

  • Input: confidence regions Hr for r, Hp for p, and an arbitrarily small error parameter ϵ>0.

  • Output: The returned policy π~ and the auxiliary output (r~,p~,ρ~,γ~). Here, r~, p~, and ρ~ are the selected “optimistic” reward vector, transition kernel, and the corresponding long term average reward; While γ~𝐑0𝒮 is the bias vector (Jaksch et al. 2010). Collectively, we express EVI(Hr,Hp,ϵ)=(r~,p~,ρ~,γ~,π~). Note that γ~0 but γ~(s)=0 for some s.

The output of the EVI further satisfy the following two properties.

Property 1

The dual variables (ρ~,γ~) is optimistic, i.e.,

Property 2

For each state sS, we have


Consequently, π~ is an optimal (up to an additive factor of ϵ) deterministic policy for MDP (S,A,p~,r~).

With these, the formal description of the SWUCRL2-CW algorithm is shown in Algorithm id1.

We are now ready to analyze the performance of the SWUCRL2-CW algorithm. As we are under a changing environment, the dynamic regret bound of SWUCRL2-CW algorithm shall adapt to the underlying shifts (Besbes et al. 2014). We thus define the following variation measures Br,Bp for the drifts in rewards and transition kernels across time:

Br=t=1T-1Br,t,Br,t maxs𝒮,a𝒜s{|rt+1(s,a)-rt(s,a)|}, (6)
Bp=t=1T-1Bp,t,Bp,t maxq[T-1],s𝒮,a𝒜s{pq+1(|s,a)-pq(|s,a)1}. (7)

One can interpret the quantities Br,t,Bp,t as an upper bound on the variations in rewards and transition kernels between round t and t+1, and the quantities Br,Bp 11endnote: 1 Clearly, we cannot hope to achieve a dynamic regret sublinear in T if Br or Bp is Ω(T), so we focus on the case when Br,Bpo(T). are the total allowable variations for rewards and transition kernels in T rounds. Similar to (but less restrictive than) (Yu and Mannor 2009), we further make the following assumption to imposes the transition kernel changes slowly at a steady rate across time.

Assumption 2

The transition kernel’s variation budgets are uniform over time, i.e.,


Together with the confidence widening technique, we are guaranteed the resulted MDP of the EVI has a bounded diameter. For ease of exposition, we also define the piecewise variations for each t belongs to episode m,


To proceed, we introduce two events r,p, which state that the estimated reward and transition kernels lie in the confidence region.


We prove that r,p hold with high probability.

Lemma 1

We have Pr[Er]1-δ/2, Pr[Ep]1-δ/2.

The proof is in Section id1 of the appendix. We then bound the dynamic regret of each round.

Proposition 1

Conditioning on events Er,Ep and assuming that ηBpW/2T, for every episode m and every time t in the episode (i.e. t{τ(m),,τ(m+1)-1}) the following inequality holds for certainty:

+ [2𝗏𝖺𝗋r,t+4Dτ(m)(𝗏𝖺𝗋p,t+η)]+[2𝗋𝖺𝖽-rτ(m)(st,at)+4Dτ(m)𝗋𝖺𝖽-pτ(m)(s,a)].

The complete proof can be found in Section id1 of the appendix. Consider time t, which belongs episode m. In a high level, the proof goes through three steps:

  1. 0.

    The estimated mean reward r¯τ(m)(s,a) (used in computing π~m) and the true reward rt(s,a) differ by at most 𝗏𝖺𝗋r,t.

  2. 0.

    With the widened confidence regions Hp,τ(m)(η), we are guaranteed that there exists a MDP, i.e., (𝒮,𝒜,pτ(m)), with diameter at most Dτ(m) in Hp,τ(m)(η). By Lemma 2 and the dual formulation of optimal reward (19), we have the optimistic long term reward ρ~τ(m) induced by π~m is at least ρt*-𝗏𝖺𝗋r,t-2Dτ(m)𝗏𝖺𝗋p,t.

  3. 0.

    Finally, we relate the optimistic transition kernel p~τ(m) and the underlying kernel pt to account for the extra loss due to a widened confidence region. Unifying all three steps, we can conclude the statement.

Combining the above, we can conclude the statement.  \@endparenv Suppose we denote S:=|𝒮|, A:=1Ss𝒮|𝒜s|, and use the O~() notation to hide logarithmic dependence on Dmax,S,A,T and the confidence parameter δ for defining the confidence regions Hr,t,Hp,t(η), our first main result is a dynamic regret upper bound on the SWUCRL2-CW algorithm.

Theorem 1

Assuming S>1, the SWUCRL2-CW algorithm with window size W and ηBpW/(2T) satisfies the dynamic regret bound

Dyn-RegT(SWUCRL2-CW)=O~(BrW+D𝑚𝑎𝑥[BpW+SATW+Tη+SATW+T]) (8)

with probability 1-O(δ). If we further put

W=W*=S23A13T23/(Br+Bp+1)23, (9)
η=η*=(Bp+1)W2T=S23A13(Bp+1)2T13(Br+Bp+1)23, (10)

this is


The complete proof of Theorem 1 is presented in Section id1 of the appendix.

Remark 2

When S=1, our problem model specializes to the non-stationary bandit problem studied by (Besbes et al. 2014). In this case, we have D𝑚𝑎𝑥=0, and we are left with the first term in (8). By choosing W=W*=A1/3T2/3/max{Br,1}2/3, our algorithm has dynamic regret O~(Br1/3A1/3T2/3), matching the minimax optimal dynamic regret bound by (Besbes et al. 2014).

Remark 3

Different from (Cheung et al. 2019), there is no straightforward way of setting W,η to get a non-trivial bound when Bp,Br are not known. While (Cheung et al. 2019) provide a way to set their window size (oblivious of their variational budget Br) so that the dynamic regret is O~(BrT2/3) in their linear bandit setting, in our setting we still need to have ηBpW/(2T), which is a prior not clear how to ensure when we don’t know Bp.

We now pause for a while to comment on Assumption 2 and the technique of confidence widening.

In online stochastic environment, one usually take time average of observed samples to estimate a certain latent quantity, even when the sample distributions vary with time. This has been proved to work well in the non-stationary bandit settings Garivier and Moulines (2011), Cheung et al. (2019). For online MDPs, one typically look at the time average MDP p^t in (3), which estimate p¯t in (4) to within an additive error O~(1/Nt+(s,a)) for any pair of (s,a). In the case of stationary MDPs where t[T]pt=p, one has p¯t=p, and thus can conclude that the un-widened confidence region Hp,t(0) contains p with high probability. An immediate consequence is that the EVI w.r.t. Hp,t(0) would return a policy with bounded difference bias vector γ~ as p has diameter D (Please see Section 4.3 of (Jaksch et al. 2010)). These further ensure that the optimistic long term average reward is not far away from the true long term average reward, e.g., step 2) in the proof of Proposition 1.

Nevertheless, this is not always the case under non-stationarity. Although (Jaksch et al. 2010, Gajane et al. 2018) uses Hp,t(0) for piecewise stationary MDP setting, they crucially exploit the fact that the MDP remain unchanged between jumps, and can treat the problem as if it is stationary. For changing environments, one can only guarantee that the transition kernel p¯t={p¯t(|s,a)}s𝒮,a𝒜sHp,t(0), with high probability, but unsure about the true pt’s due to the drift. In Section id1 of the appendix, we show that D, the diameter of (𝒮,𝒜,p¯t), can grow as Ω(W/logW), and the EVI w.r.t. Hp,t(0) can only promise a policy with bias vector bounded by Ω(W/logW), which makes the dynamic regret bound vacuous for drifting environments. By assumption 2 and the confidence widening technique, we are guaranteed that pτ(m)Hp,τ(m)(η) for each episode m, and can proceed as what we have done.

Alternatively, if there exists ζ>0 such that for any states s,s𝒮, there is always an action as,s𝒜s with pt(s|s,a(s,s))ζt[T], then the MDP (𝒮,𝒜,p¯t) has diameter D¯:=1/ζ, and it can shown that SWUCRL2-CW algorithm with η=0 enjoys a dynamic regret bound O~(D¯(Br+Bp+1)1/3S2/3A1/3T2/3) by similar techniques. As we shall shown in Section id1, this assumption can be easily satisfied in many realistic applications.

As pointed out by Remark 3, in the case of unknown Br and Bp, the DM cannot implement the SWUCRL2-CW algorithm as the magnitude of confidence widening cannot be determined. To handle this case, we wish to design an online algorithm that can attain reasonable dynamic regret bound in a parameter-free manner. By Theorem 1, we are assured: under Assumption 2, a fixed pair of parameters (W*,η*) can ensure low regret. For the bandit setting, (Cheung et al. 2019) proposes the bandit-over-bandit framework that uses a separate copy of EXP3 algorithm to tune the window length. Inspired by it, we develop a novel Bandit-over-Reinforcement Learning (BORL) algorithm with parameter-free O~(Dmax(Br+Bp+1)13S23A13T23+DmaxS12A14T34) dynamic regret here.

Following a similar line of reasoning as (Cheung et al. 2019), we make use of the SWUCRL2-CW algorithm as a sub-routine, and “hedge” (Bubeck and Cesa-Bianchi 2012) against the (possibly adversarial) changes of rt’s and pt’s to identify a reasonable fixed window length and confidence widening parameter.

As illustrated in Fig. 1, the BORL algorithm divides the whole time horizon into T/H blocks of equal length H rounds (the length of the last block can H), and specifies a set J from which each pair of (window length, confidence widening) parameter are drawn from. For each block i[T/H], the BORL algorithm first calls some master algorithm to select a pair of (window length, confidence widening) parameters (Wi,ηi)(J), and restarts the SWUCRL2-CW algorithm with the selected parameters as a sub-routine to choose actions for this block. Afterwards, the total reward of block i is fed back to the master, and the “posterior” of these parameters are updated accordingly.

One immediate challenge not presented in the bandit setting (Cheung et al. 2019) is that the starting state of each block is determined by previous moves of the DM. Hence, the master algorithm is not facing a simple oblivious environment as the case in bandit setting where there is only one sate, but fortunately, the state is observed before the starting of a block. To this end, we use the EXP3.P algorithm for multi-armed bandit against an adaptive adversary (Auer et al. 2002, Bubeck and Cesa-Bianchi 2012) as the master algorithm.

Figure 1: Structure of the BORL algorithm

We are now ready to state the details of the BORL algorithm. For some fixed choice of block length H (to be determined later), we first define a couple of additional notations:

Φ=S2/3A1/32T1/3,ΔW=lnH,Δη=lnΦ-1,Δ=(ΔW+1)(Δη+1), (11)

Here, JW and Jη are all possible choices of window length and confidence widening parameter, respectively, and J is the Cartesian product of them with |J|=Δ. We emphasize that due to the restarting, any instance of the SWUCRL2-CW algorithm cannot last for more than H rounds. Consequently, even if the EXP3.P selects a window length W>H, the effective window length is H, and we make J[H]. We also let 𝖱(W,η,s) be the total rewards for running the SWUCRL2-CW algorithm with window length W and confidence widening parameter η for H rounds starting from state s,

The EXP3.P algorithm (Bubeck and Cesa-Bianchi 2012) treats each element of J as an arm. It begins by initializing

α=0.95lnΔΔT/H,β=lnΔΔT/H,γ=1.05ΔlnΔT/H,q(j,k),1=0(j,k)M, (12)

where M={(j,k):j{0,1,,ΔW},k{0,1,,Δη}}. At the beginning of each block i[T/H], the BORL algorithm first sees the state s(i-1)H+1, and computes

(j,k)M,u(j,k),i=(1-γ)exp(αq(j,k),i)(j,k)Mexp(αq(j,k),i)+γΔ. (13)

Then it sets (ji,ki)=(j,k) with probability u(j,k),i(j,k)M. The selected pair of parameters are thus Wi=Hji/ΔW and ηi=Φki/Δη. Afterwards, the BORL algorithm starts from state s(i-1)H+1, selects actions by running the SWUCRL2-CW algorithm with window length Wi and confidence widening parameter ηi for each round t in block i. At the end of the block, the BORL algorithm observes the total rewards 𝖱(Wi,ηi,s(i-1)H+1). As a last step, it rescales 𝖱(Wi,ηi,s(i-1)H+1) by dividing it by H so that it is within [0,1], and updates

(j,k)Mq(j,k),i+1=q(j,k),i+β+𝟏(j,k)=(ji,ki)𝖱(Wi,ηi,s(i-1)H+1)/Hu(j,k),i. (14)

The formal description of the BORL algorithm (with H defined in the next subsection) is shown in Algorithm the-at-equationgroup-at-IDh. {algorithm}[!ht] BORL algorithm \[email protected]@algorithmic[1] \StateInput: Time horizon T, state space 𝒮, and action space 𝒜, initial state s1. \StateInitialize HSAT, Φ,ΔW,Δη,Δ,JW,Jη according to eq. (11), and α,β,γ according to eq. (12). \StateM{(j,k):j{0,1,,ΔW},k{0,1,,Δη}},q(j,k),10(j,k)M. \Fori=1,2,,T/H \StateDefine distribution (u(j,k),i)(j,k)M according to eq. (13), and set (ji,ki)(j,k) with probability u(j,k),i.\StateWiHji/ΔW,ηiΦki/Δη. \Fort=(i-1)H+1,,iHT \StateRun the SWUCRL2-CW algorithm with window length Wi and blow up parameter ηi, and observe the total rewards 𝖱(Wi,ηi,s(i-1)H+1). \EndFor\StateUpdate q(j,k),i+1 according to eq. (14). \EndFor

To analyze the performance of the BORL algorithm, we consider the following regret decomposition, for any choice of (W,η)J, we have

Dyn-Reg(𝙱𝙾𝚁𝙻,T)= i=1T/H𝔼[t=(i-1)H+1iHTρt*-𝖱(Wi,ηi,s(i-1)H+1)]
= i=1T/H𝔼[t=(i-1)H+1iHTρt*-𝖱(W,η,s(i-1)H+1)]
+ i=1T/H𝔼[i=1T/H𝖱(W,η,s(i-1)H+1)-𝖱(Wi,ηi,s(i-1)H+1)]. (15)

For the first term in eq. (15), we can apply the results from Theorem 1 to each block iT/H, i.e.,

= O~(Br(i)W+Dmax[Bp(i)W+SAH/W+Hη+SAH/W+H]), (16)

where we have defined


for brevity. For the second term, it captures the additional rewards of the DM were it uses the fixed parameters (W,η) throughout w.r.t. the trajectory on the starting states of each block by the BORL algorithm, i.e., s1,,s(i-1)H+1,,s(T/H-1)H+1, and this is exactly the regret of the EXP3.P algorithm when it is applied to a Δ-arm adaptive adversarial bandit problem with reward from [0,H]. Therefore, for any choice of (W,η), we can upper bound this by


as Δ=O(ln2T). Summing these two, the regret of the BORL algorithm is

Dyn-RegT(𝙱𝙾𝚁𝙻)=O~(BrW+Dmax[BpW+SATW+Tη+SATW+TH]). (17)

We now point out a key trade-off in the choice of H:

  • On one hand, H should be small enough so that the regret bound in eq. (17) is small.

  • On the others, H should be large to allow W to get close to W* even when Br+Bp is small.

To this end, we pick


We also justify the choice of H, and formalize the dynamic regret bound of the BORL algorithm as follows.

Theorem 2

Assume that S>1, the dynamic regret bound of the BORL algorithm is


with probability 1-O(δ).

The complete proof can be found in Section id1 of the Appendix.


  • Abbasi-Yadkori et al. (2013) Abbasi-Yadkori, Yasin, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvári. 2013. Online learning in markov decision processes with adversarially chosen transition probability distributions. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS).
  • Agrawal and Jia (2017) Agrawal, Shipra, Randy Jia. 2017. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 1184–1194.
  • Auer et al. (2002) Auer, P., N. Cesa-Bianchi, Y. Freund, R. Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002, Vol. 32, No. 1 : pp. 48–77.
  • Bartlett and Tewari (2009) Bartlett, Peter L., Ambuj Tewari. 2009. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating mdps. UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. 35–42.
  • Bertsekas (2017) Bertsekas, Dimitri. 2017. Dynamic Programming and Optimal Control. Athena Scientific.
  • Besbes et al. (2014) Besbes, Omar, Yonatan Gur, Assaf Zeevi. 2014. Stochastic multi-armed bandit with non-stationary rewards. Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS).
  • Bubeck and Cesa-Bianchi (2012) Bubeck, S., N. Cesa-Bianchi. 2012. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012, Vol. 5, No. 1: pp. 1–122.
  • Chen et al. (2019) Chen, Yifang, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei. 2019. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. Proceedings of Conference on Learning Theory (COLT).
  • Cheung et al. (2019) Cheung, Wang Chi, David Simchi-Levi, Ruihao Zhu. 2019. Learning to optimize under non-stationarity. Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).
  • Dick et al. (2014) Dick, Travis, András György, Csaba Szepesvári. 2014. Online learning in markov decision processes with changing cost sequences. Proceedings of the International Conference on Machine Learning (ICML).
  • Even-Dar et al. (2005) Even-Dar, Eyal, Sham M Kakade, , Yishay Mansour. 2005. Experts in a markov decision process. Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS).
  • Fruit et al. (2018a) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric. 2018a. Near optimal exploration-exploitation in non-communicating markov decision processes. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett, eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2998–3008.
  • Fruit et al. (2018b) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner. 2018b. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. Jennifer Dy, Andreas Krause, eds., Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 1578–1586.
  • Gajane et al. (2018) Gajane, Pratik, Ronald Ortner, Peter Auer. 2018. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. CoRR abs/1805.10066. URL
  • Garivier and Moulines (2011) Garivier, A., E. Moulines. 2011. On upper-confidence bound policies for switching bandit problems. Proceedings of International Conferenc on Algorithmic Learning Theory (ALT).
  • Hoeffding (1963) Hoeffding, Wassily. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association 58(301) 13–30.
  • Jadbabaie et al. (2015) Jadbabaie, Ali, Alexander Rakhlin, Shahin Shahrampour, Karthik Sridharan. 2015. Online optimization : Competing with dynamic comparators. Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).
  • Jaksch et al. (2010) Jaksch, Thomas, Ronald Ortner, Peter Auer. 2010. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11 1563–1600.
  • Karnin and Anava (2016) Karnin, Z., O. Anava. 2016. Multi-armed bandits: Competing with optimal sequences. Procedding of Annual Conference on Neural Information Processing Systems (NIPS).
  • Keskin and Zeevi (2016) Keskin, N., A. Zeevi. 2016. Chasing demand: Learning and earning in a changing environments. Mathematics of Operations Research, 2016, 42(2), 277–307.
  • Lattimore and Szepesvári (2018) Lattimore, T., C. Szepesvári. 2018. Bandit Algorithms. Cambridge University Press.
  • Luo et al. (2018) Luo, H., C. Wei, A. Agarwal, J. Langford. 2018. Efficient contextual bandits in non-stationary worlds. Proceedings of Conference on Learning Theory (COLT).
  • Nilim and Ghaoui (2005) Nilim, Arnab, Laurent El Ghaoui. 2005. Robust control of markov decision processes with uncertain transition matrices. Operations Research.
  • Qin et al. (2019) Qin, Zhiwei (Tony), Jian Tang, Jieping Ye. 2019. Deep reinforcement learning with applications in transportation. Tutorial of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19).
  • Weissman et al. (2003) Weissman, Tsachy, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, , Marco L. Weinberger. 2003. Inequalities for the l1 deviation of the empirical distribution. Technical Report HPL-2003-97, HP Laboratories Palo Alto:
  • Xu and Mannor (2006) Xu, Huan, Shie Mannor. 2006. The robustness-performance tradeoff in markov decision processes. Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS).
  • Yu and Mannor (2009) Yu, Jia Yuan, Shie Mannor. 2009. Online learning in markov decision processes with arbitrarily changing rewards and transitions. Proceedings of the International Conference on Game Theory for Networks.
  • Zhang and Wang (2018) Zhang, Anru, Mengdi Wang. 2018. Spectral state compression of markov processes.

Appendix. Supplementary

The optimal long term reward ρt* is equal to the optimal value of the linear program 𝖯(rt,pt). For a reward vector r and a transition kernel p, we define

𝖯(r,p)max s𝒮,a𝒜sr(s,a)x(s,a) (18)
s.t. a𝒜sx(s,a)=s𝒮,a𝒜sp(s|s,a)x(s,a) s𝒮
x(s,a)0 s𝒮,a𝒜s

Throughout our analysis, it is useful to consider the following dual formulation 𝖣(r,p) of the optimization problem 𝖯(r,p):

𝖣(r,p)min ρ (19)
s.t. ρ+γ(s)r(s,a)+s𝒮p(s|s,a)γ(s) s𝒮,a𝒜s
ϕ,γ(s) free s𝒮.

The following Lemma shows that any feasible solution to 𝖣(r,p) is essentially bounded if the underlying MDP is communicating, which will be crucial in the subsequent analysis.

Lemma 2

Let (ρ,γ) be a feasible solution to the dual problem 𝖣(r,p), where (S,A,p) consititute a communicating MDP with diameter D. We have


The Lemma is extracted from Section 4.3.1 of (Jaksch et al. 2010), and it is more general than (Lattimore and Szepesvári 2018), which requires (ρ,γ) to be optimal instead of just feasible.

We provide the pseudo-codes of the extended value iteration proposed by (Jaksch et al. 2010), displayed in Algorithm id1. By (Jaksch et al. 2010), the algorithm converges in finite time when Hp contains a transition kernel p such that (𝒮,𝒜,p) constitutes a communicating MDP. {algorithm}[!ht] EVI(Hr,Hp;ϵ), mostly extracted from (Jaksch et al. 2010) \[email protected]@algorithmic[1] \StateInitialize VI record u0𝒮 as u0(s)=0 for all s𝒮. \Fori=0,1, \StateFor each s𝒮, compute VI record ui+1(s)=maxa𝒜sΥ~i(s,a), where


Define stationary policy π~:𝒮𝒜s as π~(s)=argmaxa𝒜sΥ~i(s,a). \StateDefine optimistic reward r~={r~(s,a)}s,a with r~(s,a)argmaxr˙(s,a)Hr(s,a){r˙(s,a)}. \StateDefine optimistic kernel p~={p~(|s,a)}s,a with p~(|s,a)argmaxp˙Hp(s,a){s𝒮ui(s)p˙(s)}. \StateDefine optimistic dual variables ρ~=maxs𝒮{ui+1(s)-ui(s)}, γ~(s)=ui(s)-mins𝒮ui(s). \Ifmaxs𝒮{ui+1(s)-ui(s)}-mins𝒮{ui+1(s)-ui(s)}ϵ \StateBreak the for loop. \EndIf\EndFor\StateReturn policy π~. \StateAuxiliary output: optimistic reward and kernel (r~,p~), optimistic dual variables (ρ~,γ~).

Appendix. Proofs

We divide the proof into two parts.

The proof makes use of the well known Hoeffding’s inequality (Hoeffding 1963) as stated below.

Proposition 2

Let {Xt}t=1T[0,1]T be a martingale difference sequence adapted to a filtration {Ft}t=0T. That is, Xt is Ft-measurable, and E[Xt|Ft-1]=0. For any δ(0,1), we have


Now, by Hoeffding’s inequality

Pr[|r^t(s,a)-r¯t(s,a)|𝗋𝖺𝖽-rt(s,a)]1-δ2SAT, (20)

and we can further conclude the statement by a simple union bound over all possible choices of (s,a,t)(𝒮,𝒜,[T]).

Unlike the stationary setting, the data in the non-stationary environment is no longer i.i.d., which violates the assumption of (Weissman et al. 2003), and we have to appeal to other approaches to analyze the deviation of the estimated transition kernels. Following a covering argument from (Lattimore and Szepesvári 2018), we first prove the following proposition.

Proposition 3

For any fixed sS,aA, and t[T],


Proof of Proposition 3] We consider only non-trivial case where the (s,a) pairs that has Nt(s,a)1. By definition,

p^t(|s,a)-p¯t(|s,a)1= p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a)1,

We follow the covering argument from (Lattimore and Szepesvári 2018)

= supxS:x1p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),x. (21)

Consider the finite subset 𝒞, which is the regular grid with stride 1/(2W) of the set [-1,1]S, and it can be easily verified that for any x[-1,1]S, there exists y𝒞 such that x-y1/(2W). By the Hoeffing’s inequality (Proposition 2) and a union bound over all possible y𝒞 and all possible values of Nt(s,a)[W], we have with probability at least 1-δ, for all y𝒞,

p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),y 2log(|𝒞|W/δ)Nt(s,a)
2log((4W)SW/δ)Nt(s,a). (22)

Now we let


and y=argminx𝒞x-x*. Then from eq. (21),

= p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),x*
= p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),y
= p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),y
= p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),y+2suppS:p11p,x*-y
= p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),y+2x*-y
p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a),y+1W (23)

Here inequality (23) follows from the fact that y is the closest point to x* in 𝒞. From inequality (22), we have with probability at least 1-δ, (23) is upper bounded as

2log((4W)SW/δ)Nt(s,a)+2log((4W)SW/δ)Nt(s,a). (24)

Here, we sequentially exploit the fact that Nt(s,a)W, Nt(s,a)1, and 2log((4W)SW/δ)1. From inequalities (23) and (24), we have with probability at least 1-1/(2SAT) (i.e., setting δ=1/(2SAT))

p^t(|s,a)-q=(τ(m)-W)1t-1pq(|s,a)𝟏(sq=s,aq=a)Nt(s,a)122Slog(8ATW/δ)Nt(s,a). (25)

Now, by an union bound over all possible choices of (s,a,t)(𝒮,𝒜,[T]), we can conclude the statement.

In this section, we prove Proposition 1. In doing so, we need the following auxiliary claims and lemmas

Claim 1

Let t be in episode m, then


holds for every state-action pair (s,a).

Lemma 3

Conditioned on Er,Ep, if ηBpW/(2T), then


for any t in episode m.

Claim 2

Let t be in episode m. Conditioned on Ep and supposing that ηBp,tW/2, for every state-action pair (s,a) we have


Claims 1, 2 and Lemma 3 are proved in Sectionsid1, id1, id1 respectively. To prove Claim 2 and Lemma 3, we further require two auxiliary results, Claim 3 and Lemma 4, which are stated and proved in Section id1.

We have

|rt(s,a)-r¯τ(m)(s,a)| |rt(s,a)-rτ(m)(s,a)|+|rτ(m)(s,a)-r¯τ(m)(s,a)|

By the definition of Br,q, we have



1Ww=1W|rτ(m)-rτ(m)-w| 1Ww=1Wi=1w|rτ(m)-i+1-rτ(m)-i|

Altogether, the claim is shown.

Claim 3 and Lemma 4 are stated as follows.

Claim 3

Let t be in episode m. For every state-action pair (s,a) we have


Proof of Claim 3] The proof is similar to that of Claim 1. \@endparenv

Lemma 4

Let t be in episode m. Conditioned on Ep, and supposing that ηBpW/(2T), we have 0γ~τ(m)(s)2Dτ(m) for each sS with certainty.


Proof of Lemma 4] First, we claim that pτ(m)(|s,a)Hp,t(s,a;η), conditional on p and assuming that ηBpW/(2T). Indeed, we know that p¯τ(m)(|s,a)Hp,t(s,a;0) for each state-action pair. Now, by Claim 3, we also know that |pτ(m)(|s,a)-p¯τ(m)(|s,a)|BpW/(2T)η. Altogether, our first claim is shown. To complete the proof, by Property 1 for the output of EVI, we know that (ρ~τ(m),γ~τ(m)) is feasible to the dual problem 𝖣(rτ~(m),pτ~(m)). Therefore, the required bounds for γ~τ(m) hold by applying Lemma 2. \@endparenv

As argued in the proof of Lemma 4, we know that pτ(m)(|s,a)Hp,t(s,a;η), conditional on p and assuming that ηBpW/(2T). In addition, condition on r, we know that r¯τ(m)(s,a)Hr,t(s,a) for each state action pair s,a. Consequently, by Property 1 of the output of EVI, we know that

ρ~τ(m)+γ~τ(m)(s)r¯τ(m)(s,a)+s𝒮γ~τ(m)(s)pτ(m)(s|s,a), (26)

where γ~τ(m)[0,2Dτ(m)] by Lemma 4.

Now, we claim that (ρ~τ(m)+𝗏𝖺𝗋r,t+2Dτ(m)𝗏𝖺𝗋p,t,γ~τ(m)) is a feasible solution to the tth period dual problem 𝖣(rt,pt), which immediately implies the Lemma. To demonstrate the claim, for every state action pair (s,a) we have

r¯τ(m)(s,a) rt(s,a)-𝗏𝖺𝗋r,t (27)
s𝒮γ~τ(m)(s)pτ(m)(s|s,a) s𝒮γ~τ(m)(s)pt(s|s,a)-γ~τ(m)pt(|s,a)-p¯τ(m)(|s,a)1
s𝒮γ~τ(m)(s)pt(s|s,a)-2Dτ(m)Bp(t-τ(m))T (28)

where step (27) is by Claim 1, and step (28) is by Claim 3 and Lemma 4. Altogether, putting (27), (28) to (26), our claim is shown, and the Lemma is proved.

See 2 \@trivlist

Proof of Claim 2] We have

γ~τ(m)(a)[p~τ(m)(|s,a)-p¯τ(m)(|s,a)1(b)+p¯τ(m)(|s,a)-pt(|s,a)1(c)]. (29)

In step (29), we know that

  • (a)2Dτ(m) by Lemma 4,

  • (b)2𝗋𝖺𝖽-pτ(m)(s,a)+η, by the facts that p~τ(m)(|s,a)Hp,τ(m)(s,a;η) and p¯τ(m)(|s,a)Hp,τ(m)(s,a;0),

  • (c)𝗏𝖺𝗋p,t by Claim 3.

Altogether, the Claim is proved. \@endparenv

Now, we have

r¯τ(m)(st,at)-𝗏𝖺𝗋r,t (30)
r~τ(m)(st,at)-𝗏𝖺𝗋r,t-2𝗋𝖺𝖽-rτ(m)(st,at) (31)
      -𝗏𝖺𝗋r,t-2𝗋𝖺𝖽-rτ(m)(st,at) (32)
-2[𝗏𝖺𝗋r,t+𝗋𝖺𝖽-rτ(m)(st,at)]-2Dτ(m)[2𝗏𝖺𝗋p,t+2𝗋𝖺𝖽-pτ(m)(s,a)+η]. (33)

Step (30) is by Claim 1. Step (31) is by conditioning that event r holds. Step (32) is by Property 2 for the output of EVI. In step (33), we upper bound ρ~τ(m) by Lemma 3 and we upper bound s𝒮p~τ(m)(s|st,at)γ~τ(m)(s) by Claim 2. Rearranging gives the Proposition.

In this section, we present the complete proof of Theorem 1. Using Proposition 1, we can bound the dynamic regret as follows: denoting M(T) as the total number of episodes, and by abusing the notation we let τ(M(T)+1)-1=T. Episode M(T) is interrupted by the break in Line theequation-at-IDj and the algorithm is forced to terminate as the end of time T is reached.

m=1M(T)t=τ(m)τ(m+1)-1{[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st)]+1τ(m)} ()
+t=1T{2𝗏𝖺𝗋r,t+4Dmax𝗏𝖺𝗋p,t+2Dmaxη} ()
  +m=1M(T)t=τ(m)τ(m+1)-1{2𝗋𝖺𝖽-rτ(m)(st,at)+4Dmax𝗋𝖺𝖽-pτ(m)(st,at)}. ()

We accomplish the promised dynamic regret bound by the following three Lemmas that bound the dynamic regret terms (, , ).

Lemma 5

Conditioning on events Er,Ep and assuming that ηBp,1W/2, with probability 1-O(δ) we have ( )=O~(D𝑚𝑎𝑥[M(T)+T]). Under the same assumption, with certainty we have M(T)SA(2+log2W)T/W=O~(SAT/W).


Proof of Lemma 5] First, to demonstrate the bound for M(T), it suffices to show that there are at most SA(2+log2W) many episodes in each of the following cases: (1) between time steps 1 and W, (2) between time steps jW and (j+1)W, for any j{1,,T/W-1}, (3) between time steps T/WW and T. We focus on case (2), and the edge cases (1, 3) can be analyzed similarly.

Between time steps jW and (j+1)W, a new episode m+1 is started when the second criterion νm(st,π~m(st))<Nτ(m)+(st,π~m(st)) is violated during the current episode m. We first illustrate how the second criterion is invoked for a fixed state-action pair (s,a), and then bound the number of invocations due to (s,a). Now, let’s denote m1,,mL as the episode indexes, where jWτ(m1)<τ(m2)<<τ(mL)<(j+1)W, and the second criterion for (s,a) is invoked during m for 1L. That is, for each {1,,L}, the DM breaks the while loop for episode m because νm(s,a)=Nτ(m)+(s,a), leading to the new episode m+1.

To demonstrate our claimed bound for M(T), we show that L2+log2W as follows. To ease the notation, let’s denote ψ=νm(s,a). We first observe that ψ1=Nτ(m1)+(s,a)1. Next, we note that for {2,L}, we have ψk=1-1ψk. 22endnote: 2 We proceed slightly differently from the stationary case, where the corresponding Nt(s,a) is non-decreasing in t (Jaksch et al. 2010), which is clearly not true here due to the use of sliding windows Indeed, we know that for each we have (τ(m+1)-1)-τ(m1)W, by our assumption on m1,,m. Consequently, the counting sum in Nτ(m)(s,a), which counts the occurrences of (s,a) in the previous W time steps, must have count those occurrences corresponding to ψ1,,ψ-1. The worst case sequence of ψ1,ψ2,,ψL that yields the largest L is when ψ1=ψ2=1, ψ3=2, and more generally ψ=2-2 for 2. Since ψW for all W, we clearly have L2+log2W, proving our claimed bound on L.

Altogether, during the T time steps, there are at most (SAT(2+log2W))/W episodes due to the second criterion and T/W due to the first, leading to our desired bound on M(T).

Next, we establish the bound for (). Now, by virtue of confidence widening and Lemma 2, we know that γ~τ(m)(s)[0,2Dmax] for all m and s by Lemma 4. For each episode m, we have

= -γ~τ(m)(sτ(m))+γ~τ(m)(sτ(m)+1)2Dmax+t=τ(m)τ(m+1)-1[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st+1)]:=Yt. (34)

Summing (34) over m=1,,M(T) we get

m=1M(T)t=τ(m)τ(m+1)-1[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st)]2DmaxM(T)+t=1TYt. (35)

Define the filtration t-1={(sq,aq,Rq(sq,aq))}q=1t. Now, we know that 𝔼[Yt|t-1]=0, Yt is σ(t)-measurable, and |Yt|2Dmax. Therefore, we can apply the Hoeffding inequality (see Proposition 2) to show that


with probability 1-O(δ). Finally, note that


Hence, the Lemma is proved. \@endparenv

Lemma 6

With certainty, ( )=O((Br+D𝑚𝑎𝑥Bp)W+D𝑚𝑎𝑥Tη).


Proof of Lemma 6] We first bound t=1T𝗏𝖺𝗋r,t. Now, recall the definition that, for time t in episode m, we have defined 𝗏𝖺𝗋r,t=q=τ(m)-Wt-1Br,q. Clearly, for iWq<(i+1)W, the summand Br,q only appears in 𝗏𝖺𝗋r,t for iWq<t(i+2)W, since each episode is contained in {iW,,(i+1)W} by our first criterion in Line theequation-at-IDj in Algorithm id1. Altogether, we have

2t=1T𝗏𝖺𝗋r,t2t=1T-1Br,tW=2BrW. (36)

Next, we bound t=1T𝗏𝖺𝗋p,t. Now, we know that τ(m+1)-τ(m)W by the first criterion in Line theequation-at-IDj in Algorithm id1. Consequently,

4Dmaxt=1T𝗏𝖺𝗋p,t4Dmaxt=1T-12BpWT=8DmaxBpW. (37)

Finally, combining (36, 37) with 2Dmaxt=1Tη, the Lemma is established. \@endparenv

Lemma 7

With certainty, we have ( )=O~((SAT+D𝑚𝑎𝑥SAT)/W).


Proof of Lemma 7] We first show that, with certainty,

m=1M(T)t=τ(m)τ(m+1)-1𝗋𝖺𝖽-rτ(m)(st,at)=m=1M(T)t=τ(m)τ(m+1)-12ln(4SAT/δ)Nτ(m)+(st,at)=O~(SATW), (38)
m=1M(T)t=τ(m)τ(m+1)-1𝗋𝖺𝖽-pτ(m)(st,at)=m=1M(T)t=τ(m)τ(m+1)-122Slog(8ATW/δ)Nτ(m)+(st,at)=O~(SATW). (39)

We analyze by considering the dynamics of the algorithm in each consecutive block of W time steps, in a way similar to the proof of Lemma 5. Consider the episodes indexes m0,m1,mT/W,mT/W+1, where τ(m0)=1, and τ(mj)=jW for j{1,,T/W}, and mT/W+1=m(T)+1 (so that τ(mT/W+1-1) is the time index for the last episode in the horizon).

To prove (38, 39), it suffices to show that, for each j{0,1,,T/W}, we have

m=mjmj+1-1t=τ(m)τ(m+1)-11Nτ(m)+(st,at)=O(SAW). (40)

Without loss of generality, we assume that j{1,,T/W-1}, and the edge cases of j=0,T/W can be analyzed similarly.

Now, we fix a state-action pair (s,a) and focus on the summands in (40):

m=mjmj+1-1t=τ(m)τ(m+1)-1𝟏((st,at)=(s,a))Nτ(m)+(st,at)=m=mjmj+1-1νm(s,a)Nτ(m)+(s,a) (41)

It is important to observe that:

  1. 0.

    It holds that νmj(s,a)Nτ(mj)(s,a), by the first criterion in Line theequation-at-IDj in Algorithm id1,

  2. 0.

    For m{mj+1,,mj+1-1}, we have m=mjm-1νm(s,a)Nτ(m)(s,a) . Indeed, we know that episodes mj,,mj+1-1 are inside the time interval {jW,,(j+1)W}. Consequently, the counts of “(st,at)=(s,a)” associated with {νm(s,a)}m=mjm-1 are contained in the W time steps preceding τ(m), hence the desired inequality.

With these two observations, we have

(41) νmj(s,a)max{νmj(s,a),1}+m=mj+1mj+1-1νm(s,a)max{m=mjm-1νm(s,a),1}
max{νmj(s,a),1}+(2+1)max{m=mjmj+1-1νm(s,a),1} (42)
(2+2)max{t=jW(j+1)W-1𝟏((st,at)=(s,a)),1}. (43)

Step (42) is by Lemma 19 in (Jaksch et al. 2010), which bounds the sum in the previous line. Step (43) is by the fact that episodes mj,,mj+1-1 partitions the time interval jW,,(j+1)W-1, and νm(s,a) counts the occurrences of (st,at)=(s,a) in episode m. Finally, observe that (41)=0 if νm(s,a)=0 for all m{mj,,mj+1-1}. Thus, we can refine the bound in (43) to

(41)(2+2)t=jW(j+1)W-1𝟏((st,at)=(s,a)). (44)

The required inequality (40) is finally proved by summing (44) over s𝒮a𝒜 and applying Cauchy Schwartz:


Altogether, the Lemma is proved. \@endparenv

Following the discussion in Section 3, we show that the MDP constituted by p^t is not “similar” to the MDP constitute by pt for any t[T] (in the sense shown by Jaksch et al. (2010)) by demonstrating that non-stationary MDPs present a difficult environment for learning directly from time average.

Lemma 8

There exists a sequence of non-stationary MDP transition kernels p1,,pN (which have the same S,A) such that

  1. 0.

    The diameter of (𝒮,𝒜,pn) is 1 for each n[N];

  2. 0.

    The total variation budget is Bp=O(1);

  3. 0.

    Under some deterministic policy, the empirical MDP constituted by (𝒮,𝒜,p~N) has diameter Ω(N/logN) for every p~NHp,N(0),

The discrepancy between (𝒮,𝒜,p~N) and (𝒮,𝒜,p1),,(𝒮,𝒜,pN) is due to the fact that, for different (s,a), the time average p¯N(|s,a) involve the transition probability from different time steps. The Lemma demonstrates that the optimistic MDP derived from the time-averaged transition kernel p^N can be very different from each of the MDPs constituted by (𝒮,𝒜,p1),,(𝒮,𝒜,pN).

In fact, even if the DM can construct, for each s,a, the desired time average p¯(|s,a)=n=1Npn(|s,a), the resulting MDP constituted by (𝒮,𝒜,p¯) can still be very different from the MDPs constituted by {(𝒮,𝒜,pn)}n=1N:

Lemma 9

There exists p1,,pN such that

  1. 0.

    (𝒮,𝒜,pn) constitutes an MDP with diameter at most S.

  2. 0.


  3. 0.

    (𝒮,𝒜,p¯) constitutes an MDPs with diamter at least Ω(S2)

Before going to the proofs of Lemmas 8, 9, we remark that there is no contradiction between Lemma 8 and (Jaksch et al. 2010, Gajane et al. 2018), who demonstrate a O(1/3T2/3) dynamic regret in a piece-wise stationary environment with at most changes (The work by (Gajane et al. 2018) improves upon the work by (Jaksch et al. 2010) in the dependence on D,S,A. Throughout this paragraph, we hide the dependence on D,S,A in the O(),Θ() notations, and only show the dependence on ,T). Indeed, they partition the whole horizon into T in Θ(2/3T1/3) time intervals of length Θ((T/)2/3) each, and run UCRL2 on each. The main argument is that at least Θ(2/3T1/3)- intervals enjoy stationary MDP environments. In the interval with non-stationary MDP, which are at most many, they bound the dynamic regret directly as times the length of each interval, which is equal to O(2/3T1/3). The latter linear-in-time-horizon bound does not contradict our previous claim that diameter is unbounded and the application of UCRL2 can be ruined.


Proof of Lemma 8]

The sequence p1,,pN alternates between the following 2 instances p1,p2. Now, define the common state space 𝒮={1,2} and action collection 𝒜={𝒜1,𝒜2}, where 𝒜1={a1,a2}, {𝒜2}={b1,b2}.

For p1, we have


For p2, we have


Please refer to Fig. 2 for a graphical illustration.

Figure 2: Example MDPs. Since the transitions are deterministic, the probabilities are omitted.

Clearly, we see that both instances have diameter 1. Now, consider the following two deterministic and stationary policy π1:


and π2


Since the MDP is deterministic, we have p^N=p¯N.

In the following, we construct a trajectory where the DM alternates between policies π1,π2 during time {1,,N} while the underlying transition kernel alternates between p1,p2. In the construction, the DM is almost always at the self-loop at state 1 (or 2) throughout the horizon, no matter what action a1,a2 (or b1,b2) she takes. Consequently, it will trick the DM into thinking that p^N(1|1,ai)1 for each i{1,2}, and likewise p^N(2|2,bi)1 for each i{1,2}. Altogether, this will lead the DM to conclude that (𝒮,𝒜,p^N) constitute a high diameter MDP, since the probability of transiting from state 1 to 2 (or 2 to 1) would appear very low.

The construction is detailed as follows. Let N=4τ+2. In addition, let

  • p1=p2==pτ=p1,

  • pτ+1=pτ+2==p2τ+1=p2,

  • p2τ+3=p2τ+4==p3τ+1=p1,

  • p3τ+2=p3τ+3==p4τ+2=p2.

The DM starts at state 1. It follows

  • policy π1 from time 1 to time 2τ+1,

  • policy π2 from 2τ+2 to 4τ+2.

Under the specified MDP models and policies, it can be readily verified that the DM takes

  • action a1 from time 1 to τ+1,

  • action b2 from time τ+2 to 2τ+1,

  • action b1 from time 2τ+2 to 3τ+2,

  • action a2 from time 3τ+3 to 4τ+2.

As a result, the DM is at

  • state 1 from time 1 to τ+1,

  • state 2 from time τ+2 to 2τ+1,

  • state 2 from time 2τ+2 to 3τ+3,

  • state 1 from time 3τ+3 to 4τ+2.

From the dynamics above, we can see that


Finally, for the confidence region Hp,N+1(0)={Hp,N+1(s,a;0)}s,a constructed without confidence widening, we know that for any p~Hp,N+1 we have p~(2|1,a1),p~(2|1,a2),p~(1|2,b1),p~(1|2,b2)=O(logNτ+1), since the stochastic confidence radius Θ(logNτ+1) dominates the sample mean 1τ+1. Therefore, for any p~Hp,N+1, the diameter of the MDP constructed by (𝒮,𝒜,p~) is at least Ω(N/logN). \@endparenv


Proof of Lemma 9] Denote 𝒮={1,,S}. For each 1nN/2 and each s𝒮, we have the set 𝒜s={a} being a singleton, with pn(s+1|s,a)=1 when 1s<S, and pn(1|S,a)=1. That is, the MDP is just a Markov Chain on a directed cycle. For each N/2+1nN and each s𝒮, we have the reverse: pt(s-1|s,a)=1 when 1<sS, and pn(S|1,a)=1. That is, when N/2+1nN, the directed cycle is reversed. It is clear that each (𝒮,𝒜,pn) constitutes an MDP with diameter at most S. However, the resulting time average transition kernel p¯ has property that p¯(s-1|s,a)=p¯(s+1|s,a)=1/2 for s=2,,S-1, and p¯(S-1|S,a)=p¯(1|S,a)=1/2, and p¯(S|1,a)=p¯(2|1,a)=1/2. It can be easily verified that the resulting MDP formed by (𝒮,𝒜,p¯) has diameter at least Ω(S2) as follows.

We follow the notations in Definition 1 to use Λ(s|π,s) to denote the hitting time from state s to s. Since we only have one action, we suppress π for brevity. Note that s𝒮{S2}

𝔼[Λ(S2|s)]=1+12𝔼[Λ(S2|(s+1)modS)]+12𝔼[Λ(S2|(s-1)modS)], (45)

which indicate

𝔼[Λ(S2|S2+1)]=𝔼[Λ(S2|S2-1)]=(S-1), (46)

where we have made use of the fact that Λ(S2|S2)=0 and 𝔼[Λ(S2|S2+1)]=𝔼[Λ(S2|S2-1)]. Rearranging terms in eq. (45), we also have s𝒮{S2}

𝔼[Λ(S2|s)-Λ(S2|(s-1)modS)]=2+𝔼[Λ(S2|(s+1)modS)-Λ(S2|s)]. (47)

That is to say s<S2,

𝔼[Λ(S2|(s+1)modS)-Λ(S2|s)]=-(S-1)+2(S2-s+1) (48)


𝔼[Λ(S2|1)]=s=1S/2-1𝔼[Λ(S2|s)-Λ(S2|s+1)]=s=1S/2-1(2s-3)=(S-6)(S-2)4. (49)

We next propose a way to deviate from Assumption 2 by examining the case of inventory control (similar arguments can be made for queuing systems and beyond). In inventory control Bertsekas (2017), a seller decides the stock level (state) by ordering a quantity of a certain item (action) in each of T rounds. For each round, a demand is realized (state transition), and unfulfilled demand incurs a shortage cost while excess inventory leads to a holding cost (reward). The seller’s objective is to minimize the cumulative shortage+holding cost. With some abuse of notations, we interpret state s as the level of stock, i.e., in state s, the level of stock is s. Then as long as the demand distribution has at least ζ mass on each of the support {0,1,,S}, the DM could always choose to order S-s quantity of the item so that with probability at least ζ, the demand realization is S-s. That is to say, pt(s|s,S-s)ζ for any t[T]. Now it is easy to see that for any t[T], we have the diameter of the time average MDP (𝒮,𝒜,p¯t) is at most D¯=1/ζ. In short, the system presented here can transition from states to states relatively easily. We formalize this observation as follows.

Assumption 3

There exists ζ>0 (not necessarily known to the DM), such that for any states s,sS, there is always an action as,sAs that satisfies pt(s|s,a(s,s))ζ for all t[T].

Under this very mild assumption, the DM can then implement the SWUCRL2-CW algorithm with η=0 as p¯tHp,t(0) for all t[T] with high probability by Lemma 1. To analyze its performance, we update Proposition 1 as follows

Proposition 4

Conditioning on events Er,Ep, for every episode m and every time t in the episode (i.e. t{τ(m),,τ(m+1)-1}) the following inequality holds for certainty:

+ [2𝗏𝖺𝗋r,t+4D¯𝗏𝖺𝗋p,t]+[2𝗋𝖺𝖽-rτ(m)(st,at)+4D¯𝗋𝖺𝖽-pτ(m)(s,a)].

The proof is very similar to that of Proposition 1 with Dτ(m) and η replaced by D¯ and 0, respectively.

Theorem 3

Assuming S>1, the SWUCRL2-CW algorithm with window size W and η=0 satisfies the dynamic regret bound


Here, we set δ=1/T. If we further put W=W*=S2/3A1/3T2/3/(Br+Bp+1)2/3 this is O~(D¯(Br+Bp+1)1/3S2/3A1/3T2/3).

The proof is very similar to that of Theorem 1, and is thus omitted.

By virtue of the EXP3.P, the BORL algorithm is able to adapt to any choice of (W,η)J. Specifically, we have from eq. (17) that for any choice of (W,η)J, the dynamic regret of the BORL algorithm is

Dyn-RegT(𝙱𝙾𝚁𝙻)= O~(BrW+Dmax[BpW+SATW+Tη+SATW+TH])
= O~(BrW+Dmax[BpW+SATW+Tη+SATW+S12A14T34]). (50)

Also note that

1η*=S23A13(Bp+1)2T13(Br+Bp+1)23S23A13(Bp+1)132T13S23A132T13=Φ. (51)

Therefore, there must exists some k such that

Φk/Δηη*Φ(k+1)/Δη (52)

By adapting η to Φk/Δη, we further upper bound eq. (50) as

= O~(BrW+Dmax[BpW+SATW+Tη*Φ1/Δη+SATW+S12A14T34])
= O~(BrW+Dmax[BpW+SATW+(Bp+1)S23A13T23(Br+Bp+1)23+SATW+S12A14T34]), (53)

where we have make use of the fact that Φ1/Δη=exp(1) in the last step.

We then turn to the choice of W by distinguish the following two cases:

  • W*H or (Br+Bp+1)T1/4/(S1/2A1/4): under this case, the BORL algorithm is able to adapt W to some Hj/ΔW such that

    Hj/ΔWW*H(j+1)/ΔW. (54)

    Note again that H1/ΔW=O(1), the dynamic regret of the BORL algorithmis

    = O~(BrW*+Dmax[BpW*+SATW*+(Bp+1)S23A13T23(Br+Bp+1)23+SATW*+S12A14T34])
    = O~(Dmax(Br+Bp+1)13S23A13T23+DmaxS12A14T34). (55)
  • W*>H or (Br+Bp+1)<T1/4/(S1/2A1/4): under this case, the BORL algorithm is only able to adapt W=H the dynamic regret of the BORL algorithmis

    = O~(BrH+Dmax[BpH+SATH+(Bp+1)S23A13T23(Br+Bp+1)23+SATH+S12A14T34])
    = O~(DmaxS12A14T34) (56)

Combining the above two cases, we conclude the proof.