Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism

  • 2020-05-18 16:34:50
  • Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
  • 0

Abstract

We consider un-discounted reinforcement learning (RL) in Markov decisionprocesses (MDPs) under temporal drifts, ie, both the reward and statetransition distributions are allowed to evolve over time, as long as theirrespective total variations, quantified by suitable metrics, do not exceedcertain variation budgets. This setting captures the endogeneity, exogeneity,uncertainty, and partial feedback in sequential decision-making scenarios, andfinds applications in vehicle remarketing and real-time bidding. We firstdevelop the Sliding Window Upper-Confidence bound for Reinforcement Learningwith Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamicregret bound when the variation budgets are known. In addition, we propose theBandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune theSWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in aparameter-free manner, ie, without knowing the variation budgets. Finally, weconduct numerical experiments to show that our proposed algorithms achievesuperior empirical performance compared to existing algorithms. Notably, the interplay between endogeneity and exogeneity presents a uniquechallenge, absent in existing (stationary and non-stationary) stochastic onlinelearning settings, when we apply the conventional Optimism in Face ofUncertainty principle to design algorithms with provably low dynamic regret forRL in drifting MDPs. We overcome the challenge by a novel confidence wideningtechnique that incorporates additional optimism into our learning algorithms toensure low dynamic regret bounds. To extend our theoretical findings, we applyour framework to inventory control problems, and demonstrate how one canalternatively leverage special structures on the state transition distributionsto bypass the difficulty in exploring time-varying environments.

 

Quick Read (beta)

Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism

Wang Chi Cheung

Department of Industrial Systems Engineering and Management, National University of Singapore

[email protected]

David Simchi-Levi

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139

[email protected]

Ruihao Zhu

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139

[email protected]

We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under temporal drifts, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. This setting captures endogeneity, exogeneity, uncertainty, and partial feedback in sequential decision-making scenarios, and finds applications in various online marketplaces, epidemic control, and transportation. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Finally, we conduct numerical experiments to show that our proposed algorithms achieve superior empirical performance compared with existing algorithms.

Notably, the interplay between endogeneity and exogeneity presents a unique challenge, absent in existing (stationary and non-stationary) stochastic online learning settings, when one applies the conventional Optimism in Face of Uncertainty (OFU) principle to design algorithms with provably low dynamic regret for RL in non-stationary MDPs. We overcome this challenge by a novel confidence widening technique that incorporates additional optimism into our learning algorithms to ensure low dynamic regret bounds. To extend our theoretical findings, we demonstrate, in the context of single item inventory control with fixed cost, how one can leverage special structures on the state transition distributions to bypass the difficulty of exploring time-varying environments.

Key words: reinforcement learning, data-driven decision making, revenue management, confidence widening

 

Consider a general sequential decision-making framework, where a decision-maker (DM) interacts with an initially unknown environment iteratively. At each time step, the DM first observes the current state of the environment, and then chooses an available action. After that, she receives an instantaneous random reward, and the environment transitions to the next state. The DM aims to design a policy that maximizes its cumulative rewards, while facing the following challenges:

  • Endogeneity: At each time step, the reward follows a reward distribution, and the subsequent state follows a state transition distribution. Both distributions depend (solely) on the current state and action, which are influenced by the policy. Hence, the environment can be fully characterized by a discrete time Markov decision process (MDP).

  • Exogeneity: The reward and state transition distributions vary (independently of the policy) across time steps, but the total variations are bounded by the respective variation budgets.

  • Uncertainty: Both the reward and state transition distributions are initially unknown to the DM.

  • Bandit/Partial Feedback: The DM can only observe the reward and state transition resulted by the current state and action in each time step.

It turns out that many applications, such as vehicle remarketing in used-car sales and real-time bidding in advertisement (ad) auctions, can be captured by this framework.

Example 1 (Vehicle Remarketing in Used-Cars Sales)

An automobile company disposes of continually arriving off-lease vehicles (i.e., leasing vehicles that have reached the end of their fixed term) via daily wholesale vehicle auctions (Manheim 2020, Vehicle Remarketing 2020). At the beginning of each auction, the company observes the number of on-hand vehicles (the “state”), and decides the number of off-lease vehicles to be listed (the “action”). Then, the car dealers bid for the purchases via a first-price auction. The sales of vehicles generate revenue to the company while unsold vehicles incur holding cost to the company (the “reward” and “state transition”). The company aims at maximizing profit by designing a policy that dynamically decides the vehicles to be listed in each auction. However, the dealers’ bidding behaviors are affected by many unpredictable (and thus exogenous) factors (e.g., real-time customer demands, vehicles’ depreciation, and inter-dealer competitions) in addition to the company’s decisions (i.e., the vehicles listed), and can vary across time.

Example 2 (Real-Time Bidding in Ads Auctions)

Advertisers repeatedly competes for ad display impressions via real-time online auctions (Google 2011, Cai et al. 2017, Flajolet and Jaillet 2017, Balseiro and Gur 2019, Guo et al. 2019). Each advertiser begins with a budget. Upon the arrival of a user, an impression is generated, and the advertisers submit bids (the “action”) for it subject to her remaining budget (the “state”). The winning advertiser acquires the impression to display her ad to the user, and observes the user click or no-click behavior (the “reward”). For each slot won, the advertiser has to make the payment (determined by the auction mechanism) using her remaining budget, and the budget is periodically refilled (the “state transition”). Each advertiser wants to maximize the number of clicks on her advertisement subject to her own (continuously evolving) budget constraint. Nevertheless, the competitiveness of each auction exhibits exogeneity as the participating advertisers and the arriving users are different from time to time. Moreover, the popularity of an ad can change due to endogenous reasons. For instance, displaying the same ad too frequently in a short period of time might reduce its freshness, and results in a tentatively low number of clicks (i.e., we can incorporate both the remaining budget and the number of times that the ad is shown within a given window size into the state of the MDP to model endogenous dynamics).

Besides, this framework can be used to model sequential decision-making problems in transportation (Zhang and Wang 2018, Qin et al. 2019), wireless network (Zhou and Bambos 2015, Zhou et al. 2016), consumer choice modeling (Xu and Yun 2020), healthcare operations (Shortreed et al. 2010), epidemic control (Nowzari et al. 2016, Kiss et al. 2017), and inventory control (Huh and Rusmevichientong 2009, Bertsekas 2017, Zhang et al. 2018, Agrawal and Jia 2019, Chen et al. 2019a).

There exists numerous works in sequential decision-making that considered part of the four challenges (Please refer to Table 1 for a summary and comparison). The traditional stream of research (Auer et al. 2002b, Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018) on stochastic multi-armed bandits (MAB) focused on the interplay between uncertainty and bandit feedback (i.e., challenges 3 and 4), and (Auer et al. 2002b) proposed the classical Upper Confidence Bound (UCB) algorithm. Starting from (Burnetas and Katehakis 1997, Tewari and Bartlett 2008, Jaksch et al. 2010), a volume of works (see Section id1) have been devoted to reinforcement learning (RL) in MDPs (Sutton and Barto 2018), which further involves endogeneity. RL in MDPs incorporate challenges 1,3,4, and stochastic MAB is a special case of MDPs when there is only one state. In the absence of exogeneity, the reward and state transition distributions are invariant across time, and these three challenges can be jointly solved by the Upper Confidence bound for Reinforcement Learning (UCRL2) algorithm (Jaksch et al. 2010).

The UCB and UCRL2 algorithms leverage the optimism in face of uncertainty (OFU) principle to select actions iteratively based on the entire collections of historical data. However, both algorithms quickly deteriorate when exogeneity emerge since the environment can change over time, and the historical data becomes obsolete. To address the challenge of exogeneity, (Garivier and Moulines 2011) considered the piecewise-stationary MAB environment where the reward distributions remain unaltered over certain time periods and change at unknown time steps. Later on, there is a line of research initiated by (Besbes et al. 2014) that studied the general non-stationary MAB environment (Besbes et al. 2014, Cheung et al. 2019b, a), in which the reward distributions can change arbitrarily over time, but the total changes (quantified by a suitable metric) is upper bounded by a variation budget (Besbes et al. 2014). The aim is to minimize the dynamic regret, the optimality gap compared to the cumulative rewards of the sequence of optimal actions. Both the (relatively restrictive) piecewise-stationary MAB and the general non-stationary MAB settings consider the challenges of exogeneity, uncertainty, and partial feedback (i.e., challenges 2, 3, 4), but endogeneity (challenge 1) are not present.

In this paper, to address all four above-mentioned challenges, we consider RL in non-stationary MDPs where bot the reward and state transition distributions can change over time, but the total changes (quantified by suitable metrics) are upper bounded by the respective variation budgets. We note that in (Jaksch et al. 2010), the authors also consider the intermediate RL in piecewise-stationary MDPs. Nevertheless, we first demonstrate in Section id1, and then rigorously show in Section id1 that simply adopting the techniques for non-stationary MAB (Besbes et al. 2014, Cheung et al. 2019b, a) or RL in piecewise-stationary MDPs (Jaksch et al. 2010) to RL in non-stationary MDPs may result in poor dynamic regret bounds.

Endogeneity Exogeneity Uncertainty Bandit feedback
Stationary MAB
RL in stationary MDPs
Non-stationary MAB
RL in non-stationary MDPs
Table 1: Summary of different sequential decision-making settings. Among them, RL in non-stationary MDPs is the only setting that addresses all four challenges.

Assuming that, during the T time steps, the total variations of the reward and state transition distributions are bounded (under suitable metrics) by the variation budgets Br(>0) and Bp(>0), respectively, we design and analyze novel algorithms for RL in non-stationary MDPs. Let Dmax, S, and A be respectively the maximum diameter (a complexity measure to be defined in Section id1), number of states, and number of actions in the MDP. Our main contributions are:

  • We develop the Sliding Window UCRL2 with Confidence Widening (SWUCRL2-CW) algorithm. When the variation budgets are known, we prove it attains a O~(Dmax(Br+Bp)1/4S2/3A1/2T3/4) dynamic regret bound via a budget-aware analysis.

  • We propose the Bandit-over-Reinforcement Learning (BORL) algorithm that tunes the SWUCRL2-CW algorithm adaptively, and retains the same O~(Dmax(Br+Bp)1/4S2/3A1/2T3/4) dynamic regret bound without knowing the variation budgets.

  • We identify an unprecedented challenge for RL in non-stationary MDPs with conventional optimistic exploration techniques: existing algorithmic frameworks for non-stationary online learning (including non-stationary bandit and RL in piecewise-stationary MDPs) (Jaksch et al. 2010, Garivier and Moulines 2011, Cheung et al. 2019b) typically estimate unknown parameters by averaging historical data in a “forgetting” fashion, and construct the tightest possible confidence regions/intervals accordingly. They then optimistically search for the most favorable model within the confidence regions, and execute the corresponding optimal policy. However, we first demonstrate in Section id1, and then rigorously show in Section id1 that in the context of RL in non-stationary MDPs, the diameters induced by the MDPs in the confidence regions constructed in this manner can grow wildly, and may result in unfavorable dynamic regret bound. We overcome this with our novel proposal of extra optimism via the confidence widening technique. A summary of the algorithmic frameworks for stationary and non-stationary online learning settings are provided in Table 2.

    Stationary Non-stationary
    MAB OFU (Auer et al. 2002b) OFU + Forgetting (Besbes et al. 2014, Cheung et al. 2019a)
    RL OFU (Jaksch et al. 2010) Extra optimism + Forgetting (This paper)
    Table 2: Summary of algorithmic frameworks of stationary and non-stationary online learning settings.
  • As a complement to this finding, suppose for any pair of initial state and target state, there always exists an action such that the probability of transiting from the initial state to the target state by taking this action is lower bounded uniformly over the entire time horizon, the DM can attain low dynamic regret without widening the confidence regions. We demonstrate that in the context of single item inventory control with fixed cost (Yuan et al. 2019), a mild condition on the demand distribution is sufficient for this extra assumption to hold.

The rest of the paper is organized as follows: in Section id1, we describe the non-stationary MDP model of interest. In Section id1, we review related works in non-stationary online learning and reinforcement learning. In Section id1, we introduce the SWUCRL2-CW algorithm, and analyze its performance in terms of dynamic regret. In Section 7, we design the BORL algorithm that can attain the same dynamic regret bound as the SWUCRL2-CW algorithm without knowing the total variations. In Section id1, we discuss the challenges in designing learning algorithms for reinforcement learning under drift, and manifest how the novel confidence widening technique can mitigate this issue. In Section 9, we discuss the alternative approach without widening the confidence regions in inventory control problems. In Section 11, we conduct numerical experiments to show the superior empirical performances of our algorithms. In Section id1, we conclude our paper.

In this section, we introduce the notations to be used throughout paper, and introduce the learning protocol for our problem of RL in non-stationary MDPs.

Throughout the paper, all vectors are column vectors, unless specified otherwise. We define [n] to be the set {1,2,,n} for any positive integer n. We denote 𝟏[] as the indicator function. For p[1,], we use xp to denote the p-norm of a vector xd. We denote xy and xy as the maximum and minimum between x,y, respectively. We adopt the asymptotic notations O(),Ω(), and Θ() (Cormen et al. 2009). When logarithmic factors are omitted, we use O~(),Ω~(), Θ~(), respectively. With some abuse, these notations are used when we try to avoid the clutter of writing out constants explicitly.

Model Primitives: An instance of non-stationary MDP is specified by the tuple (𝒮,𝒜,T,r,p). The set 𝒮 is a finite set of states. The collection 𝒜={𝒜s}s𝒮 contains a finite action set 𝒜s for each state s𝒮. We say that (s,a) is a state-action pair if s𝒮,a𝒜s. We denote S=|𝒮|, A=(s𝒮|𝒜s|)/S. We denote T as the total number of time steps, and denote r={rt}t=1T as the sequence of mean rewards. For each t, we have rt={rt(s,a)}s𝒮,a𝒜s, and rt(s,a)[0,1] for each state-action pair (s,a). In addition, we denote p={pt}t=1T as the sequence of state transition distributions. For each t, we have pt={pt(|s,a)}s𝒮,a𝒜s, where pt(|s,a) is a probability distribution over 𝒮 for each state-action pair (s,a).

Exogeneity: The quantities rt’s and pt’s vary across different t’s in general. Following (Besbes et al. 2014), we quantify the variations on rt’s and pt’s in terms of their respective variation budgets Br,Bp(>0):

Br =t=1T-1Br,t, where Br,t=maxs𝒮,a𝒜s|rt+1(s,a)-rt(s,a)|,
Bp =t=1T-1Bp,t, where Bp,t=maxs𝒮,a𝒜spt+1(|s,a)-pt(|s,a)1. (1)

We emphasize although Br and Bp might be used as inputs by the DM, individual Br,t’s and Bp,t’s are unknown to the DM throughout the current paper. We also refer to Remark 2 for the choice of infinity-norm and 1-norm in eqn. (1).

Endogeneity: The DM faces a non-stationary MDP instance (𝒮,𝒜,T,r,p). She knows 𝒮,𝒜,T, but not r,p. The DM starts at an arbitrary state s1𝒮. At time t, three events happen. First, the DM observes its current state st. Second, she takes an action at𝒜st. Third, given st,at, she stochastically transits to another state st+1 which is distributed as pt(|st,at), and receives a stochastic reward Rt(st,at), which is 1-sub-Gaussian with mean rt(st,at). In the second event, the choice of at is based on a non-anticipatory policy Π. That is, the choice only depends on the current state st and the previous observations t-1:={sq,aq,Rq(sq,aq)}q=1t-1.

Dynamic Regret: The DM aims to maximize the cumulative expected reward 𝔼[t=1Trt(st,at)], despite the model uncertainty on r,p and the dynamics of the learning environment. To measure the convergence to optimality, we consider an equivalent objective of minimizing the dynamic regret (Besbes et al. 2014, Jaksch et al. 2010)

Dyn-RegT(Π)=t=1T{ρt*-𝔼[rt(st,at)]}. (2)

In the oracle t=1Tρt*, the summand ρt* is the optimal long-term average reward of the stationary MDP with state transition distribution pt and mean reward rt. The optimum ρt* can be computed by solving linear program (15) provided in Section id1. We note that the same oracle is used for RL in piecewise-stationary MDPs (Jaksch et al. 2010).

Remark 1 (Comparisons with Non-Stationary MAB)

When S=1, eqn. (2) reduces to the definition (Besbes et al. 2014) of dynamic regret for non-stationary K-armed bandit. Different from the bandit case, however, the oracle t=1Tρt* does not equal to the expected optimum for the non-stationary MDP problem in general. Nevertheless, we justify this choice in Proposition 1.

Remark 2 (Definition of Variation Budgets)

For brevity of exposition, we choose to define the variation budgets (see eqn. (1) ) for reward and state transition distributions with the infinity norm and 1 norm, respectively. One can also define them with respect to other commonly used metrics, such as the 2 norm (Cheung et al. 2019b), and the this would only affect the dependence on S and A for the established dynamic regret bounds in the subsequent sections.

Next, we review relevant concepts on MDPs, in order to stipulate an assumption that ensures learnability and justifies our oracle.

Definition 1 (Communicating MDPs and Diameter (Jaksch et al. 2010))

Consider a set of states S, a collection A={As}sS of action sets, and a state transition distribution p¯={p¯(|s,a)}sS,aAs. For any s,sS and stationary policy π, the hitting time from s to s under π is the random variable Λ(s|π,s):=min{t:st+1=s,s1=s,sτ+1p¯(|sτ,π(sτ)) τ}, which can be infinite. We say that (S,A,p¯) is a communicating MDP iff D:=maxs,sSminstationary πE[Λ(s|π,s)] is finite. The quantity D is the diameter associated with (S,A,p¯).

Remark 3 (Diameter and RL in MDPs)

As shown in (Jaksch et al. 2010), “diameter” plays a fundamental role in characterizing the complexity of RL in MDPs. Intuitively, in order to make informative decisions, the DM has to have accurate estimates of the quantities rt(s,a)’s and pt(|s,a).’s. In other words, she must visit every state sS and choose each of its available actions aAs frequently enough to collect relevant samples. Consequently, the harder to transition from a state s to another state s, the more the DM would suffer during the learning process, and the diameter of a MDP captures the “hardness” of transitioning between states in this MDP.

With the above remark, we make the following assumption.

Assumption 1 (Bounded Diameters)

For each t[T], the tuple (S,A,pt) constitutes a communicating MDP with diameter at most Dt. We denote the maximum diameter as D𝑚𝑎𝑥=maxt{1,,T}Dt.

The following proposition justifies our choice of oracle t=1Tρt*.

Proposition 1

Consider an instance (S,A,T,p,r) that satisfies Assumption 1 with maximum diameter D𝑚𝑎𝑥, and has variation budgets Br,Bp for the rewards and transition distributions respectively. In addition, suppose that TBr+2D𝑚𝑎𝑥Bp>0, then it holds that

t=1Tρt*maxΠ{𝔼[t=1Trt(stΠ,atΠ)]}-4(D𝑚𝑎𝑥+1)(Br+2D𝑚𝑎𝑥Bp)T.

The maximum is taken over all non-anticipatory policies Π’s. We denote {(stΠ,atΠ)}t=1T as the trajectory under policy Π, where atΠAstΠ is determined based on Π and Ht-1{stΠ}, and st+1Πpt(|stΠ,atΠ) for each t.

The Proposition is proved in section id1 of the appendix. In fact, our dynamic regret bounds (see the forthcoming Theorems 1 and 2) are larger than the error term 4(Dmax+1)(Br+2DmaxBp)T, thus justifying the choice of t=1Tρt* as the oracle. It turns out that the oracle t=1Tρt* is more convenient for analysis than the expected optimum, since the former can be decomposed to summations across different intervals, unlike the latter where the summands are intertwined due to endogenous dynamics, i.e., st+1Πpt(|stΠ,atΠ).

RL in stationary (discounted and un-discounted reward) MDPs has been widely studied in (Burnetas and Katehakis 1997, Bartlett and Tewari 2009, Jaksch et al. 2010, Agrawal and Jia 2017, Fruit et al. 2018a, b, Sidford et al. 2018b, a, Wang 2019, Zhang and Ji 2019, Fruit et al. 2019, Wei et al. 2019). For the discounted reward setting, the authors of (Sidford et al. 2018b, Wang 2019, Sidford et al. 2018a) proposed (nearly) optimal algorithms in terms of sample complexity. For the un-discounted reward setting, the authors of (Jaksch et al. 2010) established a minimax lower bound Ω(DmaxSAT) on the regret when both the reward and state transition distributions are time-invariant. They also designed the UCRL2 algorithm and showed that it attains a regret bound O~(DmaxSAT). The authors of (Fruit et al. 2019) proposed the UCRL2B algorithm, which is an improved version of the UCRL2 algorithm. The regret bound of the UCRL2B algorithm is O~(SDmaxAT+Dmax2S2A). The minimax optimal algorithm is provided in (Zhang and Ji 2019) although it is not computationally efficient.

In a parallel work (Ortner et al. 2019), the authors considered a similar setting to ours by applying the “forgetting principle” from non-stationary bandit settings (Garivier and Moulines 2011, Cheung et al. 2019a) to design a learning algorithm. To achieve its dynamic regret bound, the algorithm by (Ortner et al. 2019) partitions the entire time horizon [T] into time intervals ={Ik}k=1K, and crucially requires the access to t=minIkmaxIk-1Br,t and t=minIkmaxIk-1Bp,t, i.e., the variations in both reward and state transition distributions of each interval Ik (see Theorem 3 in (Ortner et al. 2019)). In contrast, the SWUCRL2-CW algorithm and the BORL algorithm require significantly less information on the variations. Specifically, the SWUCRL2-CW algorithm does not need any additional knowledge on the variations except for Br and Bp, i.e., the variation budgets over the entire time horizon as defined in eqn. (1), to achieve its dynamic regret bound (see Theorem 1). This is similar to algorithms for the non-stationary bandit settings, which only require the access to Br (Besbes et al. 2014). More importantly, the BORL algorithm (built upon the SWUCRL2-CW algorithm) enjoys the same dynamic regret bound even without knowing either of Br or Bp (see Theorem 2).

There also exists some settings that are closely related to, but different than our setting (in terms of exogeneity and feedback). (Jaksch et al. 2010, Gajane et al. 2018) proposed solutions for the RL in piecewise-stationary MDPs setting. But as discussed in Section id1, simply applying their techniques to the general RL in non-stationary MDPs may result in undesirable dynamic regret bounds (see Section id1 for more details). In (Yu et al. 2009, Neu et al. 2010, Arora et al. 2012, Dick et al. 2014, Jin et al. 2019, Cardoso et al. 2019), the authors considered RL in MDPs with changing reward distributions but fixed transition distributions. The authors of (Even-Dar et al. 2005, Yu and Mannor 2009, Neu et al. 2012, Abbasi-Yadkori et al. 2013, Rosenberg and Mansour 2019, Li et al. 2019) considered RL in non-stationary MDPs with full information feedback.

For online learning and bandit problems where there is only one state, the works by (Auer et al. 2002a, Garivier and Moulines 2011, Besbes et al. 2014, Keskin and Zeevi 2016) proposed several “forgetting” strategies for different non-stationary MAB settings. More recently, the works by (Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019b, a, Chen et al. 2019b) designed parameter-free algorithms for non-stationary MAB problems. Another related but different setting is the Markovian bandit (Kim and Lim 2016, Ma 2018), in which the state of the chosen action evolve according to an independent time-invariant Markov chain while the states of the remaining actions stay unchanged. In (Zhou et al. 2020), the authors also considered the case when the states of all the actions are governed by the same (uncontrollable) Markov chain.

In this section, we first describe a unique challenge in RL in non-stationary MDPs, and then present the SWUCRL2-CW algorithm, which incorporates our novel confidence widening technique and sliding window estimates (Garivier and Moulines 2011) into UCRL2 (Jaksch et al. 2010).

For stationary MAB problems, the UCB algorithm (Auer et al. 2002b) suggests the DM should iteratively execute the following two steps in each time step:

  1. 0.

    Estimate the mean reward of each action by taking the time average of all observed samples.

  2. 0.

    Pick the action with the highest estimated mean reward plus the confidence radius, where the radius scales inversely proportional with the number of observations (Auer et al. 2002b).

The UCB algorithm has been proved to attain optimal regret bounds for various stationary MAB settings (Auer et al. 2002b, Kveton et al. 2015). For non-stationary problems, (Garivier and Moulines 2011, Keskin and Zeevi 2016, Cheung et al. 2019a) shown that the DM could further leverage the forgetting principle by incorporating the sliding-window estimator (Garivier and Moulines 2011) into the UCB algorithms (Auer et al. 2002b, Kveton et al. 2015) to achieve optimal dynamic regret bounds for a wide variety of non-stationary MAB settings. The sliding window UCB algorithm with a window size W+ is similar to the UCB algorithm except that the estimated mean rewards are computed by taking the time average of the W most recent observed samples.

As noted in Section id1, (Jaksch et al. 2010) proposed the UCRL2 algorithm, which is a UCB-alike algorithm with nearly optimal regret for RL in stationary MDPs. It is thus tempting to think that one could also integrate the forgetting principle into the UCRL2 algorithm to attain low dynamic regret bound for RL in non-stationary MDPs. In particular, one could easily design a naive sliding-window UCRL2 algorithm that follows exactly the same steps as the UCRL2 algorithm with the exception that it uses only the W most recent observed samples instead of all observed samples to estimate the mean rewards and the state transition distributions, and to compute the respective confidence radius.

Under non-stationarity and bandit feedback, however, we show in Proposition 3 of the forthcoming Section id1 that the diameter of the estimated MDP produced by the naive sliding-window UCRL2 algorithm with window size W can be as large as Θ(W), which is orders of magnitude larger than Dmax, the maximum diameter of each individual MDP encountered by the DM. Consequently, the naive sliding-window UCRL2 algorithm may result in undesirable dynamic regret bound. In what follows, we discuss in more details how our novel confidence widening technique can mitigate this issue.

The SWUCRL2-CW algorithm first specifies a sliding window parameter W and a confidence widening parameter η0. Parameter W specifies the number of previous time steps to look at. Parameter η quantifies the amount of additional optimistic exploration, on top of the conventional optimistic exploration using upper confidence bounds. The latter turns out to be helpful for handling the temporal drifts in the state transition distributions (see Section id1).

The algorithm runs in a sequence of episodes that partitions the T time steps. Episode m starts at time τ(m) (in particular τ(1)=1), and ends at the end of time step τ(m+1)-1. Throughout an episode m, the DM follows a certain stationary policy π~τ(m). The DM ceases the mth episode if at least one of the following two criteria is met:

  • The time index t is a multiple of W. Consequently, each episode last for at most W time steps. The criterion ensures that the DM switches the stationary policy π~τ(m) frequently enough, in order to adapt to the exogenous dynamics.

  • There exists some state-action pair (s,a) such that ντ(m)(s,a), the number of time step t’s with (st,at)=(s,a) within episode m, is at least as many as the total number of counts for it within the W time steps prior to τ(m), i.e., from (τ(m)-W)1 to (τ(m)-1). This is similar to the doubling criterion in (Jaksch et al. 2010), which ensures that each episode is sufficiently long so that the DM can focus on learning.

The combined effect of these two criteria allows the DM to learn a low dynamic regret policy with historical data from an appropriately sized time window and confidence widening parameter. One important piece of ingredient is the construction of the policy π~τ(m), for each episode m. To allow learning under endogenous and exogenous dynamics, the SWUCRL2-CW algorithm computes the policy π~m based on the history in the W time steps prior to the current episode m, i.e., from round (τ(m)-W)1 to round τ(m)-1. The construction of π~τ(m) involves the Extended Value Iteration (EVI) (Jaksch et al. 2010), which requires the confidence regions Hr,τ(m),Hp,τ(m)(η) for rewards and state transition distributions as the inputs, in addition to an precision parameter ϵ. The confidence widening parameter η0 is capable of ensuring the MDP output by the EVI has a bounded diameter most of the time.

To describe SWUCRL2-CW algorithm, we first define for each state-action pair (s,a) and each time t in episode m,

Nt(s,a)=q=(τ(m)-W)1t-1𝟏((sq,aq)=(s,a)),Nt+(s,a)=max{1,Nt(s,a)}. (3)

For each state-action pair (s,a) and each time t in episode m, we consider the empirical mean estimator

r^t(s,a)=1Nt+(s,a)(q=(τ(m)-W)1t-1Rq(s,a)𝟏(sq=s,aq=a)),

which serves to estimate the average reward

r¯t(s,a)=1Nt+(s,a)(q=(τ(m)-W)1t-1rq(s,a)𝟏(sq=s,aq=a)).

The confidence region Hr,t={Hr,t(s,a)}s𝒮,a𝒜s is defined as

Hr,t(s,a)={r˙[0,1]:|r˙-r^t(s,a)|𝗋𝖺𝖽-r,t(s,a)}, (4)

with confidence radius 𝗋𝖺𝖽-r,t(s,a)=22log(SAT/δ)/Nt+(s,a). {algorithm}[!ht] SWUCRL2-CW algorithm \[email protected]@algorithmic[1] \StateInput: Time horizon T, state space 𝒮, and action space 𝒜, window size W, confidence widening parameter η. \StateInitialize t1, initial state s1. \Forepisode m=1,2, \StateSet τ(m)t, ντ(m)(s,a)0, and Nτ(m)(s,a) according to eqn (3), for all s,a.   \StateCompute the confidence regions Hr,τ(m), Hp,τ(m)(η) according to eqns (4, 6).  \StateCompute a (1/τ(m))-optimal optimistic policy π~τ(m):

𝖤𝖵𝖨(Hr,τ(m),Hp,τ(m)(η);1/τ(m))(π~τ(m),r~τ(m),p~τ(m),ρ~τ(m),γ~τ(m)).
\While

t is not a multiple of W and νm(st,π~τ(m)(st))<Nτ(m)+(st,π~τ(m)(st)) \StateChoose action at=π~τ(m)(st), observe reward Rt(st,at) and the next state st+1.\StateUpdate ντ(m)(st,at)ντ(m)(st,at)+1,tt+1.  \Ift>T \StateThe algorithm is terminated. \EndIf\EndWhile\EndFor

For each state-action pair s,a and each time step t in episode m, we consider the empirical mean estimator

p^t(s|s,a)=1Nt+(s,a)(q=(τ(m)-W)1t-1𝟏(sq=s,aq=a,sq+1=s)),

which serves to estimate the average transition probability

p¯t(s|s,a)=1Nt+(s,a)q=(τ(m)-W)1t-1pq(s|s,a)𝟏(sq=s,aq=a). (5)

Different from the case of estimating reward, the confidence region Hp,t(η)={Hp,t(s,a;η)}s𝒮,a𝒜s for the transition probability involves a widening parameter η0:

Hp,t(s,a;η)={p˙Δ𝒮:p˙(|s,a)-p^t(|s,a)1𝗋𝖺𝖽-p,t(s,a)+η}, (6)

with confidence radius 𝗋𝖺𝖽-p,t(s,a)=22Slog(SAT/δ)/Nt+(s,a). With η>0, the DM can explore state transition distributions that deviate from the sample average, and the exploration is crucial for learning MDPs under endogenous and exogenous dynamics. In a nutshell, the incorporation of η provides an additional source of optimism. We treat η as a hyper-parameter at the moment, and provide a suitable choice of η when we discuss our main results (see Theorem 1).

The SWUCRL2-CW algorithm relies on the EVI, which solves MDPs with optimistic exploration to near-optimality. We extract and rephrase a description of EVI in Section id1 of the appendix. EVI inputs the confidence regions Hr,Hp for the rewards and the state transition distributions. The algorithm outputs an “optimistic MDP model”, which consists of reward vector r~ and state transition distribution p~ under which the optimal average gain ρ~ is the largest among all r˙Hr,p˙Hp:

  • Input: Confidence regions Hr for r, Hp for p, and an error parameter ϵ>0.

  • Output: The returned policy π~ and the auxiliary output (r~,p~,ρ~,γ~). In the latter, r~, p~, and ρ~ are the selected “optimistic” reward vector, state transition distribution, and the corresponding long term average reward. The output γ~+𝒮 is a bias vector (Jaksch et al. 2010). For each s𝒮, the quantity γ~(s) is indicative of the short term reward when the DM starts at state s and follows the optimal policy. By the design of EVI, for the output γ~, there exists s𝒮 such that γ~(s)=0. Altogether, we express

    EVI(Hr,Hp;ϵ)(π~,r~,p~,ρ~,γ~).

Combining the three components, a formal description of the SWUCRL2-CW algorithm is shown in Algorithm id1.

We now analyze the performance of the SWUCRL2-CW algorithm. First, we introduce two events r,p, which state that the estimated reward and state transition distributions lie in the respective (un-widened) confidence regions.

r={r¯t(s,a)Hr,t(s,a)s,a,t},p={p¯t(|s,a)Hp,t(s,a;0)s,a,t}.

We prove that r,p hold with high probability.

Lemma 1

We have Pr[Er]1-δ/2, Pr[Ep]1-δ/2.

The proof of Lemma 1 is provided in Section id1 of the appendix. In defining p, the widening parameter η is set to be 0, since we are only concerned with the estimation error on p. Next, we bound the dynamic regret of each time step, under certain assumptions on Hp,t(η). To facilitate our discussion, we define the following variation measure for each t in an episode m:

𝗏𝖺𝗋r,t=q=(τ(m)-W)1t-1Br,q,𝗏𝖺𝗋p,t=q=(τ(m)-W)1t-1Bp,q.
Proposition 2

Consider an episode m. Condition on events Er,Ep, and suppose that there exists a state transition distribution p satisfying two properties: (1) sSaAs, we have p(|s,a)Hp,τ(m)(s,a;η), and (2) the diameter of (S,A,p) at most D. Then, for every t{τ(m),,τ(m+1)-1} in episode m, we have

ρt*-rt(st,at) [s𝒮pt(s|st,at)γ~τ(m)(s)]-γ~τ(m)(st) (7)
+1τ(m)+ [2𝗏𝖺𝗋r,t+4D(𝗏𝖺𝗋p,t+η)]+[2𝗋𝖺𝖽-r,τ(m)(st,at)+4D𝗋𝖺𝖽-p,τ(m)(s,a)]. (8)

The proof of Proposition 2 is provided in Section id1 of the appendix. Next, we state our first main result, which provides a dynamic regret bound assuming the knowledge of Br,Bp to set W,η:

Theorem 1

Assuming S>1, the SWUCRL2-CW algorithm with window size W, confidence widening parameter η>0, and δ=T-1 satisfies the dynamic regret bound

O~(BpWη+BrW+SATW+D𝑚𝑎𝑥[BpW+SATW+Tη+SATW+T]).

If we further put W=W*=S2/3A1/2T1/2(Br+Bp)-1/2 and η=η*:=BpW*T-1, this is O~(D𝑚𝑎𝑥(Br+Bp)1/4S2/3A1/2T3/4).

The proof of Theorem 1 is provided in Section id1 of the appendix.

Remark 4 (Confidence Widening)

Similar to the regret analysis of the UCRL2 algorithm (Section 4 of (Jaksch et al. 2010)) and the UCRL2B algorithm (Lemma 3 and eqn. (10) of (Fruit et al. 2019)), Proposition 2 states that, if the confidence region Hp,τ(m)(η) contains a state transition distribution with diameter at most D, then the EVI provided with Hp,τ(m)(η) returns a policy with dynamic regret bound that grows at most linearly with D during episode m. However, as shown in Section id1 later, the parameter η has to be carefully chosen for D to be small as the worst case diameter of every state transition distribution in Hp,τ(m)(0) (i.e., setting η=0) can grow as Ω~(W), and might result in unfavorable dynamic regret bound. Here, the parameter η is the keystone of our novel confidence widening technique and the resulting dynamic regret bound: As η increases, the confidence region Hp,τ(m)(s,a;η) becomes larger for each state-action pair (s,a). Consider the first time step τ(m) of each episode m: if pτ(m)(|s,a)Hp,τ(m)(s,a;η) for all state-action pair (s,a), then Proposition 2 can be leveraged; otherwise, the widened confidence region enforces that a considerable amount of variation budget is consumed.

Remark 5 (Connections with Dynamic Regret Bounds for Non-Stationary MAB)

When S={s}, our problem becomes the non-stationary bandit problem studied by (Besbes et al. 2014), and we have D𝑚𝑎𝑥=0 and Bp=0. By choosing W=W*=A1/3T2/3/Br2/3, our algorithm has dynamic regret O~(Br1/3A1/3T2/3), matching the minimax optimal dynamic regret bound by (Besbes et al. 2014) when Br[A-1,A-1T].

Remark 6 (Connections with Regret Bounds for RL in Stationary MDPs)

When r1==rT and p1==pT, our problem becomes the RL in stationary MDPs problem studied by (Jaksch et al. 2010), and the SWUCRL2-CW algorithm with W=T and η=0 can recover the regret bound O~(DmaxSAT) of the UCRL2 algorithm studied in (Jaksch et al. 2010).

Remark 7 (Dynamic Regret Bound without Knowing Br and Bp)

Similar to (Cheung et al. 2019b, a), if Bp,Br are not known, we can set W and η obliviously as W=S23A12T12,η=W/T=S23A12T-12 to obtain a dynamic regret bound O~(D𝑚𝑎𝑥(Br+Bp+1)S2/3A1/2T3/4).

As pointed out by Remark 7, in the case of unknown Br and Bp, the dynamic regret of SWUCRL2-CW algorithm scales linearly in Br and Bp. However, by Theorem 1, we are assured a fixed pair of parameters (W*,η*) can ensure low dynamic regret. For the bandit setting, (Cheung et al. 2019a, b) propose the bandit-over-bandit framework that uses a separate copy of EXP3 algorithm to tune the window size. Inspired by it, we develop a novel Bandit-over-Reinforcement Learning (BORL) algorithm with parameter-free O~(Dmax(Br+Bp+1)1/4S2/3A1/2T3/4) dynamic regret here.

Following a similar line of reasoning as (Cheung et al. 2019a), we make use of the SWUCRL2-CW algorithm as a sub-routine, and “hedge” (Bubeck and Cesa-Bianchi 2012) against the (possibly adversarial) changes of rt’s and pt’s to identify a reasonable fixed window size and confidence widening parameter.

As illustrated in Fig. 1, the BORL algorithm divides the whole time horizon into T/H blocks of equal length H rounds (the length of the last block can H), and specifies a set J from which each pair of (window size, confidence widening) parameter are drawn from. For each block i[T/H], the BORL algorithm first calls some master algorithm to select a pair of (window size, confidence widening) parameters (Wi,ηi)(J), and restarts the SWUCRL2-CW algorithm with the selected parameters as a sub-routine to choose actions for this block. Afterwards, the total reward of block i is fed back to the master, and the “posterior” of these parameters are updated accordingly.

One immediate challenge not presented in the bandit setting (Cheung et al. 2019b) is that the starting state of each block is determined by previous moves of the DM. Hence, the master algorithm is not facing a simple oblivious environment as the case in (Cheung et al. 2019b), and we cannot use the EXP3 (Auer et al. 2002a) algorithm as the master. Nevertheless, fortunately the state is observed before the starting of a block. Thus, we use the EXP3.P algorithm for multi-armed bandit against an adaptive adversary (Auer et al. 2002a, Bubeck and Cesa-Bianchi 2012) as the master algorithm. We follow the exposition in Section 3.2 in (Bubeck and Cesa-Bianchi 2012) for adapting the EXP3.P algorithm.

Figure 1: Structure of the BORL algorithm

We are now ready to state the details of the BORL algorithm. For some fixed choice of block length H (to be determined later), we first define a couple of additional notations:

H=3S23A12T12,Φ=12T12,ΔW=lnH,Δη=lnΦ-1,Δ=(ΔW+1)(Δη+1), (9)
JW={H0,H1ΔW,,H},Jη=S13A14×{Φ0,Φ1Δη,,Φ},J={(W,η):WJW,ηJη}.

Here, JW and Jη are all possible choices of window size and confidence widening parameter, respectively, and J is the Cartesian product of them with |J|=Δ. We also let 𝖱i(W,η,s) be the total rewards for running the SWUCRL2-CW algorithm with window size W and confidence widening parameter η for block i starting from state s,

The EXP3.P algorithm treats each element of J as an arm. It begins by initializing

α=0.95lnΔΔT/H,β=lnΔΔT/H,γ=1.05ΔlnΔT/H,q(j,k),1=0(j,k)M, (10)

where M={(j,k):j{0,1,,ΔW},k{0,1,,Δη}}. At the beginning of each block i[T/H], the BORL algorithm first sees the state s(i-1)H+1, and computes

(j,k)M,u(j,k),i=(1-γ)exp(αq(j,k),i)(j,k)Mexp(αq(j,k),i)+γΔ. (11)

Then it sets (ji,ki)=(j,k) with probability u(j,k),i(j,k)M. The selected pair of parameters are thus Wi=Hji/ΔW and ηi=Φki/Δη. Afterwards, the BORL algorithm starts from state s(i-1)H+1, selects actions by running the SWUCRL2-CW algorithm with window size Wi and confidence widening parameter ηi for each round t in block i. At the end of the block, the BORL algorithm observes the total rewards 𝖱(Wi,ηi,s(i-1)H+1). As a last step, it rescales 𝖱(Wi,ηi,s(i-1)H+1) by dividing it by H so that it is within [0,1], and updates

(j,k)Mq(j,k),i+1=q(j,k),i+β+𝟏(j,k)=(ji,ki)𝖱i(Wi,ηi,s(i-1)H+1)/Hu(j,k),i. (12)

The formal description of the BORL algorithm (with H defined in the next subsection) is shown in Algorithm the-at-equationgroup-at-IDf. {algorithm}[!ht] BORL algorithm \[email protected]@algorithmic[1] \StateInput: Time horizon T, state space 𝒮, and action space 𝒜, initial state s1. \StateInitialize H,Φ,ΔW,Δη,Δ,JW,Jη according to eqn. (9), and α,β,γ according to eqn. (10). \StateM{(j,k):j{0,1,,ΔW},k{0,1,,Δη}},q(j,k),10(j,k)M. \Fori=1,2,,T/H \StateDefine distribution (u(j,k),i)(j,k)M according to eqn. (11), and set (ji,ki)(j,k) with probability u(j,k),i.\StateWiHji/ΔW,ηiΦki/Δη. \Fort=(i-1)H+1,,iHT \StateRun the SWUCRL2-CW algorithm with window size Wi and blow up parameter ηi, and observe the total rewards 𝖱(Wi,ηi,s(i-1)H+1). \EndFor\StateUpdate q(j,k),i+1 according to eqn. (12). \EndFor

The dynamic regret guarantee of the BORL algorithm can be presented as follows

Theorem 2

Assume S>1, with probability 1-O(δ), the dynamic regret bound of the BORL algorithm is O~(Dmax(Br+Bp+1)1/4S2/3A1/2T3/4)

The proof of Theorem 2 is provided in Section id1 of the appendix.

In stochastic online learning problems, one usually estimates a latent quantity by taking the time average of observed samples, even when the sample distribution varies across time. This has been proved to work well in stationary and non-stationary bandit settings (Auer et al. 2002b, Garivier and Moulines 2011, Cheung et al. 2019a, b). To extend to RL, it is natural to consider the sample average transition distribution p^t, which uses the data in the previous W rounds to estimate the time average transition distribution p¯t to within an additive error O~(1/Nt+(s,a)) (see Section id1 and Lemma 1). In the case of stationary MDPs, where t[T]pt=p, one has p¯t=p. Thus, the un-widened confidence region Hp,t(0) contains p with high probability (see Lemma 1). Consequently, the UCRL2 algorithm by (Jaksch et al. 2010), which optimistic explores Hp,t(0), has a regret that scales linearly with the diameter of p.

The approach of optimistic exploring Hp,t(0) is further extended to RL in piecewise-stationary MDPs by (Jaksch et al. 2010, Gajane et al. 2018). The latter establishes a O(1/3Dmax2/3S2/3A1/3T2/3) dynamic regret bounds, when there are at most changes. Their analyses involve partitioning the T-round horizon into CT1/3 equal-length intervals, where C is a constant dependent on Dmax,S,A,. At least CT1/3- intervals enjoy stationary environments, and optimistic exploring Hp,t(0) in these intervals yields a dynamic regret bound that scales linearly with Dmax. Bounding the dynamic regret of the remaining intervals by their lengths and tuning C yield the desired bound.

In contrast to the stationary and piecewise-stationary settings, optimistic exploration on Hp,t(0) might lead to unfavorable dynamic regret bounds in non-stationary MDPs. In the non-stationary environment where pt-W,,pt-1 are generally distinct, we show that it is impossible to bound the diameter of p¯t in terms of the maximum of the diameters of pt-W,,pt-1. More generally, we demonstrate the previous claim not only for p¯t, but also for every p~Hp,t(0) in the following Proposition. The Proposition showcases the unique challenge in exploring non-stationary MDPs that is absent in the piecewise-stationary MDPs, and motivates our notion of confidence widening with η>0. To ease the notation, we put t=W+1 without loss of generality.

Proposition 3

There exists a sequence of non-stationary MDP transition distributions p1,,pW such that

  • The diameter of (𝒮,𝒜,pn) is 1 for each n[W].

  • The total variations in state transition distributions is O(1).

Nevertheless, under some deterministic policy,

  1. 0.

    The empirical MDP (𝒮,𝒜,p^W+1) has diameter Θ(W)

  2. 0.

    Further, for every p~Hp,W+1(0), the MDP (𝒮,𝒜,p~) has diameter Ω(W/logW)

\@trivlist

The sequence p1,,pW alternates between the following 2 instances p1,p2. Now, define the common state space 𝒮={1,2} and action collection 𝒜={𝒜1,𝒜2}, where 𝒜1={a1,a2}, {𝒜2}={b1,b2}. We assume all the state transitions are deterministic, and a graphical illustration is presented in Fig. 2. Clearly, we see that both instances have diameter 1.

Figure 2: Example MDPs. Since the transitions are deterministic, the probabilities are omitted.

Now, consider the following two deterministic and stationary policies π1: π1(1)=a1,π1(2)=b2, and π2: π2(1)=a2,π2(2)=b1. Since the MDP is deterministic, we have p^W+1=p¯W+1.

In the following, we construct a trajectory where the DM alternates between policies π1,π2 during time {1,,W} while the underlying transition distribution alternates between p1,p2. In the construction, the DM is almost always at the self-loop at state 1 (or 2) throughout the horizon, no matter what action a1,a2 (or b1,b2) she takes. Consequently, it will trick the DM into thinking that p^W+1(1|1,ai)1 for each i{1,2}, and likewise p^W+1(2|2,bi)1 for each i{1,2}. Altogether, this will lead the DM to conclude that (𝒮,𝒜,p^W+1) constitute a high diameter MDP, since the probability of transiting from state 1 to 2 (and 2 to 1) are close to 0.

The construction is detailed as follows. Let W=4τ. In addition, let the state transition distributions be

p1==pτ=p1,pτ+1==p2τ=p2,p2τ+1==p3τ=p1,p3τ+1==p4τ=p2.

The DM starts at state 1. She follows policy π1 from time 1 to time 2τ, and then policy π2 from 2τ+1 to 4τ.

Under the specified MDP models and policies, it can be readily verified that the DM takes action a1 from time 1 to τ+1, action b2 from time τ+2 to 2τ, action b1 from time 2τ+1 to 3τ+1, and action a2 from time 3τ+2 to 4τ. As a result, the DM is at state 1 from time 1 to τ+1, state 2 from time τ+2 to 3τ+1, and eventually state 1 from time 3τ+2 to 4τ as depicted in Fig. 3.

Figure 3: Illustration of the latent MDPs, policies, and state visits.

We thus have:

p^W+1(1|1,a1)=ττ+1,p^W+1(2|1,a1)=1τ+1,p^W+1(1|1,a2)=1,p^W+1(2|1,a2)=0
p^W+1(2|2,b1)=ττ+1,p^W+1(1|2,b1)=1τ+1,p^W+1(2|2,b2)=1,p^W+1(1|2,b2)=0,

and It can be readily verified that the diameter of (𝒮,𝒜,p^W+1) is τ+1=Θ(W). Finally, for the confidence region Hp,W+1(0)={Hp,W+1(s,a;0)}s,a constructed without confidence widening, for any p~Hp,W+1(0) we have

p~(2|1,a1)=p~(1|2,b1)=O(logWτ+1),p~(2|1,a2)=p~(1|2,b2)=O(logWτ-1)

respectively. Since the stochastic confidence radii Θ(logWτ+1) and Θ(logWτ-1) dominate the sample mean 1τ+1 and 0. Therefore, for any p~Hp,W+1(0), the diameter of the MDP constructed by (𝒮,𝒜,p~) is at least Ω(WlogW).  \@endparenv

Remark 8

In Proposition 3, there are two reasons for the discrepancy between the individual MDPs p1,,pW and the MDPs in the un-widened confidence region Hp,W+1(0):

  • First, due to the bandit feedback, the samples used to construct p^W+1 come from different state-action pairs at different time. As a result, p^W+1 and p¯W+1 can be very different than each of the individual state transition probability distributions p1,,pW.

  • Second, the number of visits to each state-action pair is roughly W/4, which means we would have very “narrow” confidence regions (of the order O~(1/W)) if we follow standard optimistic exploration techniques based on concentration inequalities (i.e., the confidence regions shrink as the number of samples grows).

Critically, as shown in Proposition 2 (as well as Section 4 of (Jaksch et al. 2010)) as well as Lemma 3 and eqn. (10) of (Fruit et al. 2019)), the minimum diameter of the MDPs in the confidence regions play a key role in leading to low (dynamic) regret bounds. We thus believe the caveat in learning non-stationary MDPs via conventional optimistic exploration is fundamental in general. In the current paper, we leverage our novel confidence widening technique to prevent the confidence regions from becoming too narrow even if we have lots of samples.

Remark 9

Inspecting the prevalent OFU guided approach for stochastic MAB and RL in MDPs settings (Auer et al. 2002b, Abbasi-Yadkori et al. 2011, Jaksch et al. 2010, Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018), one usually concludes that a tighter design of confidence region can result in a lower (dynamic) regret bound. In (Abernethy et al. 2016), this insights has been formalized in stochastic K-armed bandit settings via a potential function type argument. Nevertheless, Proposition 3 (together with Theorem 1) demonstrates that using the tightest confidence region in learning algorithm design may not be enough to ensure low dynamic regret bound for RL in non-stationary MDPs.

As demonstrated in previous sections, running the proposed algorithms with the widened confidence regions can help the DM to attain provably low dynamic regret in general RL in non-stationary MDPs. Nevertheless, confidence widening is not always necessary if the state transition distributions bear a special structure. In particular, we consider the following assumption on the state transition distributions p1,,pT.

Assumption 2

There exists a positive quantity (not necessarily known to the DM) ζR+, such that for any pair of states s,sS, there is an action a(s,s)As that satisfies pt(s|s,a(s,s))ζ for all t[T].

We can now analyze the dynamic regret bound of the SWUCRL2-CW algorithm under Assumption 2. Here, we follow the notations introduced in Section id1 for consistency. In general, Assumption 2 ensures that for every time step t[T], there exists a state transition distribution pHp,t(0) such that the induced diameter of the MDP (𝒮,𝒜,p) is upper bounded by the constant D¯:=1/ζ with high probability.

Proposition 4

Under Assumption 2 and conditioned on the event Ep, there exists a state transition distribution p in the confidence region Hp,t(0), such that the induced diameter of the MDP (S,A,p) is at most D¯:=1/ζ for all t[T].

The proof of Proposition 4 is provided in Section id1 of the appendix. The proposition indicates that the DM can achieve a bounded dynamic regret by implementing the SWUCRL2-CW algorithm with η=0. To analyze its dynamic regret bound, we provide a variation of Proposition 2 as follows.

Proposition 5

Consider an episode m. Conditioning on events Er,Ep, then for every t{τ(m),,τ(m+1)-1} in episode m, we have

ρt*-rt(st,at) [s𝒮pt(s|st,at)γ~τ(m)(s)]-γ~τ(m)(st)
+1τ(m)+ [2𝗏𝖺𝗋r,t+4D¯𝗏𝖺𝗋p,t]+[2𝗋𝖺𝖽-r,τ(m)(st,at)+4D¯𝗋𝖺𝖽-p,τ(m)(s,a)].

The proof is similar to that of Proposition 2 with Dτ(m) replaced by D¯ and η set to 0, respectively. We are now ready to state the dynamic regret bound of the SWUCRL2-CW algorithm when Assumption 2 holds.

Theorem 3

Under Assumption 2 and assuming S>1, the SWUCRL2-CW algorithm with window size W, confidence widening parameter η=0, and δ=T-1 satisfies the dynamic regret bound

Dyn-RegT(SWUCRL2-CW)=O~(BrW+D¯[BpW+SATW+SATW+T])

If we further put W=W*=S2/3A1/2T2/3(Br+Bp+1)-2/3, this dynamic regret bound is O~(D¯(Br+Bp+1)1/3S2/3A1/2T2/3).

We omit the proof since it is similar to that of Theorem 1.

In this subsection, we first elaborate on Assumption 2 in the context of single non-perishable item inventory control problem with zero lead time, fixed cost, and lost sales similar to (Yuan et al. 2019), and then demonstrate how to implement the SWUCRL2-CW algorithm for this problem. For each time step t[T] of the inventory control problem (with some abuse of notations), the following sequence of events happens:

  1. 0.

    The seller first observes her stock level st, and decides the quantity at to order.

  2. 0.

    If at>0, a fixed cost f and a c per-unit ordering cost are incurred, and the order arrives instantaneously. The stock level then becomes st+at.

  3. 0.

    The demand Xt is realized, and the seller observes the censored demand Yt=min{Xt,st+at}. The DM faces non-stationary demands, in the sense that the demand distributions X1,,XT at time steps 1,T are independent but not identically distributed.

  4. 0.

    Unfulfilled demand incurs a l per-unit lost sales cost, while excess inventory leads to a h per-nit holding cost. The total cost for time step t is

    Ct(st,at)=f𝟏[at>0]+cat+l[Xt-st-at]++h[st+at-Xt]+. (13)

    Due to demand censoring, the cost is not observable.

The seller’s objective is to minimize the cumulative total cost t=1TCt(st,at). To map this into the non-stationary MDP model we described in Section id1, we represent the level of stock at the beginning of each time step as the state. Same as (Yuan et al. 2019) (and similar to (Huh and Rusmevichientong 2009, Zhang et al. 2018, Agrawal and Jia 2019)), we assume the DM has a limited shelf capacity, and she can hold at most S units of inventory at any time. Consequently, 𝒮={0,,S}, and 𝒜s={0,,S-s} for each s𝒮. We also define the reward and state transition distributions for all t[T], s,s𝒮, and a𝒜s as follows,

Rt(s,a)=-Ct(s,a)   and   pt(s|s,a)=Pr(s+a-min{s+a,Xt}=s).

However, it is worth emphasizing that, different than our setup in Section id1, Rt(s,a) is not observable as Ct(s,a) is not observable. Nevertheless, we shall demonstrate in Section id1 that one could use the technique of pseudo-reward proposed in (Agrawal and Jia 2019) to bypass this issue.

Following Assumption 2, we make the strictly positive probability mass function (PMF) assumption on X1,,XT.

Assumption 3 (Strictly Positive PMF)

There is a ζ>0 such that Pr(Xt=s)ζ>0 for all t[T] and s{0,,S}.

Remark 10

It can be readily verified that if the demands satisfy the strictly positive PMF assumption, the underlying inventory control problem satisfies Assumption 2. Indeed, the DM could transit from a state sS to another state sS with probability at least ζ by ordering S-s units of the item, since then pt(s|s,S-s)=Pr(Xt=S-s)ζ.

We first compare our setting and existing ones on single non-perishable item inventory control problem with lost sales.

Similar to (Huh and Rusmevichientong 2009, Zhang et al. 2018, Yuan et al. 2019, Agrawal and Jia 2019), the model presented in this section studies the single non-perishable item inventory control problem with lost sales. However, there are several key differences between ours and the existing works in terms of cost functions, demand distributions, and lead time:

Cost functions Demand distributions Lead time
Huh and Rusmevichientong (2009) linear purchasing cost without fixed cost, linear lost sales and holding cost stationary, continuous or discrete zero
Zhang et al. (2018), Agrawal and Jia (2019) no purchasing cost, linear lost sales and holding cost stationary, continuous or discrete positive
Yuan et al. (2019) linear purchasing cost with fixed cost, linear lost sales and holding cost stationary, continuous or discrete zeros
Ours linear purchasing cost with fixed cost, linear lost sales and holding cost non-stationary, discrete, with strictly positive PMF zero
Table 3: Comparisons between our inventory control model and existing works’
  • Cost Functions: In (Huh and Rusmevichientong 2009), the authors assume a linear purchasing cost function without fixed cost, linear lost sales and holding cost functions. In (Yuan et al. 2019), the authors additionally allow fixed cost. In (Zhang et al. 2018, Agrawal and Jia 2019), the authors assume the lost sales cost function and the holding cost function are linear, and there is no purchasing cost. In our setting, our cost function is the same as that of (Yuan et al. 2019).

  • Demand Distributions: In (Huh and Rusmevichientong 2009, Zhang et al. 2018, Yuan et al. 2019, Agrawal and Jia 2019), the authors assume stationary demand distributions, but they admit both continuous or discrete demand distributions. In contrast, we allow non-stationary demand distributions, but we impose that the demand distribution has to be discrete, and satisfies the strictly positive PMF assumption described above.

  • Lead Time: In (Zhang et al. 2018, Agrawal and Jia 2019), the authors allow the lead time to be positive; while in (Huh and Rusmevichientong 2009, Yuan et al. 2019) and our setting, we assume the lead time is zero.

A summary of the comparisons is provided in Table 3.

As pointed out in Section id1, different than the model we present in Section id1, the reward in each time step t is not directly observable due to the censored demand. Nevertheless, we can follow the pseudo-reward technique proposed in (Agrawal and Jia 2019) to implement the SWUCRL2-CW algorithm on a sequence of suitably designed pseudo-reward distributions.

In particular, we define the pseudo-reward following (Agrawal and Jia 2019) for each time step t[T], every state s, and every action a𝒜s as

Rtpseudo(s,a):=Rt(s,a)+lXt=-f𝟏[a>0]-cat-h[s+a-Yt]++lYt,

where we recall Yt=min{s+a,Xt} is the censored demand. We note that the pseudo-reward is perfectly observable. We also define the mean pseudo-reward or each time step t[T], every state s, and every action a𝒜s as

rtpseudo(s,a):=𝔼[Rtpseudo(s,a)]=𝔼[Rt(s,a)+lXt]=rt(s,a)+l𝔼[Xt]. (14)

This indicates regardless of state and action, the mean pseudo-reward of a time step t can be obtained from shifting the corresponding mean reward uniformly by l𝔼[Xt]. Without loss of generality, we assume for all t[T], s𝒮, and a𝒜s, the mean pseudo-reward is bounded, i.e., rtpseudo(s,a)[0,1], and the pseudo-reward Rtpseudo(s,a) is 1-sub-Gaussian with mean rtpseudo(s,a). Defining ρt*pseudo as the optimal long-term average reward of the stationary MDP with state transition distribution pt and mean reward rtpseudo={rtpseudo(s,a)}s𝒮,a𝒜s, we can show that for any policy Π, the dynamic regret of the non-stationary MDP instance specified by the tuple =(𝒮,𝒜,T,r,p) and the dynamic regret of the non-stationary MDP instance specified by the tuple pseudo=(𝒮,𝒜,T,rpseudo={rtpseudo}t=1T,p) are the same.

Proposition 6

For any policy Π, we denote the sample path for following Π on M as {st(M),at(M)}t=1T, and the sample path for following Π on M𝑝𝑠𝑒𝑢𝑑𝑜 as {st(M𝑝𝑠𝑒𝑢𝑑𝑜),at(M𝑝𝑠𝑒𝑢𝑑𝑜)}t=1T, we have

t=1T{ρt*-𝔼[rt(st(),at())]}=t=1T{ρt*𝑝𝑠𝑒𝑢𝑑𝑜-𝔼[rt𝑝𝑠𝑒𝑢𝑑𝑜(st(𝑝𝑠𝑒𝑢𝑑𝑜),at(𝑝𝑠𝑒𝑢𝑑𝑜))]}.

The proof of Proposition 6 is provided in Section id1 in the appendix. Together with Theorem 3, we have the following dynamic regret bound guarantee for the SWUCRL2-CW algorithm on the the single non-perishable item inventory control problem with zero lead time, fixed cost, and lost sales.

Theorem 4

For the inventory control model in Section id1, under Assumption 3 and assuming S>1, the SWUCRL2-CW algorithm with window size W, confidence widening parameter η=0, and δ=T-1 satisfies the dynamic regret bound

Dyn-RegT(SWUCRL2-CW)=O~(BrW+D¯[BpW+S32TW+S2TW+T])

If we further put W=W*=ST2/3(Br+Bp+1)-2/3, this dynamic regret bound is O~(D¯(Br+Bp+1)1/3ST2/3).

Remark 11

To interpret the dynamic regret bound of the SWUCRL2-CW algorithm in the context of inventory control, we note that in Theorem 4, we normalize the cost functions so that the cost incurs in each time period is in [0,1]. This is slightly different than the setups in (Huh and Rusmevichientong 2009, Zhang et al. 2018, Yuan et al. 2019, Agrawal and Jia 2019), where the upper bound of the cost functions are of order O(S).

As a complement to our theoretical results, we conduct numerical experiments on synthetic datasets to compare the dynamic regret performances of our algorithms with the UCRL2 algorithm (Jaksch et al. 2010), which is one of the most widely used benchmarks for RL in MDPs due to its nearly-optimal regret bound in stationary environments (Wei et al. 2019), and also the restarting UCRL2 (denoted as UCRL2.S) algorithm for RL in piecewise-stationary MDPs (Jaksch et al. 2010)

Setup: We consider a MDP with 2 states {s1,s2} and 2 actions {a1,a2}, and set T=5000. The rewards are deterministically set to

rt(s1,a1)=0.2+3cos(5Vrπt/T),rt(s1,a2)=0.2+cos(5Vrπt/T),
rt(s2,a1)=0.2-cos(5Vrπt/T),rt(s2,a2)=0.2-3cos(5Vrπt/T).

The total variations in mean rewards is thus Br=15Vr=Θ(Vr). An illustration of the reward process of state s2 and action a2 is provided in Fig. 4 (the mean rewards of other (state,action) pairs are similar).

Figure 4: Illustrations of mean rewards rt(s2,a2) (the mean rewards of other state-action pairs are similar)

The state transition distributions are set to

pt(s1|s1,a1)=1,pt(s2|s1,a1)=0,pt(s1|s1,a2)=1-βt,pt(s1|s1,a2)=βt,
pt(s1|s2,a1)=0,pt(s2|s2,a1)=1,pt(s1|s2,a2)=βt,pt(s1|s2,a2)=1-βt.

where βt is governed by the following process:

βt=0.5+0.3sin(5Vpπt/T).

The total variations in the state transition distributions is thus Bp=12Vp=Θ(Vp). In this simulation, we allow both Vr and Vp to take values from {T0.2,T0.5} to evaluate the performances of the algorithms in low and high variations scenarios. Here, we assume the SWUCRL2-CW algorithm knows the variation budgets, and the UCRL2.S algorithm restarts the UCRL2 algorithm every T2/3 time steps. All the results are averaged over 50 runs.

(a) Bp=Θ(T0.2),Br=Θ(T0.2)
(b) Bp=Θ(T0.2),Br=Θ(T0.5)
(c) Bp=Θ(T0.5),Br=Θ(T0.2)
(d) Bp=Θ(T0.5),Br=Θ(T0.5)
Figure 5: Cumulative rewards of the algorithms

Results: The cumulative rewards of the algorithms under various variation budgets are shown in Fig. 5. The results show that both the SWUCRL2-CW algorithm and the BORL algorithm are able to collect at least 20% more rewards then the UCRL2 algorithm and the UCRL2.S algorithm except for the case when Bp=Θ(T0.5) and Br=Θ(T0.2), the percentage improvement is 12%. Comparing the results in Figs. 5(a), 5(b), and 5(c), we can see that both the SWUCRL2-CW algorithm and the BORL algorithm are more robust to variations in the state transition distributions than that in reward distributions. This demonstrate the power of our confidence widening technique. Interestingly, we can see that in Figs. 5(a), 5(b), and 5(c), the cumulative rewards of the BORL algorithm (does not know the variation budgets) are higher than those of the SWUCRL2-CW algorithm (knows the variation budgets). This indeed has no contradiction to our theoretical results. Theorems 1 and 2 state that the SWUCRL2-CW algorithm and the BORL algorithm enjoy the same (in the sense of O~()) worst case dynamic regret bound. Nevertheless, the environments we construct in Fig. 4 are not the worst case scenario, and the results indicate that the adaptive master algorithm (i.e., the EXP3.P algorithm) of the BORL algorithm is able to leverage this more benign environment to attain higher rewards.

In this paper, we study the problem of un-discounted reinforcement learning in a gradually changing environment. In this setting, the parameters, i.e., the reward and state transition distributions, can be different from time to time as long as the total changes are bounded by some variation budgets, respectively. We first incorporate the sliding window estimator and the novel confidence widening technique into the UCRL2 algorithm to propose a SWUCRL2-CW algorithm with low dynamic regret when the variation budgets are known. We then design a parameter-free BORL algorithm that allows us to enjoy the same dynamic regret bound as the SWUCRL2-CW algorithm without knowing the variation budgets. The main ingredient of the proposed algorithms is the novel confidence widening technique, which injects extra optimism into the design of learning algorithms, and thus ensure low dynamic regret bounds. This is in contrast to the widely held believe that optimistic exploration algorithms for (stationary and non-stationary) stochastic online learning settings should employ the lowest possible level of optimism. To extend this finding, we also use the problem of single-item inventory control with fixed cost as an example to demonstrate how one can leverage special structures in the state transition distributions to attain low dynamic regret bound without widening the confidence region.

Acknowledgments

The authors would like to express sincere gratitude to Dylan Foster, Negin Golrezaei, and Mengdi Wang, as well as various seminar attendees for helpful discussions and comments.

References

  • Abbasi-Yadkori et al. (2013) Abbasi-Yadkori, Yasin, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvári. 2013. Online learning in markov decision processes with adversarially chosen transition probability distributions. Advances in Neural Information Processing Systems 26 (NIPS).
  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Yasin, David Pál, Csaba. Szepesvári. 2011. Improved algorithms for linear stochastic bandits. Advances Neural Information Processing Systems 25 (NIPS).
  • Abernethy et al. (2016) Abernethy, Jacob, Kareem Amin, Ruihao Zhu. 2016. Threshold bandits, with and without censored feedback. Advances in Neural Information Processing Systems 29 (NIPS).
  • Agrawal and Jia (2017) Agrawal, Shipra, Randy Jia. 2017. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. Advances in Neural Information Processing Systems 30 (NIPS). Curran Associates, Inc., 1184–1194.
  • Agrawal and Jia (2019) Agrawal, Shipra, Randy Jia. 2019. Learning in structured mdps with convex cost functions: Improved regret bounds for inventory management. Proceedings of the ACM Conference on Economics and Computation (EC).
  • Arora et al. (2012) Arora, Raman, Ofer Dekel, Ambuj Tewari. 2012. Deterministic mdps with adversarial rewards and bandit feedback. Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence. 93–101.
  • Auer et al. (2002a) Auer, P., N. Cesa-Bianchi, Y. Freund, R. Schapire. 2002a. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002, Vol. 32, No. 1 : pp. 48–77.
  • Auer et al. (2002b) Auer, Peter, Nicolo Cesa-Bianchi, Paul Fischer. 2002b. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47, 235–256.
  • Balseiro and Gur (2019) Balseiro, Santiago, Yonatan Gur. 2019. Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science.
  • Bartlett and Tewari (2009) Bartlett, Peter L., Ambuj Tewari. 2009. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating mdps. UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009. 35–42.
  • Bertsekas (2017) Bertsekas, Dimitri. 2017. Dynamic Programming and Optimal Control. Athena Scientific.
  • Besbes et al. (2014) Besbes, Omar, Yonatan Gur, Assaf Zeevi. 2014. Stochastic multi-armed bandit with non-stationary rewards. Advances in Neural Information Processing Systems 27 (NIPS).
  • Bubeck and Cesa-Bianchi (2012) Bubeck, S., N. Cesa-Bianchi. 2012. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012, Vol. 5, No. 1: pp. 1–122.
  • Burnetas and Katehakis (1997) Burnetas, Apostolos N., Michael N. Katehakis. 1997. Optimal adaptive policies for markov decision processes. Mathematics of Operations Research, vol. 22. 222–255.
  • Cai et al. (2017) Cai, Han, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM).
  • Cardoso et al. (2019) Cardoso, Adrian Rivera, He Wang, Huan Xu. 2019. Large scale markov decision processes with changing rewards. Advances in Neural Information Processing Systems 32 (NeurIPS).
  • Chen et al. (2019a) Chen, Weidong, Cong Shi, Izak Duenyas. 2019a. Optimal learning algorithms for stochastic inventory systems with random capacities. SSRN Preprint 3287560.
  • Chen et al. (2019b) Chen, Yifang, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei. 2019b. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. Proceedings of Conference on Learning Theory (COLT).
  • Cheung et al. (2019a) Cheung, Wang Chi, David Simchi-Levi, Ruihao Zhu. 2019a. Hedging the drift: Learning to optimize under non-stationarity. arXiv:1903.01461. URL https://arxiv.org/abs/1903.01461.
  • Cheung et al. (2019b) Cheung, Wang Chi, David Simchi-Levi, Ruihao Zhu. 2019b. Learning to optimize under non-stationarity. Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).
  • Cormen et al. (2009) Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. 2009. Introduction to algorithms. MIT Press.
  • Dick et al. (2014) Dick, Travis, András György, Csaba Szepesvári. 2014. Online learning in markov decision processes with changing cost sequences. Proceedings of the International Conference on Machine Learning (ICML).
  • Even-Dar et al. (2005) Even-Dar, Eyal, Sham M Kakade, , Yishay Mansour. 2005. Experts in a markov decision process. Advances in Neural Information Processing Systems 18 (NIPS).
  • Flajolet and Jaillet (2017) Flajolet, Arthur, Patrick Jaillet. 2017. Real-time bidding with side information. Advances in Neural Information Processing Systems 30 (NeurIPS).
  • Fruit et al. (2018a) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric. 2018a. Near optimal exploration-exploitation in non-communicating markov decision processes. Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2998–3008.
  • Fruit et al. (2019) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric. 2019. Improved analysis of ucrl2b. https://rlgammazero.github.io/.
  • Fruit et al. (2018b) Fruit, Ronan, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner. 2018b. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80. PMLR, 1578–1586.
  • Gajane et al. (2018) Gajane, Pratik, Ronald Ortner, Peter Auer. 2018. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions.
  • Garivier and Moulines (2011) Garivier, Aurélien, Eric Moulines. 2011. On upper-confidence bound policies for switching bandit problems. Algorithmic Learning Theory. Springer Berlin Heidelberg, 174–188.
  • Google (2011) Google. 2011. The arrival of real-time bidding. URL https://www.rtbchina.com/wp-content/uploads/2012/03/Google-White-Paper-The-Arrival-of-Real-Time-Bidding-July-2011.pdf.
  • Guo et al. (2019) Guo, Xin, Anran Hu, Renyuan Xu, Junzi Zhang. 2019. Learning mean-field games. Advances in Neural Information Processing Systems 32 (NeurIPS).
  • Hoeffding (1963) Hoeffding, Wassily. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, vol. 58. Taylor & Francis Group, 13–30.
  • Huh and Rusmevichientong (2009) Huh, Woonghee Tim, Paat Rusmevichientong. 2009. A nonparametric asymptotic analysis of inventory planning with censored demand. Mathematics of Operations Research.
  • Jaksch et al. (2010) Jaksch, Thomas, Ronald Ortner, Peter Auer. 2010. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., vol. 11. JMLR.org, 1563–1600.
  • Jin et al. (2019) Jin, Chi, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu. 2019. Learning adversarial markov decision processes with bandit feedback and unknown transition.
  • Karnin and Anava (2016) Karnin, Z., O. Anava. 2016. Multi-armed bandits: Competing with optimal sequences. Advances in Neural Information Processing Systems 29 (NIPS).
  • Keskin and Zeevi (2016) Keskin, N., A. Zeevi. 2016. Chasing demand: Learning and earning in a changing environments. Mathematics of Operations Research, 2016, 42(2), 277–307.
  • Kim and Lim (2016) Kim, Michael Jong, Andrew E.B. Lim. 2016. Robust multiarmed bandit problems. Management Science.
  • Kiss et al. (2017) Kiss, Istvan Z., Joel C. Miller, Peter L. Simon. 2017. Mathematics of epidemics on networks. Springer.
  • Kveton et al. (2015) Kveton, Branislav, Zheng Wen, Azin Ashkan, Csaba Szepesvári. 2015. Tight regret bounds for stochastic combinatorial semi-bandits. AISTATS.
  • Lattimore and Szepesvári (2018) Lattimore, T., C. Szepesvári. 2018. Bandit Algorithms. Cambridge University Press.
  • Li et al. (2019) Li, Yingying, Aoxiao Zhong, Guannan Qu, Na Li. 2019. Online markov decision processes with time-varying transition probabilities and rewards. ICML workshop on Real-world Sequential Decision Making.
  • Luo et al. (2018) Luo, H., C. Wei, A. Agarwal, J. Langford. 2018. Efficient contextual bandits in non-stationary worlds. Proceedings of Conference on Learning Theory (COLT).
  • Ma (2018) Ma, Will. 2018. Improvements and generalizations of stochastic knapsack and markovian bandits approximation algorithms. Mathematics of Operations Research.
  • Manheim (2020) Manheim. 2020. Online. URL https://www.manheim.com/. [Last accessed February 10, 2020].
  • Neu et al. (2010) Neu, Gergely, Andras Antos, András György, Csaba Szepesvári. 2010. Online markov decision processes under bandit feedback. Advances in Neural Information Processing Systems 23 (NIPS). 1804–1812.
  • Neu et al. (2012) Neu, Gergely, Andras Gyorgy, Csaba Szepesvari. 2012. The adversarial stochastic shortest path problem with unknown transition probabilities. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, vol. 22. PMLR, 805–813.
  • Nowzari et al. (2016) Nowzari, Cameron, Victor M. Preciado, George J. Pappas. 2016. Analysis and control of epidemics: A survey of spreading processes on complex networks. IEEE Control Systems Magazine.
  • Ortner et al. (2019) Ortner, Ronald, Pratik Gajane, Peter Auer. 2019. Variational regret bounds for reinforcement learning. Proceedings of Conference on Uncertainty in Artificial Intelligence (UAI).
  • Puterman (1994) Puterman, Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st ed. John Wiley & Sons, Inc., New York, NY, USA.
  • Qin et al. (2019) Qin, Zhiwei (Tony), Jian Tang, Jieping Ye. 2019. Deep reinforcement learning with applications in transportation. Tutorial of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19).
  • Rosenberg and Mansour (2019) Rosenberg, Aviv, Yishay Mansour. 2019. Online convex optimization in adversarial Markov decision processes. Proceedings of the 36th International Conference on Machine Learning, vol. 97. PMLR, 5478–5486.
  • Shortreed et al. (2010) Shortreed, Susan, Eric Laber, Daniel Lizotte, Scott Stroup, Joelle Pineau Susan Murphy. 2010. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning.
  • Sidford et al. (2018a) Sidford, Aaron, Mengdi Wang, Xian Wu, Lin F. Yang, Yinyu Ye. 2018a. Near-optimal time and sample complexities for solving discounted markov decision process with a generative model. Advances in Neural Information Processing Systems 31 (NeurIPS).
  • Sidford et al. (2018b) Sidford, Aaron, Mengdi Wang, Xian Wu, Yinyu Ye. 2018b. Variance reduced value iteration and faster algorithms for solving markov decision processes. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).
  • Sutton and Barto (2018) Sutton, Richard S., Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. A Bradford Book.
  • Tewari and Bartlett (2008) Tewari, Ambuj, Peter L. Bartlett. 2008. Optimistic linear programming gives logarithmic regret for irreducible mdps. Advances in Neural Information Processing Systems 20 (NIPS). 1505–1512.
  • Vehicle Remarketing (2020) Vehicle Remarketing. 2020. Online. URL https://www.vehicleremarket.com/. [Last accessed February 10, 2020].
  • Wang (2019) Wang, Mengdi. 2019. Randomized linear programming solves the markov decision problem in nearly-linear (sometimes sublinear) running time. Mathematics of Operations Research.
  • Wei et al. (2019) Wei, Chen-Yu, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain. 2019. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. arXiv:1910.07072.
  • Xu and Yun (2020) Xu, Kuang, Se-Young Yun. 2020. Reinforcement with fading memories. Mathematics of Operations Research.
  • Yu and Mannor (2009) Yu, Jia Yuan, Shie Mannor. 2009. Online learning in markov decision processes with arbitrarily changing rewards and transitions. Proceedings of the International Conference on Game Theory for Networks.
  • Yu et al. (2009) Yu, Jia Yuan, Shie Mannor, Nahum Shimkin. 2009. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, vol. 34. 737–757.
  • Yuan et al. (2019) Yuan, Hao, Qi Luo, Cong Shi. 2019. Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. SSRN Preprint 3329611.
  • Zhang and Wang (2018) Zhang, Anru, Mengdi Wang. 2018. Spectral state compression of markov processes. arXiv:1802.02920.
  • Zhang et al. (2018) Zhang, Huanan, Xiuli Chao, Cong Shi. 2018. Closing the gap: A learning algorithm for the lost-sales inventory system with lead times. Management Science.
  • Zhang and Ji (2019) Zhang, Zihan, Xiangyang Ji. 2019. Regret minimization for reinforcement learning by evaluating the optimal bias function. Advances in Neural Information Processing Systems 32 (NeurIPS).
  • Zhou et al. (2020) Zhou, Xiang, Ningyuan Chen, Xuefeng Gao, Yi Xiong. 2020. Regime switching bandits. arXiv:2001.09390.
  • Zhou and Bambos (2015) Zhou, Zhengyuan, Nicholas Bambos. 2015. Wireless communications games in fixed and random environments. IEEE Conference on Decision and Control (CDC).
  • Zhou et al. (2016) Zhou, Zhengyuan, Peter Glynn, Nicholas Bambos. 2016. Repeated games for power control in wireless communications: Equilibrium and regret. IEEE Conference on Decision and Control (CDC).

Appendix. Supplementary

The optimal long term reward ρt* is equal to the optimal value of the linear program 𝖯(rt,pt) (Puterman 1994). For a reward vector r and a transition distribution p, we define

𝖯(r,p):max s𝒮,a𝒜sr(s,a)x(s,a) (15)
s.t. a𝒜sx(s,a)=s𝒮,a𝒜sp(s|s,a)x(s,a) s𝒮
s𝒮,a𝒜sx(s,a)=1
x(s,a)0 s𝒮,a𝒜s

Throughout our analysis, it is useful to consider the following dual formulation 𝖣(r,p) of the optimization problem 𝖯(r,p):

𝖣(r,p):min ρ (16)
s.t. ρ+γ(s)r(s,a)+s𝒮p(s|s,a)γ(s) s𝒮,a𝒜s
ϕ,γ(s) free s𝒮.

The following Lemma shows that any feasible solution to 𝖣(r,p) is essentially bounded if the underlying MDP is communicating, which will be crucial in the subsequent analysis.

Lemma 2

Let (ρ,γ) be a feasible solution to the dual problem 𝖣(r,p), where (S,A,p) consititute a communicating MDP with diameter D. We have

maxs,s𝒮{γ(s)-γ(s)}2D.

The Lemma is extracted from Section 4.3.1 of (Jaksch et al. 2010), and it is more general than (Lattimore and Szepesvári 2018), which requires (ρ,γ) to be optimal instead of just feasible.

We begin with invoking Lemma 2, which guarantees that for each t there is an optimal solution (ρt*,γt*) of 𝖣(rt,pt) that satisfies 0γt*(s)2Dmax for all s𝒮. Recall for each t:

Br,t=maxs𝒮,a𝒜s|rt+1(s,a)-rt(s,a)|,Bp,t=maxs𝒮,a𝒜spt+1(|s,a)-pt(|s,a)1. (17)

Consider two time indexes tτ. We first claim the following two inequalities:

ρτ* ρt*-q=tτ-1(Br,q+2DmaxBp,q) (18)
ρt* rτ(sτ,aτ)+[s𝒮pτ(s|sτ,aτ)γt*(s)-γt*(sτ)]-q=tτ-1(Br,q+2DmaxBp,q). (19)

The proofs of inequalities (18, 19) are deferred to the end. Now, combining (18, 19) gives

ρτ*rτ(sτ,aτ)+[s𝒮pτ(s|sτ,aτ)γt*(s)-γt*(sτ)]-2q=tτ-1(Br,q+2DmaxBp,q). (20)

Let positive integer WT be a window size, which is specified later. Summing (20) over τ=t,,t+W-1 and taking expectation over {(sτ,aτ)}τ=tt+W-1 yield

τ=tt-W+1ρτ*𝔼[τ=tt-W+1rτ(sτ,aτ)]+𝔼[τ=tt-Wpτ(s|sτ,aτ)γt*(s)-γt*(sτ+1)] (21)
+ 𝔼[s𝒮pt-W+1(s|st-W+1,at-W+1)γt*(s)-γt*(st)]-2τ=tt-W+1q=tτ-1(Br,q+2DmaxBp,q) (22)
𝔼[τ=tt-W+1rτ(sτ,aτ)]-2Dmax-2Wq=tt+W-1(Br,q+2DmaxBp,q). (23)

To arrive at (23), note that the second expectation in (21), which is a telescoping sum, is equal to 0, since sτ+1 is distributed as p(|sτ,aτ). In addition, we trivially lower bound the first expectation in (22) by -2Dmax by applying Lemma 2. Next, consider partitioning the horizon of T steps into intervals of W time steps, where last interval could have less than W time steps. That is, the first interval is {1,,W}, the second is {W+1,,2W}, and so on. Applying the bound (23) on each interval and summing the resulting bounds together give

t=1Tρt* 𝔼[t=1Trt(st,at)]-2TWDmax-2Wt=1T(Br,t+2DmaxBp,t)
𝔼[t=1Trt(st,at)]-4TDmaxW-2W(Br+2DmaxBp). (24)

Choosing W to be any integer in [T/(Br+2DmaxBp),2T/(Br+2DmaxBp)] yields the desired inequality in the Theorem. Finally, we go back to proving inequalities (18,19). These inequalities are clearly true when t=τ, so we focus on the case t<τ.

Proving inequality (18). It suffices to show that the solution (ρτ*+q=tτ-1(Br,q+2DmaxBp,q),γτ*) is feasible to the linear program D(rt,pt). To see the feasibility, it suffices to check the constraint of D(rt,pt) for each state-action pair s,a:

ρτ*+q=tτ-1(Br,q+2DmaxBp,q)
[rτ(s,a)+q=tτ-1Br,q]+[-γτ*(s)+s𝒮pτ(s|s,a)γτ*(s)+q=tτ-12DmaxBp,q].

The feasibility is proved by noting that

|rτ(s,a)-rt(s,a)| q=tτ-1Br,q, (25)
|s𝒮pτ(s|s,a)γτ*(s)-s𝒮pt(s|s,a)γτ*(s)| pτ(|s,a)-pt(|s,a)1γτ*
q=tτ-1Bp,q(2Dmax). (26)

Proving inequality (19). We have

ρt* rt(sτ,aτ)+s𝒮pt(s|sτ,aτ)γt*(s)-γt*(sτ)
rτ(sτ,aτ)+s𝒮pt(s|sτ,aτ)γt*(s)-γt*(sτ)-s=tτ-1Br,s (27)
rτ(sτ,aτ)+s𝒮pτ(s|sτ,aτ)γt*(s)-γt*(sτ)-s=tτ-1Br,s-2Dmaxs=tτ-1Bp,s, (28)

where steps (27, 28) are by inequalities (25, 26). Altogether, the Proposition is proved.

{algorithm}

[!ht] EVI(Hr,Hp;ϵ), mostly extracted from (Jaksch et al. 2010) \[email protected]@algorithmic[1] \StateInitialize VI record u0𝒮 as u0(s)=0 for all s𝒮. \Fori=0,1, \StateFor each s𝒮, compute VI record ui+1(s)=maxa𝒜sΥ~i(s,a), where

Υ~i(s,a)=maxr˙(s,a)Hr(s,a){r˙(s,a)}+maxp˙Hp(s,a){s𝒮ui(s)p˙(s)}.
\State

Define stationary policy π~:𝒮𝒜s as π~(s)=argmaxa𝒜sΥ~i(s,a). \StateDefine optimistic reward r~={r~(s,a)}s,a with r~(s,a)argmaxr˙(s,a)Hr(s,a){r˙(s,a)}. \StateDefine optimistic distribution p~={p~(|s,a)}s,a with p~(|s,a)argmaxp˙Hp(s,a){s𝒮ui(s)p˙(s)}. \StateDefine optimistic dual variables ρ~=maxs𝒮{ui+1(s)-ui(s)}, γ~(s)=ui(s)-mins𝒮ui(s). \Ifmaxs𝒮{ui+1(s)-ui(s)}-mins𝒮{ui+1(s)-ui(s)}ϵ \StateBreak the for loop. \EndIf\EndFor\StateReturn policy π~. \StateAuxiliary output: optimistic reward and state transition distributions (r~,p~), optimistic dual variables (ρ~,γ~). We provide the pseudo-codes of EVI(Hr,Hp;ϵ) proposed by (Jaksch et al. 2010) in Algorithm id1. By (Jaksch et al. 2010), the algorithm converges in finite time when the confidence region Hp contains a transition distribution p such that (𝒮,𝒜,p) constitutes a communicating MDP. The output (π~,r~,p~,ρ~,γ~) of the EVI(Hr,Hp;ϵ) satisfies the following two properties (Jaksch et al. 2010).

Property 1

The dual variables (ρ~,γ~) are optimistic, i.e.,

ρ~+γ~(s)maxr˙(s,a)Hr(s,a){r˙(s,a)}+s𝒮γ~(s)maxp˙Hp(s,a){p˙(s|s,a)}.
Property 2

For each state sS, we have

r~(s,π~(s))ρ~+γ~(s)-s𝒮p~(s|s,π~(s))γ~(s)-ϵ.

Property 1 ensures the feasibility of the output dual variables (ρ~,γ~), with respect to the dual program 𝖣(r˙,p˙) for any r˙,p˙ in the confidence regions Hr,Hp. The feasibility facilitates the bounding of maxs𝒮γ~(s), which turns out to be useful for bounding the regret arise from switching among different stationary policies. To illustrate, suppose that Hp is so large that it contains a transition distribution p˙ under which (𝒮,𝒜,p˙) has diameter D. By Lemma 2, we have 0maxs𝒮γ~(s)2D.

Property 2 ensures the near-optimality of the dual variables (ρ~,γ~) to the (r~,p~) optimistically chosen from Hr,Hp. More precisely, the deterministic policy π~ near-optimal for the MDP with time homogeneous reward function r~ and time homogeneous transition distribution p~, under which the policy π~ achieves a long term average reward is at least ρ~*-ϵ.

We employ the self-normalizing concentration inequallity (Abbasi-Yadkori et al. 2011). The following inequality is extracted from Theorem 1 in (Abbasi-Yadkori et al. 2011), restricted to the case when d=1.

Proposition 7 ((Abbasi-Yadkori et al. 2011))

Let {Fq}q=1T be a filtration. Let {ξq}q=1T be a real-valued stochastic process, such that ξq is Fq-measurable, and ξq is conditionally R-sub-Gaussian, i.e. for all λ0, it holds that E[exp(λξq)|Fq-1]exp(λ2R2/2). Let {Yq}q=1T be a non-negative real-valued stochastic process such that Yq is Fq-1-measurable. For any δ(0,1), it holds that

Pr(q=1tξqYqmax{1,q=1tYq2}2Rlog(T/δ)max{1,q=1tYq2} for all t[T])1-δ.

In particular, if {Yq}q=1T be a {0,1}-valued stochastic process, then for any δ(0,1), it holds that

Pr(q=1tξqYqmax{1,q=1tYq}2Rlog(T/δ)max{1,q=1tYq} for all t[T])1-δ. (29)

The Lemma is proved by applying Proposition 7 with suiatable choices of q=1T,{ξq}q=1T,{Yq}q=1T,δ. We divide the proof into two parts.

It suffices to prove that, for any fixed s𝒮,a𝒜s,t[T], it holds that

Pr(|r^(s,a)-r¯t(s,a)|𝗋𝖺𝖽-r,t(s,a))
= Pr(|1Nt+(s,a)q=(τ(m)-W)1t-1[Rq(s,a)-rq(s,a)]𝟏(sq=s,aq=a)|2log(2SAT2/δ)Nt+(s,a))
1-δ2SAT. (30)

since then Pr[r]1-δ/2 follows from the union bound over all s𝒮,a𝒜s,t[T]. Now, the trajectory of the online algorithm is expressed as {sq,aq,Rq}q=1T. Inequality (30) directly follows from Proposition 7, with {q}q=1T,{ξq}q=1T,{Yq}q=1T,δ defined as

q ={(s,a,R)}=1q{(sq+1,aq+1)},
ξq =Rq(s,a)-rq(s,a),
Yq =𝟏(sq=s,aq=a,((t-W)1)qt-1),
δ =δ2SAT.

Each ξq is conditionally 2-sub-Gaussian, since -1ξq1 with certainty. Altogether, the required inequality is shown.

We start by noting that, for two probability distributions p,{p(s)}s𝒮,p={p(s)}s𝒮 on 𝒮, it holds that

p-p1=maxθ{-1,1}𝒮θ(s)(p(s)-p(s)).

Consequently, to show Pr[p]1-δ/2, it suffices to show that, for any fixed s𝒮,a𝒜s,t[T],θ{-1,1}𝒮, it holds that

Pr(s𝒮θ(s)(p^t(s|s,a)-p¯t(s|s,a))𝗋𝖺𝖽-p,t(s,a))
Pr(1Nt+(s,a)q=(τ(m)-W)1t-1[s𝒮θ(s)𝟏(sq=s,aq=a,sq+1=s)]
  -[s𝒮θ(s)pq(s|s,a)𝟏(sq=s,aq=a)]2log(2SAT22S/δ)Nt+(s,a))
1-δ2SAT2S, (31)

since then the required inequality follows from a union bound over all s𝒮,a𝒜s,t[T],θ{-1.1}𝒮. Similar to the casea of r, (31) follows from Proposition 7, with {q}q=1T,{ξq}q=1T,{Yq}q=1T,δ defined as

q ={(s,a)}=1q+1,
ξq =[s𝒮θ(s)𝟏(sq=s,aq=a,sq+1=s)]-[s𝒮θ(s)pq(s|s,a)],
Yq =𝟏(sq=s,aq=a,((t-W)1)qt-1),
δ =δ2SAT2S.

Each ξq is conditionally 2-sub-Gaussian, since -1ξq1 with certainty. Altogether, the required inequality is shown.

In this section, we prove Proposition 2. Throughout the section, we impose the assumptions stated by the Proposition. That is, the events r,p hold, and there exists p with (1) pHp,τ(m)(η), (2) (𝒮,𝒜,p) has diameter at most D. We begin by recalling the following notations:

Br,t=maxs𝒮,a𝒜s|rt+1(s,a)-rt(s,a)| ,Bp,t=maxs𝒮,a𝒜spt+1(|s,a)-pt(|s,a)1,
𝗏𝖺𝗋r,t=q=τ(m)-Wt-1Br,q ,𝗏𝖺𝗋p,t=q=τ(m)-Wt-1Bp,q.

We then need the following auxiliary lemmas

Lemma 3

Let t be in episode m. For every state-action pair (s,a), we have

|rt(s,a)-r¯τ(m)(s,a)|𝗏𝖺𝗋r,t,pt(|s,a)-p¯τ(m)(|s,a)1𝗏𝖺𝗋p,t
Lemma 4

Let t be in episode m. We have

ρ~τ(m)ρt*-𝗏𝖺𝗋r,t-2D𝗏𝖺𝗋p,t.
Lemma 5

Let t be in episode m. For every state-action pair (s,a), we have

|s𝒮p~τ(m)(s|s,a)γ~τ(m)(s)-s𝒮pt(s|s,a)γ~τ(m)(s)|2D[𝗏𝖺𝗋p,t+2𝗋𝖺𝖽-p,τ(m)(s,a)+η].

Lemmas 3, 4, 5 are proved in Sections id1, id1, and id1, respectively.

We first provide the bound for rewards:

|rt(s,a)-r¯τ(m)(s,a)| |rt(s,a)-rτ(m)(s,a)|+|rτ(m)(s,a)-r¯τ(m)(s,a)|
q=τ(m)t-1|rq+1(s,a)-rq(s,a)|+1Ww=1W|rτ(m)(s,a)-rτ(m)-w(s,a)|.

By the definition of Br,q, we have

q=τ(m)t-1|rq+1(s,a)-rq(s,a)|q=τ(m)t-1Br,q,

and

1Ww=1W|rτ(m)(s,a)-rτ(m)-w(s,a)| 1Ww=1Wi=1w|rτ(m)-i+1(s,a)-rτ(m)-i(s,a)|
1Ww=1Wi=1W|rτ(m)-i+1(s,a)-rτ(m)-i(s,a)|
=i=1W|rτ(m)-i+1(s,a)-rτ(m)-i(s,a)|i=1WBr,τ(m)-i.

Next, we provide a similar analysis on the transition distribution.

pt(s,a)-p¯τ(m)(s,a)1 pt(s,a)-pτ(m)(s,a)1+pτ(m)(s,a)-p¯τ(m)(s,a)1
q=τ(m)t-1pq+1(s,a)-pq(s,a)1+1Ww=1Wpτ(m)(s,a)-pτ(m)-w(s,a)1.

By the definition of Bp,q, we have

q=τ(m)t-1pq+1(s,a)-pq(s,a)1q=τ(m)t-1Bp,q,

and

1Ww=1Wpτ(m)(s,a)-pτ(m)-w(s,a)1 1Ww=1Wi=1wpτ(m)-i+1(s,a)-pτ(m)-i(s,a)1
1Ww=1Wi=1Wpτ(m)-i+1(s,a)-pτ(m)-i(s,a)1
=i=1Wpτ(m)-i+1(s,a)-pτ(m)-i(s,a)1i=1WBp,τ(m)-i.

Altogether, the lemma is shown.

We first demonstrate two immediate consequences about the dual solution (ρ~τ(m),γ~τ(m)) by the Proposition’s assumptions:

0γ~τ(m)(s)2D for all s𝒮, (32)
ρ~τ(m)+γ~τ(m)(s)r¯τ(m)(s,a)+s𝒮γ~τ(m)(s)p¯τ(m)(s|s,a) for all s𝒮,a𝒜s. (33)

To see inequality (32), first observe that

ρ~τ(m)+γ~τ(m)(s) maxr˙(s,a)Hr,τ(m)(s,a){r˙(s,a)}+s𝒮γ~τ(m)(s)maxp˙Hp,τ(m)(s,a;η){p˙(s|s,a)} (34)
maxr˙(s,a)Hr,τ(m)(s,a){r˙(s,a)}+s𝒮γ~τ(m)(s)p(s|s,a). (35)

Step (34) is by Property 1 of the output from EVI, which is applied with confidence regions Hr,τ(m),Hp,τ(m)(η). Step (35) is because of the assumption that pHp,τ(m)(η). Altogether, the solution (ρ~τ(m),γ~τ(m)) is feasible to 𝖣(r˙,p) for any r˙Hr,τ(m). Now, by Lemma 2, we have maxs,s𝒮|γ~τ(m)(s)-γ~τ(m)(s)|2D. Finally, inequality (32) follows from the fact that the bias vector γ~τ(m) returned by EVI is component-wise non-negative, and there exists s𝒮 such that γ~τ(m)=0.

To see inequality (33), observe that

ρ~τ(m)+γ~τ(m)(s) maxr˙(s,a)Hr,τ(m)(s,a){r˙(s,a)}+s𝒮γ~τ(m)(s)maxp˙Hp,τ(m)(s,a;η){p˙(s|s,a)} (36)
r¯τ(m)(s,a)+s𝒮γ~τ(m)(s)p¯τ(m)(s|s,a). (37)

Step (36) is again by Property 1 of the output from EVI, and step (37) is by the assumptions that r¯τ(m)Hr,τ(m), and p¯τ(m)Hp,τ(m)(0)Hp,τ(m)(η).

Now, we claim that (ρ~τ(m)+𝗏𝖺𝗋r,t+2D𝗏𝖺𝗋p,t,γ~τ(m)) is a feasible solution to the tth period dual problem 𝖣(rt,pt), which immediately implies the Lemma. To demonstrate the claim, for every state-action pair (s,a) we have

r¯τ(m)(s,a) rt(s,a)-𝗏𝖺𝗋r,t (38)
s𝒮γ~τ(m)(s)pτ(m)(s|s,a) s𝒮γ~τ(m)(s)pt(s|s,a)-γ~τ(m)pt(|s,a)-p¯τ(m)(|s,a)1
s𝒮γ~τ(m)(s)pt(s|s,a)-2D𝗏𝖺𝗋p,t,. (39)

Inequality (38) is by Lemma 3 on the rewards. Step (39) is by inequality (32), and by Lemma 3 which shows pt(|s,a)-p¯τ(m)(|s,a)1𝗏𝖺𝗋p,t. Altogether, putting (38), (39) to inequality (33), our claim is shown, i.e., for all s𝒮 and a𝒜s,

ρ~τ(m)+𝗏𝖺𝗋r,t+2D𝗏𝖺𝗋p,t+γ~τ(m)(s)rt(s,a)+s𝒮γ~τ(m)(s)pt(s|s,a).

Hence, the lemma is proved.

We have

|s𝒮[p~τ(m)(s|s,a)-pt(s|s,a)]γ~τ(m)(s)|
γ~τ(m)(a)[p~τ(m)(|s,a)-p¯τ(m)(|s,a)1(b)+p¯τ(m)(|s,a)-pt(|s,a)1(c)]. (40)

In step (40), we know that

  • (a)2D by inequality (32),

  • (b)2𝗋𝖺𝖽-p,τ(m)(s,a)+η, by the facts that p~τ(m)(|s,a)Hp,τ(m)(s,a;η) and p¯τ(m)(|s,a)Hp,τ(m)(s,a;0),

  • (c)𝗏𝖺𝗋p,t by Lemma 3 on the bound on p.

Altogether, the Lemma is proved.

Now, we have

rt(st,at)
r¯τ(m)(st,at)-𝗏𝖺𝗋r,t (41)
r~τ(m)(st,at)-𝗏𝖺𝗋r,t-2𝗋𝖺𝖽-r,τ(m)(st,at) (42)
ρ~τ(m)+γ~τ(m)(st)-[s𝒮p~τ(m)(s|st,at)γ~τ(m)(s)]-1τ(m)
      -𝗏𝖺𝗋r,t-2𝗋𝖺𝖽-r,τ(m)(st,at) (43)
ρt*+γ~τ(m)(st)-[s𝒮pt(s|st,at)γ~τ(m)(s)]-1τ(m)
-2[𝗏𝖺𝗋r,t+𝗋𝖺𝖽-r,τ(m)(st,at)]-2D[2𝗏𝖺𝗋p,t+2𝗋𝖺𝖽-p,τ(m)(s,a)+η]. (44)

Step (41) is by Lemma 3 on t. Step (42) is by conditioning that event r holds. Step (43) is by Property 2 for the output of EVI. In step (44), we upper bound ρ~τ(m) by Lemma 4 and we upper bound s𝒮p~τ(m)(s|st,at)γ~τ(m)(s) by Lemma 5. Rearranging gives the Proposition.

To facilitate the exposition, we denote M(T) as the total number of episodes. By abusing the notation we, let τ(M(T)+1)-1=T. Episode M(T), containing the final round T, is interrupted and the algorithm is forced to terminate as the end of time T is reached. We can now rewrite the dynamic regret of the SWUCRL2-CW algorithm as the sum of dynamic regret from each episode:

Dyn-RegT(SWUCRL2-CW)=t=1T(ρt*-rt(st,at))=m=1M(T)t=τ(m)τ(m+1)-1(ρt*-rt(st,at)) (45)

To proceed, we define the set

U={m[M(T)]:pτ(m)(|s,a)Hp,τ(m)(s,a;η)(s,a)𝒮×𝒜s}.

For each episode m[M(T)], we distinguish two cases:

  • Case 1. mU. Under this situation, we apply Proposition 2 to bound the dynamic regret during the episode, using the fact that pτ(m) satisfies the assumptions of the proposition with D=Dτ(m)Dmax.

  • Case 2. m[M(T)]U. In this case, we trivially upper bound the dynamic regret of each round in episode m by 1.

For case 1, we bound the dynamic regret during episode m by summing the error terms in (7, 8) across the rounds t[τ(m),τ(m+1)-1] in the episode. The term (7) accounts for the error by switching policies. In (8), the terms 𝗋𝖺𝖽-r,τ(m),𝗋𝖺𝖽-p,τ(m) accounts for the estimation errors due to stochastic variations, and the term 𝗏𝖺𝗋r,t,𝗏𝖺𝗋p,t accounts for the estimation error due to non-stationarity.

For case 2, we need an upper bound on m[M(T)]Ut=τ(m)τ(m+1)-11, the total number of rounds that belong to an episode in [M(T)]U. The analysis is challenging, since the length of each episode may vary, and one can only guarantee that the length is W. A first attempt could be to upper bound as m[M(T)]Ut=τ(m)τ(m+1)-11Wm[M(T)]U1, but the resulting bound appears too loose to provide any meaningful regret bound. Indeed, there could be double counting, as the starting time steps for a pair of episodes in case 2 might not even be W rounds apart!

To avoid the trap of double counting, we consider a set QT[M(T)]U where the start times of the episodes are sufficiently far apart, and relate the cardinality of QT to m[M(T)]Ut=τ(m)τ(m+1)-11. The set QT[M(T)] is constructed sequentially, by examining all episodes m=1,,M(T) in the time order. At the start, we initialize QT=. For each m=1,,M(T), we perform the following. If episode m satisfies both criteria:

  1. 0.

    There exists some s𝒮 and a𝒜s such that pτ(m)(|s,a)Hp,τ(m)(s,a;η);

  2. 0.

    For every mQT, τ(m)-τ(m)>W,

then we add m into QT. Afterwards, we move to the next episode index m+1. The process terminates once we arrive at episode M(T)+1. The construction ensures that, for each episode m[M(T)], if τ(m)-τ(m)[0,W] for all mQT, then s𝒮a𝒜spτ(m)(|s,a)Hp,τ(m)(s,a); otherwise, m would have been added into QT.

We further construct a set Q~T to include all elements in QT and every episode index m such that there exists mQT with τ(m)-τ(m)[0,W]. By doing so, we can prove that every episode m[M(T)]Q~T satisfies pτ(m)(|s,a)Hp,τ(m)(s,a)s𝒮a𝒜s. The procedures for building Q~T (initialized to QT) are described as follows: for every episode index m[M(T)], if there exists mQT, such that τ(m)-τ(m)[0,W], then we add m to Q~T. Formally,

Q~T=QT{m[M(T)]:mQTτ(m)-τ(m)[0,W]}.
Figure 6: Both episodes mi and mi+4 belong to QT (and thus Q~T) because pτ(mi)Hp,τ(mi)(η) and pτ(mi+4)Hp,τ(mi+4)(η). mi+1 is added to Q~T (but not QT) because τ(mi+1)-τ(mi)[0,W]. mi+2 and mi+3 belong to neither of QT nor Q~T as pτ(mi+2)Hp,τ(mi+2)(η) and pτ(mi+3)Hp,τ(mi+3)(η).

We can formalize the properties of QT and Q~T as follows.

Lemma 6

Conditioned on Ep, |QT|Bp/η.

Lemma 7

For any episode mQ~T, we have pτ(m)(|s,a)Hp,τ(m)(s,a;η) for all sS and aAs.

The proofs of Lemmas 6 and 7 are presented in Sections id1 and id1, respectively.

Together with eqn. (45), we can further decompose the dynamic regret of the SWUCRL2-CW algorithm as

Dyn-RegT(SWUCRL2-CW)
= mQ~Tt=τ(m)τ(m+1)-1(ρt*-rt(st,at))+m[MT]Q~Tt=τ(m)τ(m+1)-1(ρt*-rt(st,at))
mQ~Tt=τ(m)τ(m+1)-1(ρt*-rt(st,at)) ()
+m[MT]Q~Tt=τ(m)τ(m+1)-1{[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st)]+1τ(m)} ()
+m[MT]Q~Tt=τ(m)τ(m+1)-1(2𝗏𝖺𝗋r,t+4Dmax𝗏𝖺𝗋p,t+2Dmaxη) ()
+m[MT]Q~Tt=τ(m)τ(m+1)-1[2𝗋𝖺𝖽-r,τ(m)(st,at)+4Dmax𝗋𝖺𝖽-p,τ(m)(st,at)], ()

where the last step makes use of Lemma 7 and Proposition 2. We accomplish the promised dynamic regret bound by the following four Lemmas that bound the dynamic regret terms (, , theequation-at-IDdz, theequation-at-IDea).

Lemma 8

Conditioned on Ep, we have

( )=O(BpWη).
Lemma 9

Conditioned on events Er,Ep, we have with probability at least 1-O(δ) that

( )=O~(D𝑚𝑎𝑥[M(T)+T])=O~(D𝑚𝑎𝑥[SATW+T]).
Lemma 10

With certainty,

(theequation-at-IDdz)=O((Br+D𝑚𝑎𝑥Bp)W+D𝑚𝑎𝑥Tη).
Lemma 11

With certainty, we have

(theequation-at-IDea)=O~(D𝑚𝑎𝑥SATW).

The proofs of Lemmas 8, 9, 10, and 11 are presented in Sections id1, id1, id1, and id1, respectively. Putting all these pieces together, we have the dynamic regret of the SWUCRL2-CW algorithm is upper bounded as

O~(BpWη+BrW+Dmax[BpW+SATW+Tη+SATW+T]),

and by setting W and η accordingly, we can conclude the proof.

We first claim that, for every episode mQT, there exists some state-action pair (s,a) and some time step tm[(τ(m)-W1),τ(m)-1] such that

pτ(m)(|s,a)-ptm(|s,a)1>η. (46)

For contradiction sake, suppose the otherwise, that is, pτ(m)(|s,a)-pt(|s,a)1η for every state-action pair s,a and every time step t[(τ(m)-W1),τ(m)-1]. For each state-action pair (s,a), consider the following cases on Nτ(m)(s,a)=q=(τ(m)-W)1τ(m)-1𝟏(sq=s,aq=a):

  • Case 1: Nτ(m)(s,a)=0. Then p^τ(m)(|s,a)=𝟎𝒮, and clearly we have

    pτ(m)(|s,a)-p^τ(m)(|s,a)1=1<𝗋𝖺𝖽-pτ(m)(s,a)<𝗋𝖺𝖽-pτ(m)(s,a)+η.
  • Case 2: Nτ(m)(s,a)>0. Then we have the following inequalities:

    pτ(m)(|s,a)-p¯τ(m)(|s,a)1
    = 1Nτ(m)+(s,a)q=(τ(m)-W)1τ(m)-1[pτ(m)(|s,a)-pq(|s,a)]𝟏(sq=s,aq=a)1 (47)
    1Nτ(m)+(s,a)q=(τ(m)-W)1τ(m)-1pτ(m)(|s,a)-pq(|s,a)1𝟏(sq=s,aq=a)η. (48)

    Step (47) is by the definition of p¯τ(m)(|s,a), and step (48) is by the triangle inequality. Consequently, we have

    pτ(m)(|s,a)-p^τ(m)(|s,a)1
    p¯τ(m)(|s,a)-p^τ(m)(|s,a)1+pτ(m)(|s,a)-p¯τ(m)(|s,a)1 (49)
    𝗋𝖺𝖽-pτ(m)(s,a)+η.

    Step (49) is true since we condition on the event p,

Combining the two cases, we have shown that pτ(m)(|s,a)Hp,τ(m)(s,a,η) for all s𝒮,a𝒜s, contradicting to the fact that mQT. Altogether, our claim on inequality (46) is established.

Finally, we provide an upper bound to |QT| using (46):

Bp= t[T-1]maxs𝒮,a𝒜s{pt+1(|s,a)-pt(|s,a)1}
mQTq=tmτ(m)-1maxs𝒮,a𝒜s{pq+1(|s,a)-pq(|s,a)1} (50)
mQTmaxs𝒮,a𝒜s{q=tmτ(m)-1(pq+1(|s,a)-pq(|s,a))1} (51)
> |QT|η. (52)

Step (50) follows by the second criterion of the construction of QT, which ensures that for distinct m,mQT, the time intervals [tm,τ(m)], [tm,τ(m)] are disjoint. Step (51) is by applying the triangle inequality, for each mQT, on the state-action pair (s,a) that maximizes the term q=tmτ(m)-1(pq+1(|s,a)-pq(|s,a))1=(pτ(m)(|s,a)-ptm(|s,a))1. Step (52) is by applying the claimed inequality (46) on each mQT. Altogether, the Lemma is proved.

We prove by contradiction. Suppose there exists an episode mQ~T, a state s𝒮, and an action a𝒜s such that pτ(m)(|s,a)Hp,τ(m)(s,a;η), then m should have been added to QT. To see this, we first note that episode m trivially satisfies criterion 1 in the construction of QT. Moreover, at the time when m is examined, we know that any m has been added to QT should satisfy τ(m)-τ(m)>W as otherwise m would have been added to Q~T. Therefore, we have prove mQTQ~T, which is clearly a contradiction.

Denote QT={m1,,m|QT|}. By construction, for every element mQ~T, there exists an unique mQT such that

τ(m)-τ(m)[0,W]. (53)

We can thereby partition the elements of Q~T into |QT| disjoint subsets Q~T(m1),,Q~T(m|QT|) such that

  1. 0.

    Each element mQ~T belongs to exactly one Q~T(m) for some mQT.

  2. 0.

    Each element mQ~T(m) satisfies τ(m)-τ(m)[0,W].

We bound () from above as

mQTmQ~T(m)t=τ(m)τ(m+1)-1(ρt*-rt(st,at))
mQTmQ~T(m)t=τ(m)τ(m+1)-11 (54)
= mQTmQ~T(m)(τ(m+1)-τ(m))
mQT(maxmQ~T(m)τ(m+1)-τ(m)) (55)
= mQT[maxmQ~T(m)(τ(m+1)-τ(m)+τ(m))-τ(m)]
mQT[maxmQ~T(m)(τ(m+1)-τ(m))+maxmQ~T(m)τ(m)-τ(m)]
mQT[W+W] (56)
= 2W|QT|
2BpW/η.

Here, inequality (54) holds by boundedness of rewards, inequality (55) follows from the fact that episodes are mutually disjoint, inequality (56) makes the observations that each episode can last for at most W time steps (imposed by the SWUCRL2-CW algorithm) as well as criterion 2 of the construction of Q~T(m)’s, and the last step uses Lemma 6.

We first give an upper bound for M(T), the total number of the episodes.

Lemma 12

Conditioned on events Er,Ep, we have M(T)SA(2+log2W)T/W=O~(SAT/W) with certainty.

\@trivlist

irst, to demonstrate the bound for M(T), it suffices to show that there are at most SA(2+log2W) many episodes in each of the following cases: (1) between time steps 1 and W, (2) between time steps jW and (j+1)W, for any j{1,,T/W-1}, (3) between time steps T/WW and T. We focus on case (2), and the edge cases (1, 3) can be analyzed similarly.

Between time steps jW and (j+1)W, a new episode m+1 is started when the second criterion νm(st,π~m(st))<Nτ(m)+(st,π~m(st)) is violated during the current episode m. We first illustrate how the second criterion is invoked for a fixed state-action pair (s,a), and then bound the number of invocations due to (s,a). Now, let’s denote m1,,mL as the episode indexes, where jWτ(m1)<τ(m2)<<τ(mL)<(j+1)W, and the second criterion for (s,a) is invoked during m for 1L. That is, for each {1,,L}, the DM breaks the while loop for episode m because νm(s,a)=Nτ(m)+(s,a), leading to the new episode m+1.

To demonstrate our claimed bound for M(T), we show that L2+log2W as follows. To ease the notation, let’s denote ψ=νm(s,a). We first observe that ψ1=Nτ(m1)+(s,a)1. Next, we note that for {2,L}, we have ψk=1-1ψk. 11endnote: 1 We proceed slightly differently from the stationary case, where the corresponding Nt(s,a) is non-decreasing in t (Jaksch et al. 2010), which is clearly not true here due to the use of sliding windows Indeed, we know that for each we have (τ(m+1)-1)-τ(m1)W, by our assumption on m1,,m. Consequently, the counting sum in Nτ(m)(s,a), which counts the occurrences of (s,a) in the previous W time steps, must have counted those occurrences corresponding to ψ1,,ψ-1. The worst case sequence of ψ1,ψ2,,ψL that yields the largest L is when ψ1=ψ2=1, ψ3=2, and more generally ψ=2-2 for 2. Since ψW for all W, we clearly have L2+log2W, proving our claimed bound on L. Altogether, during the T time steps, there are at most (SAT(2+log2W))/W episodes due to the second criterion and T/W due to the first, leading to our desired bound on M(T). \@endparenv

Next, we establish the bound for (). By Lemma 7, we know that γ~τ(m)(s)[0,2Dmax] for all m[M(T)]Q~T and s. For each episode m[M(T)]Q~T, we have

t=τ(m)τ(m+1)-1[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st)]
= -γ~τ(m)(sτ(m))+γ~τ(m)(sτ(m)+1)2Dmax+t=τ(m)τ(m+1)-1[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st+1)]=Yt. (57)

Summing (57) over m[M(T)]Q~T we get

m[M(T)]Q~Tt=τ(m)τ(m+1)-1[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st)]
2Dmax(M(T)-|Q~T|)+m[M(T)]Q~Tt=τ(m)τ(m+1)-1Yt. (58)

Define the filtration t-1={(sq,aq,Rq(sq,aq))}q=1t. Now, we know that 𝔼[Yt|t-1]=0, Yt is σ(t)-measurable, and |Yt|2Dmax. Therefore, we can apply the Hoeffding inequality (Hoeffding 1963) to show that

m[M(T)]Q~Tt=τ(m)τ(m+1)-1[s𝒮pt(s|st,at)γ~τ(m)(s)-γ~τ(m)(st)]=O(DmaxM(T)+DmaxTlog1δ)

with probability 1-O(δ), where we use the facts that M(T)-|Q~T|M(T) and m[M(T)]Q~Tt=τ(m)τ(m+1)-11T. Finally, note that

m[M(T)]Q~Tt=τ(m)τ(m+1)-11τ(m)m=1M(T)t=τ(m)τ(m+1)-11τ(m)i=1T/WWiW=O(T).

Hence, the Lemma is proved.

We first note that

m[MT]Q~Tt=τ(m)τ(m+1)-1(2𝗏𝖺𝗋r,t+4Dmax𝗏𝖺𝗋p,t+2Dmaxη)t=1T(2𝗏𝖺𝗋r,t+4Dmax𝗏𝖺𝗋p,t+2Dmaxη),

since 𝗏𝖺𝗋r,t0 and 𝗏𝖺𝗋p,t0 for all t. We can thus work with the latter quantity.

We first bound t=1T𝗏𝖺𝗋r,t. Now, recall the definition that, for time t in episode m, we have defined 𝗏𝖺𝗋r,t=q=τ(m)-Wt-1Br,q. Clearly, for iWq<(i+1)W, the summand Br,q only appears in 𝗏𝖺𝗋r,t for iWq<t(i+2)W, since each episode is contained in {iW,,(i+1)W} by our episode termination criteria (t is a multiple of W) of the SWUCRL2-CW algorithm. Altogether, we have

2t=1T𝗏𝖺𝗋r,t2t=1T-1Br,tW=2BrW. (59)

Next, we bound t=1T𝗏𝖺𝗋p,t. Now, we know that τ(m+1)-τ(m)W by our episode termination criteria (t is a multiple of W) of the SWUCRL2-CW algorithm. Consequently,

4Dmaxt=1T𝗏𝖺𝗋p,t4Dmaxt=1T-1Bp,tW=4DmaxBpW. (60)

Finally, combining (59, 60) with 2Dmaxt=1Tη, the Lemma is established.

Due to non-negativity of 𝗋𝖺𝖽-rt(s,a)’s and 𝗋𝖺𝖽-pt(s,a)’s, we have

m[MT]Q~Tt=τ(m)τ(m+1)-1𝗋𝖺𝖽-rτ(m)(st,at)m=1M(T)t=τ(m)τ(m+1)-1𝗋𝖺𝖽-rτ(m)(st,at), (61)
m[MT]Q~Tt=τ(m)τ(m+1)-1𝗋𝖺𝖽-pτ(m)(st,at)m=1M(T)t=τ(m)τ(m+1)-1𝗋𝖺𝖽-pτ(m)(st,at) (62)

We thus first show that, with certainty,

m=1M(T)t=τ(m)τ(m+1)-1𝗋𝖺𝖽-rτ(m)(st,at)=m=1M(T)t=τ(m)τ(m+1)-122ln(SAT/δ)Nτ(m)+(st,at)=O~(SATW), (63)
m=1M(T)t=τ(m)τ(m+1)-1𝗋𝖺𝖽-pτ(m)(st,at)=m=1M(T)t=τ(m)τ(m+1)-122Slog(ATW/δ)Nτ(m)+(st,at)=O~(SATW). (64)

We analyze by considering the dynamics of the algorithm in each consecutive block of W time steps, in a way similar to the proof of Lemma 9. Consider the episodes indexes m0,m1,mT/W,mT/W+1, where τ(m0)=1, and τ(mj)=jW for j{1,,T/W}, and mT/W+1=m(T)+1 (so that τ(mT/W+1-1) is the time index for the last episode in the horizon).

To prove (63, 64), it suffices to show that, for each j{0,1,,T/W}, we have

m=mjmj+1-1t=τ(m)τ(m+1)-11Nτ(m)+(st,at)=O(SAW). (65)

Without loss of generality, we assume that j{1,,T/W-1}, and the edge cases of j=0,T/W can be analyzed similarly.

Now, we fix a state-action pair (s,a) and focus on the summands in (65):

m=mjmj+1-1t=τ(m)τ(m+1)-1𝟏((st,at)=(s,a))Nτ(m)+(st,at)=m=mjmj+1-1νm(s,a)Nτ(m)+(s,a) (66)

It is important to observe that:

  1. 0.

    It holds that νmj(s,a)Nτ(mj)(s,a), by the episode termination criteria of the SWUCRL2-CW algorithm,

  2. 0.

    For m{mj+1,,mj+1-1}, we have m=mjm-1νm(s,a)Nτ(m)(s,a) . Indeed, we know that episodes mj,,mj+1-1 are inside the time interval {jW,,(j+1)W}. Consequently, the counts of “(st,at)=(s,a)” associated with {νm(s,a)}m=mjm-1 are contained in the W time steps preceding τ(m), hence the desired inequality.

With these two observations, we have

(66) νmj(s,a)max{νmj(s,a),1}+m=mj+1mj+1-1νm(s,a)max{m=mjm-1νm(s,a),1}
max{νmj(s,a),1}+(2+1)max{m=mjmj+1-1νm(s,a),1} (67)
(2+2)max{t=jW(j+1)W-1𝟏((st,at)=(s,a)),1}. (68)

Step (67) is by Lemma 19 in (Jaksch et al. 2010), which bounds the sum in the previous line. Step (68) is by the fact that episodes mj,,mj+1-1 partitions the time interval jW,,(j+1)W-1, and νm(s,a) counts the occurrences of (st,at)=(s,a) in episode m. Finally, observe that (66)=0 if νm(s,a)=0 for all m{mj,,mj+1-1}. Thus, we can refine the bound in (68) to

(66)(2+2)t=jW(j+1)W-1𝟏((st,at)=(s,a)). (69)

The required inequality (65) is finally proved by summing (69) over s𝒮a𝒜 and applying Cauchy Schwartz:

m=mjmj+1-1t=τ(m)τ(m+1)-11Nτ(m)+(st,at)=s𝒮,a𝒜sm=mjmj+1-1νm(s,a)Nτ(m)+(s,a)
(2+2)s𝒮,a𝒜st=jW(j+1)W-1𝟏((st,at)=(s,a))
(2+2)SAW.

Altogether, the Lemma is proved.

To begin, we consider the following regret decomposition, for any choice of (W,η)J, we have

Dyn-RegT(𝙱𝙾𝚁𝙻)= i=1T/H𝔼[t=(i-1)H+1iHTρt*-𝖱i(Wi,ηi,s(i-1)H+1)]
= i=1T/H𝔼[t=(i-1)H+1iHTρt*-𝖱i(W,η,s(i-1)H+1)]
+ i=1T/H𝔼[i=1T/H𝖱i(W,η,s(i-1)H+1)-𝖱i(Wi,ηi,s(i-1)H+1)]. (70)

For the first term in eqn. (70), we can apply the results from Theorem 1 to each block iT/H, i.e.,

t=(i-1)H+1iHT[ρt*-𝖱(W,η,s(i-1)H+1)]= O~(Bp(i)Wη+Br(i)W+Dmax[Bp(i)W+SAHW+Hη+SAHW+H]), (71)

where we have defined

Br(i)=(t=(i-1)H+1iHTBr,t),Bp(i)=(t=(i-1)H+1iHTBp,t)

for brevity. For the second term, it captures the additional rewards of the DM were it uses the fixed parameters (W,η) throughout w.r.t. the trajectory on the starting states of each block by the BORL algorithm, i.e., s1,,s(i-1)H+1,,s(T/H-1)H+1, and this is exactly the regret of the EXP3.P algorithm when it is applied to a Δ-arm adaptive adversarial bandit problem with reward from [0,H]. Therefore, for any choice of (W,η), we can upper bound this by

O~(HΔT/H)=O~(TH)

as Δ=O(ln2T). Summing these two, the regret of the BORL algorithm is

Dyn-RegT(𝙱𝙾𝚁𝙻)=O~(BpWη+BrW+Dmax[BpW+SATW+Tη+SATW+TH]). (72)

By virtue of the EXP3.P, the BORL algorithm is able to adapt to any choice of (W,η)J. Note that

HW*=3S23A12T12(Br+Bp+1)123T12(3T)121, (73)
S13A14η*=(Bp+1)12S13A14(Br+Bp+1)14T14S13A142T12=S13A14Φ. (74)

Therefore, there must exists some j and k such that

Hj/ΔWW*H(j+1)/ΔW,S13A14Φk/Δηη*S13A14Φ(k+1)/Δη (75)

By adapting W to Hj/ΔW and η to Φk/Δη, we further upper bound eqn. (72) as

Dyn-RegT(𝙱𝙾𝚁𝙻)=O~(Dmax(Br+Bp+1)14S23A12T34).

where we use H1/ΔW=exp(1) and Φ1/Δη=exp(-1) in the last step.

We first show the following lemma.

Lemma 13

Conditioned on Ep, there exists a state transition distribution pHp,t(0) such that for every pair of states s,sS,

p(s|s,a(s,s))ζ

for every time step t[T].

The proof of the lemma is provided in Section id1. We then consider the state transition distribution pHp,t(0) specified in Lemma 13. For an arbitrary state s𝒮, we consider the policy π such that π(s)=a(s,s) for all s𝒮 (see Assumption 2 for the definition of a(s,s)). Starting from an arbitrary state s𝒮, the policy either hits state s in the next step, which happens with probability at least ζ, or it transits to another state s′′s, which would then hit state s in the next step with probability at least ζ. Therefore, the hitting process stochstically dominates the Bernoulli process with success probability ζ, and thus the expected hitting time is at most 1/ζ.

First, we recall the definition of defintion of confidence region Hp,t(s,a;0) in eqn. 6,

Hp,t(s,a;0)={p˙Δ𝒮:p˙(|s,a)-p^t(|s,a)1𝗋𝖺𝖽-p,t(s,a)}.

For every pair of states s,s𝒮, we construct p by distinguishing the following two cases:

  • If Nt(s,a(s,s))=0, then by definition, 𝗋𝖺𝖽-p,t(s,a(s,s))1, therefore every probability distribution p¯ on 𝒮 belongs to Hp,t(s,a;0). Setting p(|s,a(s,s))=pt(|s,a(s,s)) for any t satisfies the requirement in the Lemma.

  • If Nt(s,a(s,s))>0, then we know from Assumption 2 and eqn. 5 that

    p¯t(s|s,a(s,s))=1Nt+(s,a(s,s))q=(τ(m)-W)1t-1pq(s|s,a(s,s))𝟏(sq=s,aq=a(s,s))ζ.

    By conditioning on p, we know that p¯t(|s,a(s,s))Hp,t(s,a(s,s);0), and we can thus set p(|s,a(s,s))=p¯t(|s,a(s,s)).

Combining the above cases, the transition probability distribution p satisfies the stated inequality in the Lemma, and we conclude the proof.

We first show that for any time step t[T], we have ρt*-ρt*pseudo=-l𝔼[Xt]. From Section id1, we have ρt* is equal to the optimal value of the following linear program 𝖯(rt,pt); while ρt*pseudo is equal to the optimal value of the following linear program 𝖯(rtpseudo,pt). The two linear programs has the same set of constraints, and follows from eqn. (14), the only difference is that the objective value of 𝖯(rtpseudo,pt) is l𝔼[Xt] more than that of 𝖯(rt,pt) (note that the summation of x(s,a) over s𝒮 and a𝒜s is 1 from the second constraint of the linear program (15)). Therefore, we have

t=1T(ρt*-ρt*pseudo)=t=1T-l𝔼[Xt]. (76)

Next, conditioned on any demand realizations X1,,XT, we can show by induction that the trajectory generated by Π on and pseudo are exactly the same as they use the same sequence of state transition distributions. Therefore,

t=1T𝔼[rt(st(),at())|{Xs}s=1t]-t=1T𝔼[rtpseudo(st(pseudo),at(pseudo))|{Xs}s=1t]=t=1T-lXt. (77)

Taking expectation on both sides of eqn. (77), and combining this with eqn. (76), we can conclude the statement.