Abstract
This work is motivated by a practical concern from our retail partner. Whilethey respect the advantages of dynamic pricing, they must limit the number ofprice changes to be within some constant. We study the classical pricebasednetwork revenue management problem, where a retailer has finite initialinventory of multiple resources to sell over a finite time horizon. We considerboth known and unknown distribution settings, and derive policies that have thebestpossible asymptotic performance in both settings. Our results suggest anintrinsic difference between the expected revenue associated with how manyswitches are allowed, which further depends on the number of resources. Ourresults are also the first to show a separation between the regret boundsassociated with different number of resources.
Quick Read (beta)
section§#2#1#3 \crefformatsubsection§#2#1#3 \crefformatsubsubsection§#2#1#3
[10pt]
Network Revenue Management with Limited Switches: Known and Unknown Demand Distributions
David SimchiLevi, Yunzong Xu, Jinglong Zhao
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139
Our work is motivated by a common business constraint in the retail industry. While retailers respect the advantages of dynamic pricing and price experimentation, they must limit the number of price changes to be within some range due to various practical reasons.
We study the classical pricebased network revenue management problem, where a retailer has finite initial inventory of multiple resources to sell over a finite time horizon. We consider both known and unknown distribution settings, and derive policies that have the bestpossible asymptotic performance in both settings. Our results suggest an intrinsic difference between the expected revenue associated with how many switches are allowed: in the distributionallyknown setting, a resourcedependent amount of switching budgets is required to achieve sublinear regret; and in the distributionallyunknown setting, a resourcedependent piecewiseconstant function that characterize the degree of sublinear regret.
To the best of our knowledge, our results are the first to exhibit a separation in regrets between inventory constrained and unconstrained stochastic online learning problems. Practically, retailers may benefit from our study in designing their budgets of price switches.
We consider the classical network revenue management problem. A firm has finite inventory of multiple resources to sell over a finite time horizon. The starting inventory is unreplenishable and exogenously given. The firm can control its sales through sequential decisions on the offered prices, which come from a discrete set of candidates. Its objective is to maximize the cumulated revenue.
We consider two settings when demand is either distributionallyknown, or distributionallyunknown; in both cases we assume demand to be stochastic, independent and time homogeneous (stochastic IID demand) over the time horizon. Such settings are well studied in the literature. In the distributionallyknown case, see Gallego and Van Ryzin (1997), Jasin (2014); in the distributionallyunknown case, see Besbes and Zeevi (2012), Ferreira et al. (2018). The literature has also studied other settings assuming different demand models, where demand could be adversarial (Mehta et al. 2005, Ball and Queyranne 2009), could be correlated over time (Araman and Caldentey 2009, Truong and Wang 2019), and could be time nonhomogeneous (Adelman 2007, Topaloglu 2009, Erdelyi and Topaloglu 2011). But all are beyond the scope of this paper.
In the distributionallyknown setting, the firm must tradeoff between revenuecentric decisions which maximize immediate expected revenue irrespective of inventory constraints, and inventorycentric decisions which maximize the yield from the remaining inventory. Revenuecentric decisions tend to be myopic and favor the most popular items, while inventory centric decisions tend to be conservative and favor the highly stocked items. Intuitively, the optimal policy alternates between revenuecentric and inventorycentric decisions based on the remaining inventory and time periods.
Moreover, in the distributionallyunknown setting, the firm must also tradeoff between exploitation decisions which utilize the learned information to maximize the expected revenue as if it was in the distributionallyknown setting, and exploration decisions which discover the demand distributions of the less certain actions, regardless of how rewarding an action is. Exploitation decisions tend to favor the more rewarding items (with respect to inventory constraints), while exploration decisions tend to favor the less discovered items. Intuitively, the optimal policy alternates between exploitation and exploration decisions based on the information learned from the realized demands.
In both settings, the optimal policy has to adjust its decisions and instantaneously switch between actions over the time horizon. However, not all retailers have the infrastructure to query the realized demand in realtime, to adjust its decisions instantaneously, or to switch between actions as freely as possible, because changing the posted prices is too costly for many retailers (Levy et al. 1998, Zbaracki et al. 2004), and frequent price changes may confuse the customers (Jørgensen et al. 2003). A common practice for many firms is that they restrict the number of price changes to be as few as possible; see Netessine (2006), Chen et al. (2015), Cheung et al. (2017), Chen and Chao (2019), SimchiLevi and Xu (2019), Perakis and Singhvi (2019).
Motivated by this problem, we analyze the asymptotic performance of policies with limited switches on the network revenue management problem with known and unknown demand distribution. In both settings, we show there is an intrinsic difference between the expected revenue earned associated with how many switches are allowed, which further depends on the number of resources.
We consider the time horizon to consist of a discrete number of time periods. We model each product as having discrete “prices points” at which it could be sold. This captures situations where fixed price points have been predetermined by market standards, e.g. a common menu of prices that end in $9.99: $69.99, $79.99, $99.99. There is quite a difference here from the papers that model each product as having a continuous price range; see Jasin (2014) for distributionallyknown setting, and Wang et al. (2014), Chen and Shi (2019), Li and Zheng (2019) for distributionallyunknown setting. Our work requires completely different analytical techniques.
We model the business constraints of limited price changes as a hard constraint. The retailer is initially endowed with a constant number of switching budgets, to change the posted price (or price vector) from one to another. For example, if there is only one product, a sequence of prices $(\$79.99,\$89.99,\$79.99,\$89.99)$ uses two distinct prices, and makes three price changes.
We will consider the following two demand models one by one, where the design and analysis of effective policies to the second model are based on understandings of the first model.

0.
Distributionally known demand (Section id1): In each time period, a stochastic, independent and time homogeneous demand is triggered from the prices we post.

0.
Distributionally unknown demand (Section id1): We only know that the demand is independent and time homogeneous, but we do not have any prior knowledge about the distributions. We have to learn the distributions over time.
For both settings, we can generalize our results to the “Bandits with Knapsacks” model (Badanidiyuru et al. 2013), which allows the perunit revenue to be random, and further allows for any correlated distributions of the consumption quantity and the revenue.
We focus on the asymptotic performance of policies. We introduce our results in both settings using linear scaling as our asymptotic regime, which is omnipresent in the literature. In the distributionallyknown case, see Gallego and Van Ryzin (1997), Liu and Van Ryzin (2008), Jasin (2014), Bumpensanti and Wang (2018); in the distributionallyunknown case, see Besbes and Zeevi (2012), Ferreira et al. (2018), Chen and Shi (2019). Specifically, we assume both the time horizon and all the initial inventory levels are scaled by the same factor $\kappa $, while demand distributions in all periods remain the same.
There is a second asymptotic regime in the literature that does not require linear scaling. Instead, it only requires the minimum quantity of any resource (and time periods) to go to infinity. In the distributionallyknown case, see Alaei et al. (2012), Ma et al. (2018); in the distributionallyunknown case, see Badanidiyuru et al. (2013).
We adopt the revenue loss as our optimality notion. In the distributionallyknown setting, revenue loss is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal policy with infinite switching budgets. The optimal policy with infinite switching budgets could be solved using a dynamic program, yet due to curse of dimensionality people deem it hard to solve (Gallego and Van Ryzin 1997). Revenue loss is one common metric to quantify the optimality gap, among the rich literature of different metrics; see Adelman (2007), Jasin (2014), Chen and Goldberg (2018).
In the distributionallyunknown setting, revenue loss is often referred to as “regret”. We study two regret notions in this setting: the first one, referred to as the $s$switch regret, is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal clairvoyant policy with the same switching budget; the second one, referred to as the overall regret, is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal clairvoyant policy with infinite switching budgets. (Here, a clairvoyant policy is endowed with perfect knowledge of the distributions, but not the exact realizations.) While the overall regret is clearly larger than the $s$switch regret, interestingly, we show that asymptotically they are of the same order.
We provide full characterization of the revenue loss under two demand models, endowed with different levels of switching budgets.
In the distributionallyknown setting (Section id1), there exists a critical switching budget above which we show the revenue loss is in the order of $\mathrm{\Theta}(\sqrt{T})$; and below which the loss is in the order of $\mathrm{\Theta}(T)$. Specifically, we show that for any problem instance, there exists an instance dependent quantity $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}$ (“Dimension of the Sparsest LPsolution”, see the precise definition in Section id1), such that any policy that makes at most $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}2$ switches must incur an $\mathrm{\Omega}(T)$ revenue loss, while a class of static control policies that make at most $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1$ switches can guarantee an $O(\sqrt{T})$ revenue loss. Combining the above results, we show that there is an intrinsic gap on expected revenue loss, if one more critical price change is allowed. See Figure 1.
In the distributionallyunknown setting (Section id1), we show matching upper and lower bounds on the optimal (worstcase) regret in the order of $\stackrel{~}{\mathrm{\Theta}}({T}^{\frac{1}{2{2}^{\nu (s,m)}}})$, where $\nu (m,s)=\lfloor \frac{sm1}{K1}\rfloor $. Here $s$ stands for the switching budget, $m$ for the number of resources, and $K$ for the number of feasible price vectors, see Figure 2 for an illustration. Specifically, we propose an SSwitch Balanced Exploration with LPbased Exploitation algorithm that achieves an $\stackrel{~}{O}({T}^{\frac{1}{2{2}^{\nu (s,m)}}})$ regret; and we show a lower bound that any policy must incur an $\stackrel{~}{\mathrm{\Omega}}({T}^{\frac{1}{2{2}^{\nu (s,m)}}})$ regret in the worst case distributions. Our result generalizes the bounds in SimchiLevi and Xu (2019), which is a special case when $m=0$, i.e., no inventory constraints.
In the distributionallyunknown setting, our results reveal an interesting separation between the (stochastic) online learning problems with inventory constraints and without inventory constraints. Curiosity has been aroused around this separation for long, that the regret bounds in both the classical multiarmed bandit problems and the bandits with knapsacks problems are in the order of $\stackrel{~}{O}(\sqrt{T})$. Yet people believe that the problem with inventory constraints is “harder” than the unconstrained counterparts. Our work shows that the constrained and unconstrained problems are indeed different, and characterize the dependency on the number of inventory constraints. If we fix all the other problem primitives unchanged and only add in one more inventory constraint, then the regret bounds is going to be larger or equal to before, illustrating that inventory constraints indeed increase the revenue loss – they make the problem “harder”.
We establish equivalence results of limited switching budgets and limited adaptivity, in the context of network revenue management problems. Limited switching budgets, as we have introduced in Section id1, is the hard constraint that one cannot change prices more than a fixed number of times. Limited adaptivity, as originally introduced in Dean et al. (2005, 2008) for stochastic packing and stochastic knapsack problems, is the hard constraint that one cannot collect feedbacks and adapt to the feedbacks more than a fixed number of times. We can think of full adaptivity as the ability to collect immediate feedbacks, or to query the status of the system; and limited adaptivity corresponds to collecting batched and delayed feedbacks. An algorithm with limited switching budgets can keep track of the demands in each period, yet constrained on price changes; while an algorithm with limited adaptivity can prescribe a trajectory of different prices with unlimited changes, yet without knowing the status of the system. When either ability (switching budgets or adaptivity) is constrained, we do not need the other ability more than necessary.
Finally, we introduce new techniques in the design and analysis of asymptotically optimal algorithms, which we will discuss in the following sections when we introduce the proof sketches.
There are two streams of network revenue management problems, the quantitybased problem where customers first arrive and then seller makes accept / reject decisions, and the pricebased problem where seller first announces a price and then customers makes purchase decisions. For more discussion, see Talluri and Van Ryzin (2006), Gallego et al. (2018). Quantitybased and pricebased problems are very closely related. For example, when there is only one single resource, these two problems are equivalent (Maglaras and Meissner 2006).
There is a rich literature in reoptimization techniques in network revenue management, featuring in frequent or infrequent adjustments in the action space (depending on if the problem is quantity or price based). Reoptimization techniques choose careful moments in time to make adjustments (e.g. to change prices in the pricebased context), to better respond to the randomness, and to achieve a smaller expected revenue loss. Jasin (2014), in the context of pricebased network revenue management, requires $O(\mathrm{log}T)$ number of price changes to achieve a $O(\mathrm{log}T)$ order of expected revenue loss, compared to the DLP upper bound. For more discussion of the reoptimization techniques, as well as the selection of different upper bounds, see Reiman and Wang (2008), Secomandi (2008), Chen and Homemde Mello (2010), Ciocan and Farias (2012), Jasin and Kumar (2012), Bumpensanti and Wang (2018) for quantitybased reoptimization methods; see Jasin (2014), Chen et al. (2015) for pricebased reoptimization methods; see Chen and Farias (2013), Vera et al. (2019) for reoptimization methods on singleresource dynamic pricing problems; see Arlotto and Xie (2018), Arlotto and Gurvich (2019), Vera and Banerjee (2019) for reoptimization methods on multisecretary problems and multiple constraint knapsack problems, which are directly related to the network revenue management problem.
There is a rich literature on stochastic multiarmed bandit (MAB) problems with switching costs Agrawal et al. (1988, 1990), CesaBianchi et al. (2013). See Jun (2004) for a survey. Most paper from this stream of research penalizes switching costs into the objective function of the reward calculation. That is, the objective is to minimize a convex combination of the regret incurred, and the switching costs. By contrast, our paper treats the switching budgets as hard constraint that can never be violated.
A closely related paper is SimchiLevi and Xu (2019), which discusses the stochastic multiarm bandit problem with the same hard constraint of switching budgets as us. Our work generalize their results by incorporating additional knapsack constraints Badanidiyuru et al. (2013). This extension gives us more modeling flexibility in network revenue management, as well as medical trials, and online advertising, etc. Moreover, our work develops significant new techniques to address new challenges due to the knapsack constraints. Even in the distributionallyknown case, a revenue maximizing policy has to switch between prices to avoid consuming some inventory too quickly; by contrast, without knapsack constraints, the best policy always sticks to one single price.
There are several ways to model the business constraint of limited switching budgets. Literature have studied quite a few models of infrequent price changes, which are similar to the switching budgets we consider in our paper. In the distributionallyknown setting, Netessine (2006) considers a singleproduct deterministic demand model, and focuses on the impact of a limited number of price changes. Chen et al. (2015) considers the same network revenue management model as we do, and focuses on asynchronously changing the prices of a small subset of products. In the distributionally unknown setting, Cheung et al. (2017) considers a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy makes limited number of price changes. Chen and Chao (2019) studies a multiperiod stochastic joint inventory replenishment and pricing problem with unknown demand and limited price changes. SimchiLevi and Xu (2019) considers the stochastic multiarmed bandit problem with a general switching constraint. All of the above three papers adopt the idea of switching budgets as this paper does. Yet none of the above paper considers the existence of nonreplenishable inventory constraints.
To the best of our knowledge, this paper is the first to characterize the exact dependency of expected revenue loss on limited switching budgets, in the pricebased network revenue management setup, with and without the knowledge of demand distributions. In particularly, results in this paper generalize and unify several results appearing in the following papers: Gallego and Van Ryzin (1994), Besbes and Zeevi (2012), Badanidiyuru et al. (2013), Ferreira et al. (2018), Ma et al. (2018), SimchiLevi and Xu (2019).
Throughout this paper, let $\mathbb{R}$ be the set of real numbers, and ${\mathbb{R}}_{+}$ the set of nonnegative real numbers. Let $\mathbb{N}$ be the set of positive integers. $\forall N\in \mathbb{N}$, let $[N]=\{1,2,\mathrm{\dots},N\}$ denote the set of integers that are no more than $N$. We use bold font letters for vectors, where we do not explicitly indicate how large the dimension is. Let ${\bm{p}}^{\U0001d5b3}$ be the transpose of any vector $\bm{p}$. For any vector $\bm{x}\in {\mathbb{R}}^{K}$, let ${\parallel \bm{x}\parallel}_{0}={\sum}_{k\in [K]}{\text{\U0001d7d9}}_{{x}_{k}\ne 0}$ be the ${L}_{0}$ norm of $\bm{x}$, i.e. the number of nonzero elements in vector $\bm{x}$. For any real number $x\in \mathbb{R}$, let ${x}^{+}=\mathrm{max}\{x,0\}$ be the nonnegative part of $x$. For any positive real number $x\in {\mathbb{R}}_{+}$, let $\lfloor x\rfloor $ be the largest integer that is smaller or equal to $x$; for any nonpositive real number $x\in \mathbb{R}{\mathbb{R}}_{+}$, let $\lfloor x\rfloor =0$. We use big $O,\mathrm{\Omega},\mathrm{\Theta}$ notations to hide constant factors, and use $\stackrel{~}{O},\stackrel{~}{\mathrm{\Omega}},\stackrel{~}{\mathrm{\Theta}}$ notations to hide both constant factors and logarithmic factors.
We study the classical network revenue management problem. Let there be discrete, finite time horizon with $T$ periods. Time starts from period $1$ and ends in period $T$. Let there be $n$ different products generated by $m$ different resources, each resource endowed with finite initial inventory ${B}_{i},\forall i\in [m]$. Let $A={[{a}_{ij}]}_{i\in [m],j\in [n]}$ be the consumption matrix. Each entry ${a}_{ij}\in {\mathbb{R}}_{+}$ stands for the amount of inventory $i\in [m]$ used, if one unit of product $j\in [n]$ is sold. Let ${a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}={\mathrm{max}}_{i,j}{a}_{ij}$. Each column ${A}^{j}$ stands for the consumption vector of each resource. Each column contains at least one nonzero entry. Let ${A}_{i}$ denote the $i$th row of $A$.
In each period $t$, the firm can offer a price for each product from a finite set of $K$ price vectors, which we denote using $\{{\bm{p}}_{1},{\bm{p}}_{2},\mathrm{\dots},{\bm{p}}_{K}\}$. Here ${\bm{p}}_{k}=({p}_{1,k},{p}_{2,k},\mathrm{\dots},{p}_{n,k})$ are the prices for products $1,\mathrm{\dots},n$, under price vector $k\in K$. All the price vectors are fixed and given. Let ${p}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}={\mathrm{max}}_{j,k}{p}_{j,k}$ be the maximum price. Given price ${\bm{p}}_{k}$, the demand for each product $j\in [n]$ is a given Bernoulli random variable ${Q}_{j,k}:={Q}_{j}({\bm{p}}_{k})\in \{0,1\}$. Denote ${q}_{j,k}:=\mathbb{E}[{Q}_{j,k}]$ as the expected demand. Our analysis can be directly generalized to general bounded distributions ${Q}_{j,k}\in [0,1]$, similar to most prophet inequality and revenue management literature Kleinberg and Weinberg (2012), Alaei et al. (2012), Dütting et al. (2017), Liu and Van Ryzin (2008), Jasin (2014), Ma et al. (2018). For the ease of exposition we describe our results in the Bernoulli setting, yet our proof only relies on the fact that demands are bounded between $[0,1]$.
For each unit of demand generated for product $j\in [n]$ under price vector ${\bm{p}}_{k},\forall k\in [K]$, the firm generates ${p}_{j,k}$ units of revenue by depleting ${a}_{ij}$ units of each inventory $i\in [m]$. If no demand is generated, all the remaining inventory is carried over into the next period. The selling process stops immediately when the total cumulative demand of any resource exceeds its initial inventory.
This modeling framework gives us the flexibility to have two identical consumption vectors sold at two different prices. We simply treat them as different products, i.e. in the airline context, main cabin vs. basic economy.
In Section id1 we have addressed the business constraint that a firm is often prevented from changing the price vectors too many times. Throughout the horizon, we can change the posted price vectors for no more than $S$ times. We treat $S$ as a fixed constant. Intuitively, when $S$ is small we are more constrained, and when $S$ is large we are close to the classical network revenue management problem studied in the literature. Our objective is to maximize the total expected revenue in the face of limited switches.
We adopt the general notation $\pi :{\mathbb{R}}^{n}\times [T]\to [n]$ to denote any policy with the full information about stochastic distributions, that suggests which price vector to use given the remaining periods $T$ and remaining inventory $\bm{B}$. For any $s\in \mathbb{N}$, let $\mathrm{\Pi}[s]$ be the set of policies that changes prices for no more than $s$ times. For any $s,{s}^{\prime}\in \mathbb{N}$ such that $s\le {s}^{\prime}$, $\mathrm{\Pi}[s]\subseteq \mathrm{\Pi}[{s}^{\prime}]$. Let $\mathrm{\Pi}:={lim}_{s\to +\mathrm{\infty}}\mathrm{\Pi}[s]$ be the set of policies with infinite switching budgets (which is the set of all feasible policies when there is no switching constraint). Let $\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )$ be the expected revenue that policy $\pi $ generates in expectation. Let ${\pi}^{*}[s]\in \mathrm{arg}{\mathrm{max}}_{\pi \in \mathrm{\Pi}[s]}\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )$ be one of the optimal dynamic policies with switching budget $s$.
We adopt linear scaling as our asymptotic regime. We assume all the problem parameters $\mathcal{I}=(m,n,K,\bm{p},\bm{q},A)$ are fixed, where $\bm{p}:={({\bm{p}}_{k})}_{k\in [K]}$ and $\bm{q}:={({q}_{j,k})}_{j\in [n],k\in [K]}$. We assume $T$ and ${B}_{j},\forall j\in [m]$ are at the same comparable order, and larger than either one of $m,n,K$.
We summarize our results, as well as some closely related results from literature, in Figure 3.
For any problem instance $\mathcal{I}=(m,n,K,\bm{p},\bm{q},A)$, litarature have studied the following deterministic linear program (DLP).
${\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}=\underset{{\{{x}_{k}\}}_{k\in [K]}}{\mathrm{max}}{\displaystyle \sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}{x}_{k}$  (1)  
$\text{s.t.}{\displaystyle \sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{x}_{k}$  $\le {B}_{i}$  $\forall i\in [m]$  (2)  
$\sum _{k\in [K]}}{x}_{k$  $\le T$  (3)  
${x}_{k}$  $\ge 0$  $\forall k\in [K]$  (4) 
It is well known that the above DLP serves as an upper bound to any policy $\pi \in \mathrm{\Pi}$ (“DP OPT” as in Figure 3), even an optimal policy with infinite switching budgets. The worstcase gap between the best policy and the DLP upper bound is in the order of $\mathrm{\Theta}(\sqrt{T})$^{1}^{1} 1 We give some comments on the $\mathrm{\Omega}(\sqrt{T})$ lower bound for this gap. It is a worstcase lower bound from Bumpensanti and Wang (2018), Ma et al. (2018). Both these results from literature require the feasible price set to have only one price (vector). Since there is only one price, the optimal dynamic programming policy is essentially a fixed, zeroswitch policy, which is a special case when $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}=1$. It remains an interesting open question how small could the gap be, between the DLP upper bound and any policy endowed with more than $(\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1)$ switching budgets., as shown in Figure 3.
Our results contribute to the literature by first characterizing how a policy’s switching budget affects its revenue loss. The main implication of our results is that for each problem instance, there is an intrinsic gap between the best achievable performance of policies with $\mathrm{\Lambda}2$ switches and policies with $\mathrm{\Lambda}1$ switches, as shown in Figure 3.
We start with a closer look of the DLP optimal solutions. Let the set of optimal solutions from the DLP be ${X}^{*}=\mathrm{arg}{\mathrm{max}}_{\bm{x}\in {\mathbb{R}}^{K}}\{(\mathit{\text{theatequationgroupatID}})(\mathit{\text{theatequationgroupatID}}),(\mathit{\text{theatequationgroupatID}}),(\mathit{\text{theatequationgroupatID}})\text{are satisfied}\}$. Let $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}=\mathrm{min}\{{\parallel \bm{x}\parallel}_{0}\bm{x}\in {X}^{*}\}$ be the least number of nonzero variables of any optimal solution, namely, the Dimension of Sparsest LP–solutions. Let $\mathcal{D}\mathcal{S}\mathcal{L}=\mathrm{arg}\mathrm{min}\{{\parallel \bm{x}\parallel}_{0}\bm{x}\in {X}^{*}\}$ be the set of such solutions. For any ${\bm{x}}^{*}\in \mathcal{D}\mathcal{S}\mathcal{L}$, let $\mathcal{K}({\bm{x}}^{*})=\{k\in [K]{\bm{x}}^{*}\ne 0\}\subseteq [K]$ be the subset of dimensions that are nonzero in ${\bm{x}}^{*}$.
Theorem 1
For any problem instance $\mathrm{I}\mathrm{=}\mathrm{(}m\mathrm{,}n\mathrm{,}K\mathrm{,}\mathbf{p}\mathrm{,}\mathbf{q}\mathrm{,}A\mathrm{)}$, there is an associated $\mathrm{DSL}$ number. Any policy $\pi \mathrm{\in}\mathrm{\Pi}\mathit{}\mathrm{[}\mathrm{DSL}\mathrm{}\mathrm{2}\mathrm{]}$ earns an expected revenue
$\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )\le {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\mathrm{\Omega}(T)$ 
We outline three key steps here and defer the details of our proof to Appendix id1. We first identify a clean event, such that the realized demands are close to the expected demands that the LP suggests. This clean event happens with high probability $(1\frac{2}{{T}^{3}})$. In the second step, conditioning on such event, the maximum amount of revenue we generate is no more than $O(\sqrt{T})$ compared to what the LP suggests; and the minimum amount of inventory demanded is no less than $O(\sqrt{T})$ compared to what the LP suggests, resulting in no more than $O(\sqrt{T})$ of realized revenue. In the third step, we show that the revenue loss from insufficient price changes is scaled in the order of $O(T)$, which dominates the $O(\sqrt{T})$ amount revenue due to randomness. Such clean event analysis, originating from the online learning literature (Badanidiyuru et al. 2013, Lattimore and Szepesvári 2018, Slivkins 2019), is new to the stochastic control literature.
We highlight that the lower bound established in Theorem 1 is a perinstance lower bound, as it holds for every single problem instance. Such bound is much stronger than the worstcase type lower bounds that are omnipresent in the revenue management literature, which only holds for a specific instance, see the following example in our setting.
Proposition 1 (Proposition 4, Bumpensanti and Wang (2018); Lemma 5, Ma et al. (2018))
There exists a problem instance ${\mathrm{I}}_{\mathrm{0}}\mathrm{=}\mathrm{(}m\mathrm{,}n\mathrm{,}K\mathrm{,}\mathbf{p}\mathrm{,}\mathbf{q}\mathrm{,}A\mathrm{)}$, such that any policy $\pi \mathrm{\in}\mathrm{\Pi}$ earns an expected revenue
$\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )\le {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\mathrm{\Omega}(\sqrt{T})$ 
The static control policy is a well known policy in the literature that has an asymptotic guarantee of $O(\sqrt{T})$ revenue loss. See Gallego and Van Ryzin (1997), Cooper (2002), Maglaras and Meissner (2006), Liu and Van Ryzin (2008). In this section, we modify the wellestablished static control policy to prove a similar result in our setting, when the selling horizon stops immediately when the total cumulative demand of any resource exceeds its initial inventory. Our setting requires different techniques, because the papers listed above on the static control policy do not explicitly address when the cumulative demand would exceed the initial inventory. When this does happen halfway, the static control policy loses chances to gain revenue in the remaining periods, even if there are still other remaining inventory.
Nonetheless, we tweak the static control policy, so that with high probability the selling horizon never stops earlier than the last period $T$. This is achieved by selecting the value of $\gamma $ in the first step of Algorithm 1. Our policy has an asymptotic guarantee of $O(\sqrt{\frac{\mathrm{log}T}{T}}){\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}$ revenue loss, which is slightly worse off by a $\sqrt{\mathrm{log}T}$ factor than the revenue loss in the classical setting. The $\mathrm{log}T$ term comes from the Chernoff bound using concentration inequalities. Similar ideas have been used in Hajiaghayi et al. (2007), Ma et al. (2018), Balseiro et al. (2019) to prove asymptotic results in similar but different settings.
We explain the third step permutation. Suppose $\mathcal{K}({\bm{x}}^{*})=\{1,3,4\}$. In this case, $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}=3$ and there are $6$ permutations. There are $6$ possible policies as suggested in Definition 1. Some of these policies have better performance than others. But in an asymptotic regime, they all have sublinear regret, compared to the DLP upper bound.
Theorem 2
Any policy $\pi $ as defined in Algorithm 1 earns an expected revenue
$\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )\ge {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\left(1O\left(\sqrt{{\displaystyle \frac{\mathrm{log}T}{T}}}\right)\right)={\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\stackrel{~}{O}(\sqrt{T}).$ 
We outline two key steps here and defer the details of our proof to Appendix id1. In the first step, we show that with high probability, the selling horizon never stops earlier than the last period $T$. Second, conditioning on this high probability event, the expected revenue is at least $\gamma $ fraction of the LP objective. Putting two steps together we know the total revenue loss is in the order of $1\gamma $.
Combining Theorem 2 with Theorem 1, we identify an intrinsic performance gap when the switching budget is around a critical threshold $\mathrm{\Lambda}1$. See Figure 3. Combining Theorem 2 with Proposition 1, we know that the Tweaked LP policy is asymptotically optimal (in the sense that it achieves the optimal revenue loss rate relative to the DLP upper bound), using only a few switches.
We keep studying the model described in Section id1, in the distributionallyunknown setting. Given price ${\bm{p}}_{k}$, the demand for each product $j\in [n]$ is an unknown Bernoulli^{2}^{2} 2 All our results can be directly extended to general unknown bounded random variables without further steps. random variable, ${Q}_{j,k}:={Q}_{j}({\bm{p}}_{k})\in \{0,1\}$. That is, the expected value ${q}_{j,k}:=\mathbb{E}[{Q}_{j,k}]$ is unknown and has to be sequentially learned over time. Our analysis can be directly generalized to general bounded distributions ${Q}_{j,k}\in [0,1]$, similar to most online learning literature (Lattimore and Szepesvári 2018, Slivkins 2019). The main tradeoff we would like to address in this setting is between exploration and exploitation. Let $\bm{q}={({q}_{j,k})}_{j\in [n],k\in [K]}$ denote the true “mean demand matrix” that is unknown to the seller. We also call $\bm{Q}$ an “environment” The seller wants to learn the true environment while concurrently maximizing the expected total revenue over $T$ periods.
We adopt the following asymptotic regime that is also considered in Besbes and Zeevi (2012) and Ferreira et al. (2018): (1) the problem parameters $(m,n,K,\bm{p},A)$ are fixed, and (2) $T$ and ${B}_{j},\forall j\in [m]$ are at the same comparable order, and much larger than either one of $(m,n,K,\bm{p},A)$. In other words, ${B}_{i}=\mathrm{\Theta}(T)$ for all $i\in [m]$, while $m,n,K,\bm{p},A=O(1)$. Without loss of generality, we assume that $m+1\le K$ (otherwise there are always redundant resource constraints that can be eliminated).
Let $\varphi $ denote any nonanticipating learning policy, and ${\varphi}_{t}\in [k]$ denote the price vector chosen by policy $\varphi $ at period $t\in [T]$. More formally, ${\varphi}_{t}$ establishes a probability kernel acting from the space of historical actions and observations to the space of actions at period $t$. For any $s\in \mathbb{N}$, let $\mathrm{\Phi}[s]$ be the set of learning policies that change price vectors for no more than $s$ times almost surely for all $\bm{Q}$. For any $s,{s}^{\prime}\in \mathbb{N}$ such that $s\le {s}^{\prime}$, $\mathrm{\Phi}[s]\subseteq \mathrm{\Phi}[{s}^{\prime}]$. Let $\mathrm{\Phi}:={lim}_{s\to +\mathrm{\infty}}\mathrm{\Phi}[s]$ be the set of learning policies with infinite switching budgets. Let ${\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}(\varphi )$ be the expected revenue that a learning policy $\varphi $ generates under environment $\bm{Q}$.
Let $\pi $ denote any clairvoyant policy with the full knowledge of the true environment $\bm{Q}$. For any $s\in \mathbb{N}$, let ${\mathrm{\Pi}}_{\bm{Q}}[s]$ be the set of clairvoyant policies that change price vectors for no more than $s$ times under the true environment $\bm{Q}$. For any $s,{s}^{\prime}\in \mathbb{N}$ such that $s\le {s}^{\prime}$, ${\mathrm{\Pi}}_{\bm{Q}}[s]\subseteq {\mathrm{\Pi}}_{\bm{Q}}[{s}^{\prime}]$. Let ${\mathrm{\Pi}}_{\bm{Q}}:={lim}_{s\to +\mathrm{\infty}}{\mathrm{\Pi}}_{\bm{Q}}[s]$ be the set of clairvoyant policies with infinite switching budgets. Let ${\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}(\pi )$ be the expected revenue that a clairvoyant policy $\pi \in {\mathrm{\Pi}}_{\bm{Q}}$ generates under environment $\bm{Q}$. Let ${\pi}_{\bm{Q}}^{*}[s]=\mathrm{arg}{sup}_{\pi \in {\mathrm{\Pi}}_{\bm{Q}}[s]}\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )$ be the optimal clairvoyant policy with an switching budget of $s$.
The performance of an $s$switch learning policy $\varphi \in \mathrm{\Phi}[s]$ is measured against the performance of the optimal $s$switch clairvoyant policy ${\pi}_{\bm{Q}}^{*}[s]$. Specifically, we define the $s$switch regret of a learning policy $\varphi \in \mathrm{\Phi}[s]$ as the worstcase difference between the expected revenue of the optimal $s$switch clairvoyant policy ${\pi}_{\bm{Q}}^{*}[s]$ and the expected revenue of policy $\varphi $:
$${R}_{s}^{\varphi}(T)=\underset{\bm{Q}}{sup}\left\{{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}({\pi}_{\bm{Q}}^{*}[s]){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}(\varphi )\right\}.$$ 
We also measure the performance of policy $\varphi $ against the performance of the optimal unlimitedswitch clairvoyant policy ${\pi}_{\bm{Q}}^{*}:={\pi}_{\bm{Q}}^{*}[\mathrm{\infty}]$. Specifically, we define the overall regret of a learning policy $\varphi \in \mathrm{\Phi}[s]$ as the worstcase difference between the expected revenue of the optimal unlimitedswitch clairvoyant policy ${\pi}_{\bm{Q}}^{*}$ and the expected revenue of policy $\varphi $:
$${R}^{\varphi}(T)=\underset{\bm{Q}}{sup}\left\{{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}({\pi}_{\bm{Q}}^{*}){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}(\varphi )\right\}.$$ 
Intuitively, the $s$switch regret ${R}_{s}^{\varphi}(T)$ measures the “informational revenue loss” due to not knowing the demand distributions, while the overall regret ${R}^{\varphi}(T)$ measures the “overall revenue loss” due to not knowing the demand distributions and not being able to switch freely. Clearly, the overall regret ${R}^{\varphi}(T)$ is always larger than the $s$switch regret ${R}_{s}^{\varphi}(T)$. In addition, according to the results in Section id1, we know that for all $s\ge m$,
${R}^{\varphi}(T)$  $=\underset{\bm{Q}}{sup}\left\{{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}({\pi}_{\bm{Q}}^{*}){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}(\varphi )\right\}$  
$\le \underset{\bm{Q}}{sup}\left\{{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}({\pi}_{\bm{Q}}^{*}[s]){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{Q}}(\varphi )\right\}+\underset{\bm{q}}{sup}\left\{{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}({\pi}_{\bm{Q}}^{*}){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}({\pi}_{\bm{Q}}^{*}[s])\right\}$  
$={R}_{s}^{\varphi}(T)+O(\sqrt{T});$ 
and for all $$, there exist $A,\bm{p}$ such that
${R}^{\varphi}(T)$  $=\underset{\bm{q}}{sup}\left\{{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}({\pi}_{\bm{Q}}^{*}){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}(\varphi )\right\}$  
$=\underset{\bm{q}}{sup}\left\{{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}({\pi}_{\bm{Q}}^{*}[s]){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}(\varphi )+{\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}({\pi}_{\bm{Q}}^{*}){\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}}_{\bm{q}}({\pi}_{\bm{Q}}^{*}[s])\right\}$  
$={R}_{s}^{\varphi}(T)+\mathrm{\Omega}(T);$ 
Consider the DLP defined by (theatequationgroupatID), (theatequationgroupatID), (theatequationgroupatID), and (theatequationgroupatID). Let ${\text{DLP}}_{\bm{Q}}$ denote the DLP when its environment is $\bm{Q}$. Let ${\text{\U0001d5a9}}_{\bm{Q}}^{\text{DLP}}(\bm{x})$ denote the objective value of a feasible solution $\bm{x}$ to ${\text{DLP}}_{\bm{Q}}$. Let ${\text{\U0001d5a9}}_{\bm{Q}}^{\text{DLP}}$ denote the optimal objective value of ${\text{DLP}}_{\bm{Q}}$.
We propose a policy that provides an upper bound on both the $s$switch regret and the overall regret. Our policy, called the $S$Switch Balanced Exploration with LPbased Exploitation (SSBELE) policy, is described in Algorithm 2. The policy presents a novel combination of the ingredients from the SSSE policy proposed in SimchiLevi and Xu (2019), the Balanced Exploration policy proposed in Badanidiyuru et al. (2013), and the Tweaked LP policy defined in Algorithm 1.
Input: $K,T,\bm{B},s,m,n,A,\bm{p}$.
Initialization: Calculate $\nu (s,m)=\lfloor \frac{sm1}{K1}\rfloor $. Define ${t}_{0}=0$ and
$${t}_{j}=\lfloor {K}^{1\frac{2{2}^{(i1)}}{2{2}^{\nu (s,m)}}}{T}^{\frac{2{2}^{(i1)}}{2{2}^{\nu (s,m)}}}\rfloor ,\forall j=1,\mathrm{\dots},\nu (s,m)+1.$$ 
Define $\gamma =\frac{{B}_{\mathrm{min}}}{{B}_{\mathrm{min}}+5{a}_{\mathrm{max}}\sqrt{nK\mathrm{log}[mKT]}\mathrm{log}T{t}_{1}}$.
Let ${I}_{1}=\{\bm{Q}\mid 0\le {q}_{j,k}\le 1,\forall j\in [n],k\in [K]\}$.
Let ${T}_{l}$ denote the ending period of epoch $l$ (which will be determined by the algorithm). Let ${T}_{0}=0$.
Let ${z}_{t}$ denote the algorithm’s action at period $t$. Let ${z}_{0}$ be a random action in $\{{\bm{p}}_{1},{\bm{p}}_{2},\mathrm{\dots},{\bm{p}}_{K}\}$.
Policy:
\[email protected]@algorithmic[1]
\FORepoch $l=1,\mathrm{\dots},\nu (s,m)$
\STATE(Skip this step if $l=1$) Let ${n}_{k}({T}_{l1})$ be the total number of periods that price vector ${\bm{p}}_{k}$ is chosen up to period ${T}_{l1}$, and ${\overline{q}}_{j,k}({T}_{l1})$ be the empirical mean demand of product $j$ sold at price vector ${\bm{p}}_{k}$ up to period ${T}_{l1}$. Calculate ${r}_{k}({T}_{l1})=\sqrt{\frac{\mathrm{log}\left[(m+1)KT\right]}{{n}_{k}({T}_{l1})}}$. Update
$${I}_{l}={I}_{l1}\cap \left\{\bm{Q}\right\sum _{j\in [n]}{p}_{j,k}{q}_{j,k}\in [{L}_{k}^{rev}({T}_{l1}),{U}_{k}^{rev}({T}_{l1})],\sum _{j\in [n]}{a}_{ij}{q}_{j,k}\in [{L}_{i,k}^{cost}({T}_{l1}),{U}_{i,k}^{cost}({T}_{l1})]\},$$ 
where
$$\{\begin{array}{cc}{U}_{k}^{rev}({T}_{l1})={\sum}_{j\in [n]}{p}_{j,k}{\overline{q}}_{j,k}({T}_{l1})+{{\bm{p}}_{k}}_{2}{r}_{k}({T}_{l1}),\hfill & \\ {L}_{k}^{rev}({T}_{l1})={\sum}_{j\in [n]}{p}_{j,k}{\overline{q}}_{j,k}({T}_{l1}){{\bm{p}}_{k}}_{2}{r}_{k}({T}_{l1}),\hfill & \end{array}\forall k\in [K],$$ 
$$\{\begin{array}{cc}{U}_{i,k}^{cost}({T}_{l1})={\sum}_{j\in [n]}{a}_{ij}{\overline{q}}_{j,k}({T}_{l1})+{{A}_{i}}_{2}{r}_{k}({T}_{l1}),\hfill & \\ {L}_{i,k}^{cost}({T}_{l1})={\sum}_{j\in [n]}{a}_{ij}{\overline{q}}_{j,k}({T}_{l1}){{A}_{i}}_{2}{r}_{k}({T}_{l1}),\hfill & \end{array}\forall i\in [m],\forall k\in [K].$$ 
Compute the set ${\mathrm{\Delta}}_{l}=\{\bm{x}\mid \exists \bm{Q}\in {I}_{l}\text{such that}\bm{x}{\text{is an optimal solution to DLP}}_{\bm{Q}}\}$. \STATEFor each $i\in [K]$, pick any solution ${\bm{x}}^{l,i}\in {\mathrm{\Delta}}_{l}$ such that ${({\bm{x}}^{l,i})}_{i}\ge \frac{1}{2}{\mathrm{max}}_{{\bm{x}}^{\prime}\in {\mathrm{\Delta}}_{l}}{({\bm{x}}^{\prime})}_{i}$. Let ${N}_{k}^{l}={\sum}_{i=1}^{K}\frac{{t}_{l}{t}_{l1}}{KT}{({\bm{x}}^{l,i})}_{k}$ for all $k\in [K]$. \STATELet ${z}_{{T}_{l1}+1}={z}_{{T}_{l1}}$. Starting from this action, choose each price vector ${\bm{p}}_{k}$ for $\gamma {N}_{k}^{l}$ consecutive periods, $k\in [K]$ (we overlook the rounding issues here, which are easy to fix in regret analysis). Stop the algorithm once time horizon is met or one of the resources is exhausted. \STATEEnd epoch $l$. Mark the last chosen action in epoch $l$ as ${z}_{{T}_{l}}$. \ENDFOR\STATEFor epoch $\nu (s,m)+1$, calculate $\stackrel{~}{\bm{q}}={\left({\overline{q}}_{j,k}({t}_{\nu (s,m)})\right)}_{j\in [n],k\in [K]}$. Solve ${\text{DLP}}_{\stackrel{~}{\bm{Q}}}$ and find an optimal solution to ${\text{DLP}}_{\stackrel{~}{\bm{Q}}}$ with the least number of nonzero variables, ${\bm{x}}_{\stackrel{~}{\bm{Q}}}^{*}\in \mathcal{D}\mathcal{S}{\mathcal{L}}_{\stackrel{~}{\bm{Q}}}$. Let ${N}_{k}^{\nu (s,m)+1}=\frac{(T{t}_{\nu (s,m)})}{T}{({\bm{x}}_{\stackrel{~}{\bm{Q}}}^{*})}_{k}$ for all $k\in [K]$. Let ${z}_{{T}_{\nu (s,m)}+1}={z}_{{T}_{\nu (s,m)}}$. Starting from this action, choose each price vector ${\bm{p}}_{k}$ for $\gamma {N}_{k}^{\nu (s,m)+1}$ consecutive periods, $k\in [K]$ (we overlook the rounding issues here, which are easy to fix in regret analysis). Stop the algorithm once time horizon is met or one of the resources is exhausted. End epoch $\nu (s,m)+1$.
We show that the SSBELE policy is indeed an $s$switch policy and establish the following upper bound on its regret.
Theorem 3
Let $\varphi $ be the SSBELE policy, then $\varphi \mathrm{\in}\mathrm{\Phi}\mathit{}\mathrm{[}s\mathrm{]}$ and
$${R}_{s}^{\varphi}(T)\le {R}^{\varphi}(T)=\stackrel{~}{O}({T}^{\frac{1}{2{2}^{\nu (s,m)}}}),$$ 
where $\nu \mathit{}\mathrm{(}s\mathrm{,}m\mathrm{)}\mathrm{=}\mathrm{\lfloor}\frac{s\mathrm{}m\mathrm{}\mathrm{1}}{K\mathrm{}\mathrm{1}}\mathrm{\rfloor}$.
Let $\stackrel{~}{l}$ be the last epoch in the execution of policy $\varphi $, and let $\tau $ be the last period before the policy $\varphi $ stops. We know that $\tau +1$ is a stopping time and we have $$. Since $\varphi $ makes at most $(K1)(\stackrel{~}{l}1)$ switches before ${T}_{\stackrel{~}{l}1}$ and makes at most $m+1$ switches after ${T}_{\stackrel{~}{l}1}$, its total number of switches is always upper bounded by $(K1)\nu (s,m)+(m+1)\le s$.
We use a coupling argument for the regret analysis. Consider a virtual policy ${\varphi}^{\U0001d5cf}$ that runs under exactly the same demand realization process and acts exactly the same as $\varphi $ until period $\tau $, but keeps running until the end of epoch $\nu (s,m)+1$ regardless of the inventory constraints. Without conflicts to the previously defined notations in Algorithm 2, let ${T}_{l}$ denote the last period of epoch $l$ under policy ${\varphi}^{\U0001d5cf}$ ($l\in [\nu (s,m)+1]$), let ${n}_{k}(t)$ be the total number of periods that price vector ${\bm{p}}_{k}$ is chosen by ${\varphi}^{\U0001d5cf}$ during period 1 to $t$ ($k\in [K],t\in [{T}_{\nu (s,m)+1}]$), and let ${\overline{q}}_{j,k}(t)$ be the average realized demand of product $j$ sold at price vector ${\bm{p}}_{k}$ during period 1 to $t$ ($j\in [n],k\in [K],t\in [T]$). Define the confidence radius as
$${r}_{k}(t)=\sqrt{\frac{\mathrm{log}[(m+1)KT]}{{n}_{k}(t)}},\forall k\in [K],t\in [T].$$ 
Define the clean event $\mathcal{E}$ as
$$\left\{\forall i\in [m],k\in [K],t\in [T],\{\begin{array}{cc}\left{\sum}_{j\in [n]}{a}_{ij}{\overline{q}}_{j,k}(t){\sum}_{j\in [n]}{a}_{ij}{q}_{j,k}\right\le {{A}_{i}}_{2}{r}_{k}(t),\hfill & \\ \left{\sum}_{j\in [n]}{p}_{j,k}{\overline{q}}_{j,k}(t){\sum}_{j\in [n]}{p}_{j,k}{q}_{j,k}\right\le {{\bm{p}}_{k}}_{2}{r}_{k}(t).\hfill & \end{array}\right\}.$$ 
By the Hoeffding’s inequality for general bounded random variables and arguments analogous to Lemma 1.5 in Slivkins (2019), we have $\text{Pr}(\mathcal{E})\ge 1\frac{2}{(m+1)KT}$ under environment $\bm{Q}$ and policy ${\varphi}^{\U0001d5cf}$. Since the clean event happens with very high probability, we can just focus on a clean execution of policy ${\varphi}^{\U0001d5cf}$: an execution in which the clean event holds.
Conditional on the clean event, for all $i\in [m]$ and $l\in [\nu (s,m)+1]$, we have
$\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{\overline{q}}_{j,k}({T}_{l}){n}_{k}({T}_{l})$  
$\le $  $\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{n}_{k}({T}_{l})+{\displaystyle \sum _{k\in [K]}}{{A}_{i}}_{2}{r}_{k}({T}_{l}){n}_{k}({T}_{l})$  
$=$  $\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{n}_{k}({T}_{l})+{{A}_{i}}_{2}\sqrt{\mathrm{log}[(m+1)KT]}{\displaystyle \sum _{k\in [K]}}\sqrt{{n}_{k}({T}_{l})$  
$\le $  $\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{n}_{k}({T}_{l})+{{A}_{i}}_{2}\sqrt{\mathrm{log}[(m+1)KT]}\sqrt{K{T}_{l}},$  (5) 
where the last inequality is by Jensen’s inequality. For notational convenience, define ${\bm{x}}^{\nu (s,m)+1,i}={\bm{x}}_{\stackrel{~}{\bm{Q}}}^{*}$ for all $i\in [K]$. Then we have ${n}_{k}({T}_{l})={\sum}_{r=1}^{l}\gamma {N}_{k}^{r}=\gamma {\sum}_{r=1}^{l}{\sum}_{o=1}^{K}\frac{{t}_{r}{t}_{r1}}{KT}{({\bm{x}}^{r,o})}_{k}$ for all $l\in [\nu (s,m)+1]$. For all $l\in [\nu (s,m)+1],o\in [K]$, let ${\bm{Q}}^{r,o}\in {I}_{l}$ denote an environment such that ${\bm{x}}^{r,o}$ is an optimal solution to ${\text{DLP}}_{{\bm{Q}}^{r,o}}$. Then for all $i\in [m]$ and $l\in [\nu (s,m)+1]$, we have
$\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{n}_{k}({T}_{l})$  
$=$  $\gamma {\displaystyle \sum _{r\in [l]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{1}{KT}}{({\bm{x}}^{r,o})}_{k}$  
$\le $  $\gamma {\displaystyle \sum _{r\in [l]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{k\in [K]}}\left({\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}^{r,o}+2{{A}_{i}}_{2}{r}_{k}({T}_{r1})\right){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{1}{KT}}{({\bm{x}}^{r,o})}_{k}$  
$\le $  $\gamma {\displaystyle \sum _{r\in [l]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{{B}_{i}}{KT}}+\gamma {\displaystyle \sum _{r\in [l]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{2{{A}_{i}}_{2}}{KT}}{\displaystyle \sum _{k\in [K]}}{r}_{k}({T}_{r1}){({\bm{x}}^{r,o})}_{k}$  
$=$  $\gamma {t}_{l}{\displaystyle \frac{{B}_{i}}{T}}+\gamma {\displaystyle \sum _{r\in [l]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{2{{A}_{i}}_{2}}{KT}}{\displaystyle \sum _{k\in [K]}}{r}_{k}({T}_{r1}){({\bm{x}}^{r,o})}_{k},$  (6) 
where the first inequality is by the definition of ${I}_{r}$ and the clean execution of policy ${\varphi}^{\U0001d5cf}$, and the second inequality is by the feasibility of ${\bm{x}}^{l,o}$ to ${\text{DLP}}_{{\bm{Q}}^{r,o}}$. We now bound ${\sum}_{k\in [K]}{r}_{k}({T}_{r1}){({\bm{x}}^{r,o})}_{k}$. Since ${I}_{r}\subset {I}_{r1}\subset \mathrm{\cdots}\subset {I}_{1}$, we have ${\mathrm{\Delta}}_{r}\subset {\mathrm{\Delta}}_{r1}\subset \mathrm{\cdots}\subset {\mathrm{\Delta}}_{1}$. By the specification of Algorithm 2, we have ${({\bm{x}}^{i,k})}_{k}\ge \frac{1}{2}{({\bm{x}}^{r,o})}_{k}$ for all $$. Hence
$${n}_{k}({T}_{r1})=\sum _{i=1}^{r1}\gamma {N}_{k}^{i}\ge \gamma \sum _{i=1}^{r1}\frac{{t}_{i}{t}_{i1}}{KT}{({\bm{x}}^{i,k})}_{k}\ge \gamma \sum _{i=1}^{r1}\frac{{t}_{i}{t}_{i1}}{2KT}{({\bm{x}}^{r,o})}_{k}=\gamma \frac{{t}_{r1}}{2KT}{({\bm{x}}^{r,o})}_{k}.$$ 
Thus
$\sum _{k\in [K]}}{r}_{k}({T}_{r1}){({\bm{x}}^{r,o})}_{k}={\displaystyle \sum _{k\in [K]}}\sqrt{{\displaystyle \frac{\mathrm{log}[(m+1)KT]}{{n}_{k}({T}_{r1})}}}{({\bm{x}}^{r,o})}_{k$  
$\le $  $\sum _{k\in [K]}}\sqrt{{({\bm{x}}^{r,o})}_{k}}\sqrt{{\displaystyle \frac{2KT\mathrm{log}[(m+1)KT]}{\gamma {t}_{r1}}}}\le KT\sqrt{{\displaystyle \frac{2\mathrm{log}[(m+1)KT]}{\gamma {t}_{r1}}}},$  (7) 
where the last inequality is by (theatequationgroupatID) and Jensen’s inequality. Combining (theequationatIDv) and (theequationatIDab), we have
$\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{n}_{k}({T}_{l})\le \gamma {\displaystyle \frac{{B}_{i}}{T}}{t}_{l}+2{{A}_{i}}_{2}K\sqrt{2\gamma \mathrm{log}[(m+1)KT]}{\displaystyle \sum _{r\in [l]}}{\displaystyle \frac{{t}_{r}{t}_{r1}}{\sqrt{{t}_{r1}}}$  
$\le $  $\gamma {\displaystyle \frac{{B}_{i}}{T}}{t}_{l}+\left(4{{A}_{i}}_{2}\sqrt{K\mathrm{log}[mKT]}\mathrm{log}T\right){t}_{1}\le \gamma {\displaystyle \frac{{B}_{i}}{T}}{t}_{l}+\left(4{a}_{\mathrm{max}}\sqrt{nK\mathrm{log}[mKT]}\mathrm{log}T\right){t}_{1}.$  (8) 
Combining (theequationatIDr) and (theequationatIDad), we know that for all $i\in [m]$ and $l=\nu (s,m)+1$,
$$\sum _{k\in [K]}\sum _{j\in [n]}{a}_{ij}{\overline{q}}_{j,k}({T}_{\nu (s,m)+1}){n}_{k}({T}_{\nu (s,m)+1})\le \gamma \frac{{B}_{i}}{T}{t}_{\nu (s,m)+1}+\left(5{{A}_{i}}_{2}\sqrt{K\mathrm{log}[mKT]}\mathrm{log}T\right){t}_{1}={B}_{i}.$$ 
This indicates that conditional on the clean event, policy ${\varphi}^{\U0001d5cf}$ will not violate any inventory constraint. By the coupling relationship between $\varphi $ and ${\varphi}^{\U0001d5cf}$, we know that $\tau ={T}_{\nu (s,m)+1}$ conditional on the clean execution of policy $\varphi $. Thus conditonal on the clean event, the total revenue collected by policy $\varphi $ is
$\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{\overline{q}}_{j,k}({T}_{\nu (s,m)+1}){n}_{k}({T}_{\nu (s,m)+1})$  
$\ge $  $\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}{n}_{k}({T}_{\nu (s,m)+1}){\displaystyle \sum _{k\in [K]}}{{\bm{p}}_{k}}_{2}{r}_{k}({T}_{\nu (s,m)+1}){n}_{k}({T}_{\nu (s,m)+1})$  
$\ge $  $\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}{n}_{k}({T}_{\nu (s,m)+1}){p}_{\mathrm{max}}\sqrt{n\mathrm{log}[(m+1)KT]}\sqrt{K{T}_{\nu (s,m)+1}}.$  (9) 
We now bound ${\sum}_{k\in [K]}{\sum}_{j\in [n]}{p}_{j,k}{q}_{j,k}{n}_{k}({T}_{\nu (s,m)+1})$ conditional on the clean event. We have
$\sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}{n}_{k}({T}_{\nu (s,m)+1})$  
$=$  $\gamma {\displaystyle \sum _{r\in [\nu (s,m)+1]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}{\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{1}{KT}}{({\bm{x}}^{r,o})}_{k}$  
$\ge $  $\gamma {\displaystyle \sum _{r\in [\nu (s,m)+1]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{k\in [K]}}\left({\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}^{r,o}2{{\bm{p}}_{k}}_{2}{r}_{k}({T}_{r1})\right){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{1}{KT}}{({\bm{x}}^{r,o})}_{k}$  
$=$  $\gamma {\displaystyle \sum _{r\in [\nu (s,m)+1]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{{\U0001d5a9}_{{\bm{Q}}^{r,o}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}}{KT}}\gamma {\displaystyle \sum _{r\in [l]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{2{p}_{\mathrm{max}}\sqrt{n}}{KT}}{\displaystyle \sum _{k\in [K]}}{r}_{k}({T}_{r1}){({\bm{x}}^{r,o})}_{k}$  
$\ge $  $\gamma {\U0001d5a9}_{\bm{q}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\gamma {\displaystyle \sum _{r\in [\nu (s,m)+1]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{{\U0001d5a9}_{\bm{q}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}{\U0001d5a9}_{{\bm{q}}^{r,o}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}}{KT}}\left(4{p}_{\mathrm{max}}\sqrt{nK\mathrm{log}[mKT]}\mathrm{log}T\right){t}_{1},$  (10) 
where the last inequality is by (theequationatIDab) and
$\gamma {\displaystyle \sum _{r\in [l]}}({t}_{r}{t}_{r1}){\displaystyle \sum _{o\in [K]}}{\displaystyle \frac{2{p}_{\mathrm{max}}\sqrt{n}}{KT}}{\displaystyle \sum _{k\in [K]}}{r}_{k}({T}_{r1}){({\bm{x}}^{r,o})}_{k}\le \left(4{p}_{\mathrm{max}}\sqrt{nK\mathrm{log}[mKT]}\mathrm{log}T\right){t}_{1}.$ 
Let ${\bm{x}}_{\bm{q}}^{*}$ denote an optimal solution to ${\text{DLP}}_{\bm{q}}$, we have
${\U0001d5a9}_{\bm{Q}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}{\U0001d5a9}_{{\bm{Q}}^{r,o}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\le \left({\U0001d5a9}_{\bm{Q}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}+{\displaystyle \sum _{k\in [K]}}{{\bm{p}}_{k}}_{2}{r}_{k}({T}_{r1}){({\bm{x}}_{\bm{q}}^{*})}_{k}\right)\left({\displaystyle \frac{{\sum}_{k\in [K]}{a}_{\mathrm{max}}\sqrt{n}{r}_{k}({T}_{r1}){({\bm{x}}_{\bm{q}}^{*})}_{k}}{{B}_{\mathrm{min}}+{\sum}_{k\in [K]}{a}_{\mathrm{max}}\sqrt{n}{r}_{k}({T}_{r1}){({\bm{x}}_{\bm{q}}^{*})}_{k}}}\right).$  (11) 
By arguments analogous to (theequationatIDab), we have
$\sum _{k\in [K]}}{r}_{k}({T}_{r1}){({\bm{x}}_{\bm{q}}^{*})}_{k}\le KT\sqrt{{\displaystyle \frac{2\mathrm{log}[(m+1)KT]}{\gamma {t}_{r1}}}}.$  (12) 
Combining (theequationatIDag), (theequationatIDaj), (11), (12), we know that the total revenue collected by the policy $\varphi $ is lower bounded by
$$\gamma {\U0001d5a9}_{\bm{q}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\left(1+\frac{{\U0001d5a9}_{\bm{q}}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}}{{B}_{\mathrm{min}}}\right)5\mathrm{max}\{{a}_{\mathrm{max}},{p}_{\mathrm{max}}\}\sqrt{nK\mathrm{log}[mKT]}\mathrm{log}T{t}_{1}.$$ 
Thus both ${R}_{s}^{\varphi}(T)$ and ${R}^{\varphi}(T)$ are upper bounded by
$$C\sqrt{nK\mathrm{log}[mKT]}\mathrm{log}T{t}_{1}=C\sqrt{nK\mathrm{log}[mKT]}\mathrm{log}T{K}^{1\frac{1}{2{2}^{\nu (s,m)}}}{T}^{\frac{1}{2{2}^{\nu (s,m)}}},$$ 
where $C$ is a constant that only depends on ${a}_{\mathrm{max}}$ and ${p}_{\mathrm{max}}$. $\mathrm{\square}$\@endparenv
We prove a matching lower bound for both the $s$switch regret and the overall regret.
Theorem 4
For any $T\mathrm{\ge}\mathrm{1}\mathrm{,}K\mathrm{\ge}\mathrm{1}\mathrm{,}m\mathrm{\ge}\mathrm{0}\mathrm{,}s\mathrm{\ge}\mathrm{0}$, there exists a distrubitionallyunknown problem instance $\mathrm{I}\mathrm{=}\mathrm{\{}\mathrm{(}m\mathrm{,}n\mathrm{,}K\mathrm{,}\mathbf{p}\mathrm{,}A\mathrm{)}\mathrm{\}}$ such that for all $\varphi \mathrm{\in}\mathrm{\Phi}\mathit{}\mathrm{[}s\mathrm{]}$,
$${R}^{\varphi}(T)\ge {R}_{s}^{\varphi}(T)=\stackrel{~}{\mathrm{\Omega}}({T}^{\frac{1}{2{2}^{\nu (s,m)}}}),$$ 
where $\nu \mathit{}\mathrm{(}m\mathrm{,}s\mathrm{)}\mathrm{=}\mathrm{\lfloor}\frac{s\mathrm{}m\mathrm{}\mathrm{1}}{K\mathrm{}\mathrm{1}}\mathrm{\rfloor}$.
Proof Idea. We construct a problem instance as follows. Let there be $n=m+1$ resources. Let
$$A=\left(\begin{array}{cccc}\hfill 1\hfill & \hfill \mathrm{\cdots}\hfill & \hfill 0\hfill & \hfill 0\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\ddots}\hfill & \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\vdots}\hfill \\ \hfill 0\hfill & \hfill \mathrm{\cdots}\hfill & \hfill 1\hfill & \hfill 1\hfill \end{array}\right)$$ 
so that each product only consumes one resource: product $i\in [m1]$ consumes resource $i$, and products $m$ and $(m+1)$ consume resource $m$. For some $\u03f5>0$, let
$${p}_{j,k}=\{\begin{array}{cc}1+\u03f5,\hfill & \text{if}1\le j=k\le m,\hfill \\ 1\u03f5,\hfill & \text{if}j=k=m+1,\hfill \\ 0,\hfill & \text{otherwise.}\hfill \end{array}$$ 
Define $\beta $ to be the unique solution over $(0,+\mathrm{\infty})$ to:
$$\sum _{i=1}^{m1}\frac{{B}_{i}}{\beta +\u03f5}+\frac{{B}_{m}}{\beta}T=0.$$ 
Let
$${q}_{j,k}=\{\begin{array}{cc}\beta +{\mu}_{j},\hfill & \text{if}1\le j=k\le m+1\hfill \\ 0,\hfill & \text{otherwise,}\hfill \end{array}$$ 
where ${\mu}_{1},\mathrm{\dots},{\mu}_{m+1}$ are small gaps that we will assign different values in the lower bound proof. The main property of the above problem instance is that as long as ${\mu}_{1},\mathrm{\dots},{\mu}_{m+1}$ are all smaller than a constant threshold, no matter how we adjust ${\mu}_{1},\mathrm{\dots},{\mu}_{m+1}$, the resulting ${\text{DLP}}_{\bm{q}}$ is always nondegenerate, i.e., the resulting $\mathrm{\Lambda}$ is $m+1$. Then by Theorem 1, even for a clairvoyant policy that knows $\bm{q}$, it needs to make at least $m$ switches to guarantee a sublinear regret. The rest of the proof is built on the “tracking the covering time” argument, originally proposed in SimchiLevi and Xu (2019). However, our proof require more delicate analysis for the values of different LP, and require us to integrate Theorem 1 into the analysis of the last two stopping times.
Combining Theorem 3 and Theorem 4, we can see that for any switching budget $s$, the optimal (worstcase) regret of the distributionallyunknown network revenue management problem is in the order of $\stackrel{~}{\mathrm{\Theta}}\left({T}^{\frac{1}{2{2}^{\lfloor \frac{sm1}{K1}\rfloor}}}\right)$. This suggests that given a fixed switching budget $s$, an increase in the number of resources $m$ may result in an increase in the order of the optimal regret. To the best our knowledge, this is the first result of such kind that explicitly characterizes how the dimension of the inventory constraints makes a stochastic online learning problem “harder”. In other words, an increase in the number of resources increases the required number of switches to achieve a same regret bound. While both the classical multiarmed bandit problem and the bandits with knapsacks problem (Badanidiyuru et al. 2013) / distributionallyunknown network revenue management problem (Besbes and Zeevi 2012, Ferreira et al. 2018) studied in previous literature essentially exhibit the same optimal regret rate in the order of $\stackrel{~}{\mathrm{\Theta}}(\sqrt{T})$, the order of $T$ is not affected by the dimension of the inventory constraints. Our results indicate an intrinsic separation in the optimal regret bounds, due to the switching budget constraint.
Limited adaptivity, as originally introduced in Dean et al. (2005, 2008) for stochastic packing and stochastic knapsack problems and in Perchet et al. (2016) for stochastic bandit problems, is the hard constraint that one cannot collect feedbacks and adapt to the feedbacks more than a fixed number of times. As a concrete example of limitedadapativity problems, the $M$batched multiarmed bandit problem is a variant of the classical MAB problem where the decision maker must split her learning process into $M$ batches and is only able to observe data from a given batch after the entire batch is completed, see Perchet et al. (2016), Gao et al. (2019).
Our SSBELE policy is not only a limitedswitch policy, but also a limitedadaptivity policy — throughout the entire $T$ periods, it only queries the collected data for at most $\nu (s,m)$ times. Indeed, the policy runs in at most $\nu (s,m)+1$ epochs, and within each epoch the policy’s actions can be predetermined in the beginning of the epoch. Our policy can thus be treated as an $M$batched policy, providing $\stackrel{~}{O}({T}^{\frac{1}{1{2}^{1M}}})$ regret for the $M$batched network revenue management / bandits with knapsacks problems, which generalizes the previous upper bound results for unconstrained batched bandit problems (Perchet et al. 2016, Gao et al. 2019). Moreover, since a regret lower bound for limitedswitch policies is quite strong and always implies a regret lower bound for limitedadaptivity policies (as noted in SimchiLevi and Xu 2019), our lower bound in Theorem 4 directly provides a matching regret lower bound for the $M$batched network revenue management / bandits with knapsacks problems. We thus state the following results:
Proposition 2
For a distributionallyunknown network revenue management problem parameterized by fixed ($m$, $n$, $\mathbf{B}$, $T$, $K$, $\mathbf{p}$), there exists a mapping between the family of limited adaptivity problems and the family of limited switching budget problems, such that following each mapping, the two corresponding problems have the same optimal regret (up to logarithmic factors).
Note that SimchiLevi and Xu (2019) establishes a similar “regret equivalence” result between limited switches and limited adaptivity for the classical MAB problem without inventory constraints. Our regret equivalence result is a generalization of their result to inventoryconstrained settings.
References
 Adelman (2007) Adelman D (2007) Dynamic bid prices in revenue management. Operations Research 55(4):647–661.
 Agrawal et al. (1988) Agrawal R, Hedge M, Teneketzis D (1988) Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control 33(10):899–906.
 Agrawal et al. (1990) Agrawal R, Hegde M, Teneketzis D (1990) Multiarmed bandit problems with multiple plays and switching cost. Stochastics and Stochastic reports 29(4):437–459.
 Alaei et al. (2012) Alaei S, Hajiaghayi M, Liaghat V (2012) Online prophetinequality matching with applications to ad allocation. Proceedings of the 13th ACM Conference on Electronic Commerce, 18–35 (ACM).
 Araman and Caldentey (2009) Araman VF, Caldentey R (2009) Dynamic pricing for nonperishable products with demand learning. Operations research 57(5):1169–1188.
 Arlotto and Gurvich (2019) Arlotto A, Gurvich I (2019) Uniformly bounded regret in the multisecretary problem. Stochastic Systems 9(3):231–260.
 Arlotto and Xie (2018) Arlotto A, Xie X (2018) Logarithmic regret in the dynamic and stochastic knapsack problem. arXiv preprint arXiv:1809.02016 .
 Badanidiyuru et al. (2013) Badanidiyuru A, Kleinberg R, Slivkins A (2013) Bandits with knapsacks. 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 207–216 (IEEE).
 Ball and Queyranne (2009) Ball MO, Queyranne M (2009) Toward robust revenue management: Competitive analysis of online booking. Operations Research 57(4):950–963.
 Balseiro et al. (2019) Balseiro SR, Brown DB, Chen C (2019) Dynamic pricing of relocating resources in large networks. Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, 29–30 (ACM).
 Besbes and Zeevi (2012) Besbes O, Zeevi A (2012) Blind network revenue management. Operations research 60(6):1537–1550.
 Bumpensanti and Wang (2018) Bumpensanti P, Wang H (2018) A resolving heuristic with uniformly bounded loss for network revenue management. arXiv preprint arXiv:1802.06192 .
 CesaBianchi et al. (2013) CesaBianchi N, Dekel O, Shamir O (2013) Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 1160–1168.
 Chen and Chao (2019) Chen B, Chao X (2019) Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions 1–9.
 Chen and Homemde Mello (2010) Chen L, Homemde Mello T (2010) Resolving stochastic programming models for airline revenue management. Annals of Operations Research 177(1):91–114.
 Chen et al. (2015) Chen Q, Jasin S, Duenyas I (2015) Realtime dynamic pricing with minimal and flexible price adjustment. Management Science 62(8):2437–2455.
 Chen and Farias (2013) Chen Y, Farias VF (2013) Simple policies for dynamic pricing with imperfect forecasts. Operations Research 61(3):612–624.
 Chen and Goldberg (2018) Chen Y, Goldberg D (2018) Beating the curse of dimensionality in options pricing and optimal stopping. ArXiv eprints .
 Chen and Shi (2019) Chen Y, Shi C (2019) Network revenue management with online inverse batch gradient descent method. Available at SSRN .
 Cheung et al. (2017) Cheung WC, SimchiLevi D, Wang H (2017) Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6):1722–1731.
 Ciocan and Farias (2012) Ciocan DF, Farias V (2012) Model predictive control for dynamic resource allocation. Mathematics of Operations Research 37(3):501–525.
 Cooper (2002) Cooper WL (2002) Asymptotic behavior of an allocation policy for revenue management. Operations Research 50(4):720–727.
 Dean et al. (2005) Dean BC, Goemans MX, Vondrák J (2005) Adaptivity and approximation for stochastic packing problems. Proceedings of the sixteenth annual ACMSIAM symposium on Discrete algorithms, 395–404 (Society for Industrial and Applied Mathematics).
 Dean et al. (2008) Dean BC, Goemans MX, Vondrák J (2008) Approximating the stochastic knapsack problem: The benefit of adaptivity. Mathematics of Operations Research 33(4):945–964.
 Dütting et al. (2017) Dütting P, Feldman M, Kesselheim T, Lucier B (2017) Prophet inequalities made easy: Stochastic optimization by pricing nonstochastic inputs. 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 540–551 (IEEE).
 Erdelyi and Topaloglu (2011) Erdelyi A, Topaloglu H (2011) Using decomposition methods to solve pricing problems in network revenue management. Journal of Revenue and Pricing Management 10(4):325–343.
 Ferreira et al. (2018) Ferreira KJ, SimchiLevi D, Wang H (2018) Online network revenue management using thompson sampling. Operations research 66(6):1586–1602.
 Gallego et al. (2018) Gallego G, Topaloglu H, et al. (2018) Revenue management and pricing analytics (Springer).
 Gallego and Van Ryzin (1994) Gallego G, Van Ryzin G (1994) Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management science 40(8):999–1020.
 Gallego and Van Ryzin (1997) Gallego G, Van Ryzin G (1997) A multiproduct dynamic pricing problem and its applications to network yield management. Operations research 45(1):24–41.
 Gao et al. (2019) Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multiarmed bandits problem. Advances in Neural Information Processing Systems, 501–511.
 Hajiaghayi et al. (2007) Hajiaghayi MT, Kleinberg R, Sandholm T (2007) Automated online mechanism design and prophet inequalities. AAAI, volume 7, 58–65.
 Jasin (2014) Jasin S (2014) Reoptimization and selfadjusting price control for network revenue management. Operations Research 62(5):1168–1178.
 Jasin and Kumar (2012) Jasin S, Kumar S (2012) A resolving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research 37(2):313–345.
 Jørgensen et al. (2003) Jørgensen S, Taboubi S, Zaccour G (2003) Retail promotions with negative brand image effects: Is cooperation possible? European Journal of Operational Research 150(2):395–405.
 Jun (2004) Jun T (2004) A survey on the bandit problem with switching costs. De Economist 152(4):513–541.
 Kleinberg and Weinberg (2012) Kleinberg R, Weinberg SM (2012) Matroid prophet inequalities. Proceedings of the fortyfourth annual ACM symposium on Theory of computing, 123–136.
 Lattimore and Szepesvári (2018) Lattimore T, Szepesvári C (2018) Bandit algorithms. preprint 28.
 Levy et al. (1998) Levy D, Dutta S, Bergen M, Venable R (1998) Price adjustment at multiproduct retailers. Managerial and decision economics 19(2):81–120.
 Li and Zheng (2019) Li X, Zheng Z (2019) Dynamic pricing with external information and inventory constraint. Available at SSRN .
 Liu and Van Ryzin (2008) Liu Q, Van Ryzin G (2008) On the choicebased linear programming model for network revenue management. Manufacturing & Service Operations Management 10(2):288–310.
 Ma et al. (2018) Ma W, SimchiLevi D, Zhao J (2018) Dynamic pricing under a static calendar. Available at SSRN 3251015 .
 Maglaras and Meissner (2006) Maglaras C, Meissner J (2006) Dynamic pricing strategies for multiproduct revenue management problems. Manufacturing & Service Operations Management 8(2):136–148.
 Mehta et al. (2005) Mehta A, Saberi A, Vazirani U, Vazirani V (2005) Adwords and generalized online matching. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), 264–273 (IEEE).
 Netessine (2006) Netessine S (2006) Dynamic pricing of inventory/capacity with infrequent price changes. European Journal of Operational Research 174(1):553–580.
 Perakis and Singhvi (2019) Perakis G, Singhvi D (2019) Dynamic pricing with unknown nonparametric demand and limited price changes. Available at SSRN 3336949 .
 Perchet et al. (2016) Perchet V, Rigollet P, Chassang S, Snowberg E, et al. (2016) Batched bandit problems. The Annals of Statistics 44(2):660–681.
 Reiman and Wang (2008) Reiman MI, Wang Q (2008) An asymptotically optimal policy for a quantitybased network revenue management problem. Mathematics of Operations Research 33(2):257–282.
 Secomandi (2008) Secomandi N (2008) An analysis of the controlalgorithm resolving issue in inventory and revenue management. Manufacturing & Service Operations Management 10(3):468–483.
 SimchiLevi and Xu (2019) SimchiLevi D, Xu Y (2019) Phase transitions and cyclic phenomena in bandits with switching constraints. Advances in Neural Information Processing Systems, 7521–7530.
 Slivkins (2019) Slivkins A (2019) Introduction to multiarmed bandits. arXiv preprint arXiv:1904.07272 .
 Talluri and Van Ryzin (2006) Talluri KT, Van Ryzin GJ (2006) The theory and practice of revenue management, volume 68 (Springer Science & Business Media).
 Topaloglu (2009) Topaloglu H (2009) Using lagrangian relaxation to compute capacitydependent bid prices in network revenue management. Operations Research 57(3):637–649.
 Truong and Wang (2019) Truong VA, Wang X (2019) Prophet inequality with correlated arrival probabilities, with application to two sided matchings. arXiv preprint arXiv:1901.02552 .
 Vera and Banerjee (2019) Vera A, Banerjee S (2019) The bayesian prophet: A lowregret framework for online decision making. Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, 81–82 (ACM).
 Vera et al. (2019) Vera A, Banerjee S, Gurvich I (2019) Online allocation and pricing: Constant regret via bellman inequalities. arXiv preprint arXiv:1906.06361 .
 Wang et al. (2014) Wang Z, Deng S, Ye Y (2014) Close the gaps: A learningwhiledoing algorithm for singleproduct revenue management problems. Operations Research 62(2):318–331.
 Zbaracki et al. (2004) Zbaracki MJ, Ritson M, Levy D, Dutta S, Bergen M (2004) Managerial and customer costs of price adjustment: direct evidence from industrial markets. Review of Economics and statistics 86(2):514–533.
For any problem instance $\mathcal{I}=(m,n,K,\bm{p},\bm{q},A)$. Any policy $\pi \in \mathrm{\Pi}[\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}2]$ only selects no more than $(\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1)$ many price vectors. $\forall l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]$, let ${\bm{p}}_{{a}_{l}}\in \{{\bm{p}}_{1},{\bm{p}}_{2},\mathrm{\dots},{\bm{p}}_{K}\}$ be the ${l}^{\text{th}}$ price that policy $\pi $ selects; let ${\tau}_{{a}_{l}}$ be the total number of periods that price ${\bm{p}}_{{a}_{l}}$ is offered, under policy $\pi $. Notice that both ${a}_{l}$ and ${\tau}_{{a}_{l}}$ may be random, i.e. which price vector to offer and for how long each price vector is offered they both depend on the random trajectory.
Now denote ${Y}_{j,{a}_{l}}$ as the random amount of product $j$ sold, during the periods that price vector ${a}_{l}$ is offered. We know $\mathbb{E}[{Y}_{j,{a}_{l}}{\tau}_{{a}_{l}}]={\tau}_{{a}_{l}}{q}_{j,{a}_{l}}$. Here ${\tau}_{{a}_{l}}$ is a random amount, so we cannot directly use concentration inequalities. But we can adapt the trick from Slivkins (2019). Suppose there was a tape of length $T$ for each product $j$ and each price vector ${a}_{l}$, with each cell independently sampled from the distribution of ${Q}_{j,{a}_{l}}$. This tape serves as a coupling of the random reward: in each period $t$ if price vector ${a}_{l}$ is offered, we simply generate a demand from the ${t}^{\text{th}}$ cell of the tape. Now we can use Hoeffding inequality:
$$\forall {a}_{l},\forall j,\forall {\tau}_{{a}_{l}},\mathrm{Pr}\left(\left{Y}_{j,{a}_{l}}{\tau}_{{a}_{l}}{q}_{j,{a}_{l}}\right\le \sqrt{3{\tau}_{{a}_{l}}\mathrm{log}T}\right)\ge 1\frac{2}{{T}^{6}}.$$ 
Denote the following event $E$:
$\forall {a}_{l},\forall j,\forall {\tau}_{{a}_{l}},\left{Y}_{j,{a}_{l}}{\tau}_{{a}_{l}}{q}_{j,{a}_{l}}\right\le \sqrt{3{\tau}_{{a}_{l}}\mathrm{log}T}.$  (13) 
Using a union bound we have:
$$\mathrm{Pr}(\forall {a}_{l},\forall j,\forall {\tau}_{{a}_{l}},\left{Y}_{j,{a}_{l}}{\tau}_{{a}_{l}}{q}_{j,{a}_{l}}\right\le \sqrt{3{\tau}_{{a}_{l}}\mathrm{log}T})\ge 1\frac{2}{{T}^{3}}$$ 
because $n,m,K$ are all less than $T$. The happening of such event means that the realized demands are close to the expected demands, suggesting that LP is a good proxy of any policy $\pi \in \mathrm{\Pi}[\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}2]$.
Now if we focus on the usage of any price vector indexed by ${a}_{l}\in [K]$, the total expected revenue is ${\sum}_{j\in [n]}{Y}_{j,{a}_{l}}{p}_{j,{a}_{l}}$. Then the total expected revenue generated during the entire horizon can be upper bounded by
$\sum _{l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]}}{\displaystyle \sum _{j\in [n]}}{Y}_{j,{a}_{l}}{p}_{j,{a}_{l}$  $\le {\displaystyle \sum _{l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]}}{\displaystyle \sum _{j\in [n]}}({q}_{j,{a}_{l}}{\tau}_{{a}_{l}}+\sqrt{3T\mathrm{log}T}){p}_{j,{a}_{l}}$  
$\le ({\displaystyle \sum _{l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]}}{\displaystyle \sum _{j\in [n]}}{q}_{j,{a}_{l}}{\tau}_{{a}_{l}}{p}_{j,{a}_{l}})+{n}^{2}{p}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\sqrt{3T\mathrm{log}T},$ 
where the last inequality is because $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}\le (n+1)$. On the other hand, the consumption of inventory $i$ must not violate the inventory constraint.
$$\sum _{l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]}\sum _{j\in [n]}{Y}_{j,{a}_{l}}{a}_{ij}\le {B}_{i}.$$ 
Lower bounding ${Y}_{j,{a}_{l}}$ by ${q}_{j,{a}_{l}}{\tau}_{{a}_{l}}\sqrt{3T\mathrm{log}T}$ we have
$$\sum _{l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]}\sum _{j\in [n]}{Y}_{j,{a}_{l}}{a}_{ij}\le {B}_{i}+\sum _{l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]}\sum _{j\in [n]}\sqrt{3T\mathrm{log}T}\le {B}_{i}+{n}^{2}\sqrt{3T\mathrm{log}T}.$$ 
Now construct the following LP:
${\U0001d5a9}^{\mathrm{\U0001d5af\U0001d5be\U0001d5cb\U0001d5cd\U0001d5ce\U0001d5cb\U0001d5bb\U0001d5be\U0001d5bd}}=\underset{{\{{x}_{k}\}}_{k\in [K]}}{\mathrm{max}}{\displaystyle \sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}{x}_{k}$  
$\text{s.t.}{\displaystyle \sum _{k\in [K]}}{\displaystyle \sum _{j\in [n]}}{a}_{ij}{q}_{j,k}{x}_{k}$  $\le {B}_{i}+{n}^{2}\sqrt{3T\mathrm{log}T}$  $\forall i\in [m]$  
$\sum _{k\in [K]}}{x}_{k$  $\le T$  
${x}_{k}$  $\ge 0$  $\forall k\in [K].$ 
If we let $T$ and $\bm{B}$ to scale linearly by $\kappa $, then the perturbed term ${n}^{2}\sqrt{3T\mathrm{log}T}$ is negligible, and so the optimal solution is scaled by $\kappa $. This means that if we restrict ourselves to a solution that uses no more than $(\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1)$ nonzero variables, we incur a linear loss, i.e. $\left({\sum}_{l\in [\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1]}{\sum}_{j\in [n]}{q}_{j,{a}_{l}}{\tau}_{{a}_{l}}{p}_{j,{a}_{l}}\right)=(1\mathrm{\Omega}(1)){\U0001d5a9}^{\mathrm{\U0001d5af\U0001d5be\U0001d5cb\U0001d5cd\U0001d5ce\U0001d5cb\U0001d5bb\U0001d5be\U0001d5bd}}$.
Since from each solution ${\bm{x}}^{*}$ of this perturbed LP, we can find a corresponding discounted solution ${\bm{x}}^{*}\cdot (1+\frac{{n}^{2}\sqrt{3T\mathrm{log}T}}{{B}_{\mathrm{min}}})$ that is feasible to the DLP. This suggests that ${\U0001d5a9}^{\mathrm{\U0001d5af\U0001d5be\U0001d5cb\U0001d5cd\U0001d5ce\U0001d5cb\U0001d5bb\U0001d5be\U0001d5bd}}\le {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\cdot (1+\frac{{n}^{2}\sqrt{3T\mathrm{log}T}}{{B}_{\mathrm{min}}})$, because DLP is a maximization problem.
Putting all together, and conditional on event $E$ that happens with probability at least $1\frac{2}{{T}^{3}}$,
$\mathbb{E}[\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )E]$  $\le (1\mathrm{\Omega}(1)){\U0001d5a9}^{\mathrm{\U0001d5af\U0001d5be\U0001d5cb\U0001d5cd\U0001d5ce\U0001d5cb\U0001d5bb\U0001d5be\U0001d5bd}}+{n}^{2}\sqrt{3T\mathrm{log}T}{p}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}$  
$\le (1\mathrm{\Omega}(1)){\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\cdot (1+{\displaystyle \frac{{n}^{2}\sqrt{3T\mathrm{log}T}}{{B}_{\mathrm{min}}}})+{n}^{2}\sqrt{3T\mathrm{log}T}{p}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}$  
$\le {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\mathrm{\Omega}(T),$ 
which suggests that $\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )\le {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}\mathrm{\Omega}(T)$. $\mathrm{\square}$\@endparenv
Let $\pi $ be any policy suggested in Algorithm 1. Let ${\bm{x}}^{*}$ be the associated optimal solution. We prove Theorem 2 by comparing the expected revenue earned by Algorithm 1 against a virtual policy ${\pi}^{\U0001d5cf}$. This virtual policy ${\pi}^{\U0001d5cf}$ mimics Algorithm 1 in steps 1–3. But in step 4, it sets the price vector to be ${\bm{p}}_{\sigma (1)}$ for the first $\gamma \cdot {x}_{\sigma (1)}^{*}$ periods, then ${\bm{p}}_{\sigma (2)}$ for the next $\gamma \cdot {x}_{\sigma (2)}^{*}$ periods, …, ${\bm{p}}_{\sigma (\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab})}$ for the next $\gamma \cdot {x}_{\sigma (\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab})}^{*}$ periods, and finally ${\bm{p}}_{\mathrm{\infty}}$ for the last $(T\gamma \cdot {\sum}_{l=1}^{\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}}{x}_{\sigma (l)}^{*})$ periods. Here ${\bm{p}}_{\mathrm{\infty}}$ is a shutoff price, under which ${Q}_{j}({\bm{p}}_{\mathrm{\infty}})=0,\forall j\in [n]$.
Policy ${\pi}^{\U0001d5cf}$ is virtual because it requires a shutoff price ${\bm{p}}_{\mathrm{\infty}}$ that may or may not be available. Moreover, it requires $\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}$ many price changes, which is more than $(\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}1)$ many changes as suggested in Algorithm 1.
Policy ${\pi}^{\U0001d5cf}$ serves to bridge our analysis. It breaks our Theorem 2 into two inequalities that we will prove separately.
$\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )\ge \mathrm{\U0001d5b1\U0001d5be\U0001d5cf}({\pi}^{\U0001d5cf})\ge \left(12{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\sqrt{{\displaystyle \frac{nT\mathrm{log}T}{{B}_{\mathrm{min}}^{2}}}}{\displaystyle \frac{m}{{T}^{2}}}\right)\cdot {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}}$ 
For any policy $\pi $ as defined in Algorithm 1 and its associated virtual policy ${\pi}^{\U0001d5cf}$, they both solve the same DLP and have the same optimal solution. To prove the first inequality, note that both $\pi $ and ${\pi}^{\U0001d5cf}$ commit to the same prices in the first $\stackrel{~}{T}:=\gamma \cdot {\sum}_{l=1}^{\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}}{x}_{\sigma (l)}^{*}$ time periods, and earns the same revenue following each trajectory of random demand. At the end of period $\stackrel{~}{T}$, policy $\pi $ still commits to ${\bm{p}}_{\sigma (\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab})}$, while policy ${\pi}^{\U0001d5cf}$ makes one change and sets ${\bm{p}}_{\mathrm{\infty}}$. At the end of period $\stackrel{~}{T}$, if the selling horizon has ended due to inventory stockouts, then either policy earns zero revenue, so $\pi $ and ${\pi}^{\U0001d5cf}$ makes no difference. If the selling horizon has not ended and there is remaining inventory for any resource, then policy $\pi $ earns nonnegative revenue, while ${\pi}^{\U0001d5cf}$ earns zero by setting a shutoff price. Following each trajectory of random demand, policy $\pi $ earns more revenue than ${\pi}^{\U0001d5cf}$. As a result, $\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}(\pi )\ge \mathrm{\U0001d5b1\U0001d5be\U0001d5cf}({\pi}^{\U0001d5cf})$.
To prove the second inequality, we introduce the following notations. Let ${\text{\U0001d7d9}}_{t,k},\forall k\in [K],t\in [T]$ be an indicator of whether or not policy ${\pi}^{\U0001d5cf}$ offers price ${\bm{p}}_{k}$ in period $t$.
$$ 
${\text{\U0001d7d9}}_{t,k}$ is deterministic once policy ${\pi}^{\U0001d5cf}$ is determined.
Under policy ${\pi}^{\U0001d5cf}$, following each trajectory of random demand, we define a the length of the effective selling horizon $\tau $ as function of a stopping time ${t}_{0}$:
$\tau =\stackrel{~}{T}\wedge \mathrm{min}\left\{{t}_{0}1\exists i,s.t.{\displaystyle \sum _{t=1}^{{t}_{0}}}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{Q}_{j,k}\cdot {a}_{ij}>{B}_{i}\right\}$ 
The effective selling horizon is the minimum between (i) last period before the cumulative demand of any resource exceeds its initial inventory, and (ii) the last period before policy ${\pi}^{\U0001d5cf}$ switches to the shutoff price.
Let ${C}_{t,i}$ be the remaining inventory of resource $i$ at the end of period $t$. Under this notation, ${C}_{0,i}={B}_{i}$. Note that ${C}_{t,i}$ are random variables, and during the effective selling horizon, inventory updates in the following fashion
$\forall t\in [\tau ],{C}_{t,i}={C}_{t1,i}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{Q}_{j,k}\cdot {a}_{ij}\ge 0$  (14) 
Now we calculate the expected revenue.
$\mathrm{\U0001d5b1\U0001d5be\U0001d5cf}({\pi}^{\U0001d5cf})$  $\ge {\mathbb{E}}_{{Q}_{j,k}({\xi}_{t})}\left[{\displaystyle \sum _{t=1}^{\stackrel{~}{T}}}{\text{\U0001d7d9}}_{\{\forall i,{C}_{t1,i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\}}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{Q}_{j,k}\right]$  
$={\displaystyle \sum _{t=1}^{\stackrel{~}{T}}}\mathrm{Pr}(\forall i,{C}_{t1,i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}){\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}$  
$\ge \mathrm{Pr}(\forall i,{C}_{\stackrel{~}{T},i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}){\displaystyle \sum _{t=1}^{\stackrel{~}{T}}}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{p}_{j,k}{q}_{j,k}$  (15) 
We explain the inequalities. The first inequality is because we only focus on the revenue earned if event $\{\forall i,{C}_{t1,i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\}$ happens, while ignoring the revenue earned if event $\{\forall i,{C}_{t1,i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\}$ does not happen; and when event $\{\forall i,{C}_{t1,i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\}$ does happen, the maximum amount of any inventory $i$ demanded in one single period cannot exceed $n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}$. The first equality is expanding the expectations, where we use the fact that ${\text{\U0001d7d9}}_{\{\forall i,{C}_{t1,i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\}}$ and ${\sum}_{k\in [K]}{\text{\U0001d7d9}}_{t,k}{\sum}_{j\in [n]}{p}_{j,k}{Q}_{j,k}$ are independent, because the indicator is a random event happening up to period $t1$, while the summation term is a random amount happening in period $t$. The third inequality is due to (14), ${C}_{t,i}$ is decreasing in $t$, so ${C}_{t1,i}\ge {C}_{\stackrel{~}{T},i}$.
In this block of inequalities (15), the summation term
$$\sum _{t=1}^{\stackrel{~}{T}}\sum _{k\in [K]}{\text{\U0001d7d9}}_{t,k}\sum _{j\in [n]}{p}_{j,k}{q}_{j,k}=\sum _{l=1}^{\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}}\sum _{k\in \{\sigma (l)=k\}}\gamma {x}_{k}^{*}\sum _{j\in [n]}{p}_{j,k}{q}_{j,k}=\gamma {\U0001d5a9}^{\mathrm{\U0001d5a3\U0001d5ab\U0001d5af}},$$ 
since the indicators ${\text{\U0001d7d9}}_{t,k}$ locate which $k$ counts into this summation. So the next thing we do is to lower bound the probability.
Note that
$\mathbb{E}\left[{\displaystyle \sum _{t=1}^{\stackrel{~}{T}}}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{Q}_{j,k}\cdot {a}_{ij}\right]$  $={\displaystyle \sum _{t=1}^{\stackrel{~}{T}}}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{q}_{j,k}\cdot {a}_{ij}$  
$={\displaystyle \sum _{l=1}^{\mathrm{\U0001d5a3\U0001d5b2\U0001d5ab}}}{\displaystyle \sum _{k\in \{\sigma (l)=k\}}}\gamma {x}_{k}^{*}{\displaystyle \sum _{j\in [n]}}{q}_{j,k}{a}_{ij}$  
$\le \gamma {B}_{i}$  
$$ 
where the last (strict) inequality is because we plug in $\gamma =12{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\sqrt{\frac{nT\mathrm{log}T}{{B}_{\mathrm{min}}^{2}}}$.
This above inequality suggests that for any $i\in [m]$, the expected cumulative demand generated up till period $\stackrel{~}{T}$ is strictly less than ${B}_{i}n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}$. So we can use concentration inequalities.
$\mathrm{Pr}(\forall i,{C}_{\stackrel{~}{T},i}\ge n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}})$  $=1\mathrm{Pr}(\exists i,s.t.{\displaystyle \sum _{t=1}^{\stackrel{~}{T}}}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{Q}_{j,k}\cdot {a}_{ij}\ge {B}_{i}n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}})$  
$\ge 1{\displaystyle \sum _{i\in [m]}}\mathrm{Pr}\left({\displaystyle \sum _{t=1}^{\stackrel{~}{T}}}{\displaystyle \sum _{k\in [K]}}{\text{\U0001d7d9}}_{t,k}{\displaystyle \sum _{j\in [n]}}{Q}_{j,k}\cdot {a}_{ij}\gamma {B}_{i}\ge (1\gamma ){B}_{i}n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\right)$  
$\ge 1{\displaystyle \sum _{i\in [m]}}\mathrm{exp}\left({\displaystyle \frac{2{((1\gamma ){B}_{i}n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}})}^{2}}{{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}^{2}nT}}\right)$  
$\ge 1m\mathrm{exp}\left({\displaystyle \frac{2{((1\gamma ){B}_{\mathrm{min}}n{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}})}^{2}}{{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}^{2}nT}}\right)$  
$\ge 1{\displaystyle \frac{m}{{T}^{2}}}$  (16) 
where the first inequality is due to union bound; the second inequality is due to Hoeffding inequality, ${Q}_{j,k}{a}_{ij}$ is bounded by ${a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}$, and there are no more than $n\cdot T$ such terms; the third inequality is because we lower bound each ${B}_{i}$ by ${B}_{\mathrm{min}}$; the last inequality is when we plug in $1\gamma =2{a}_{\mathrm{\U0001d5c6\U0001d5ba\U0001d5d1}}\sqrt{\frac{nT\mathrm{log}T}{{B}_{\mathrm{min}}^{2}}}$, and we know that $T>n$.