Network Revenue Management with Limited Switches: Known and Unknown Demand Distributions

  • 2020-02-13 18:14:13
  • David Simchi-Levi, Yunzong Xu, Jinglong Zhao
  • 0

Abstract

This work is motivated by a practical concern from our retail partner. Whilethey respect the advantages of dynamic pricing, they must limit the number ofprice changes to be within some constant. We study the classical price-basednetwork revenue management problem, where a retailer has finite initialinventory of multiple resources to sell over a finite time horizon. We considerboth known and unknown distribution settings, and derive policies that have thebest-possible asymptotic performance in both settings. Our results suggest anintrinsic difference between the expected revenue associated with how manyswitches are allowed, which further depends on the number of resources. Ourresults are also the first to show a separation between the regret boundsassociated with different number of resources.

 

Quick Read (beta)

\crefformat

section§#2#1#3 \crefformatsubsection§#2#1#3 \crefformatsubsubsection§#2#1#3


[10pt]

Network Revenue Management with Limited Switches: Known and Unknown Demand Distributions

David Simchi-Levi, Yunzong Xu, Jinglong Zhao

Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139

[email protected], [email protected], [email protected]

Our work is motivated by a common business constraint in the retail industry. While retailers respect the advantages of dynamic pricing and price experimentation, they must limit the number of price changes to be within some range due to various practical reasons.

We study the classical price-based network revenue management problem, where a retailer has finite initial inventory of multiple resources to sell over a finite time horizon. We consider both known and unknown distribution settings, and derive policies that have the best-possible asymptotic performance in both settings. Our results suggest an intrinsic difference between the expected revenue associated with how many switches are allowed: in the distributionally-known setting, a resource-dependent amount of switching budgets is required to achieve sublinear regret; and in the distributionally-unknown setting, a resource-dependent piecewise-constant function that characterize the degree of sublinear regret.

To the best of our knowledge, our results are the first to exhibit a separation in regrets between inventory constrained and unconstrained stochastic online learning problems. Practically, retailers may benefit from our study in designing their budgets of price switches.

 

We consider the classical network revenue management problem. A firm has finite inventory of multiple resources to sell over a finite time horizon. The starting inventory is unreplenishable and exogenously given. The firm can control its sales through sequential decisions on the offered prices, which come from a discrete set of candidates. Its objective is to maximize the cumulated revenue.

We consider two settings when demand is either distributionally-known, or distributionally-unknown; in both cases we assume demand to be stochastic, independent and time homogeneous (stochastic IID demand) over the time horizon. Such settings are well studied in the literature. In the distributionally-known case, see Gallego and Van Ryzin (1997), Jasin (2014); in the distributionally-unknown case, see Besbes and Zeevi (2012), Ferreira et al. (2018). The literature has also studied other settings assuming different demand models, where demand could be adversarial (Mehta et al. 2005, Ball and Queyranne 2009), could be correlated over time (Araman and Caldentey 2009, Truong and Wang 2019), and could be time non-homogeneous (Adelman 2007, Topaloglu 2009, Erdelyi and Topaloglu 2011). But all are beyond the scope of this paper.

In the distributionally-known setting, the firm must trade-off between revenue-centric decisions which maximize immediate expected revenue irrespective of inventory constraints, and inventory-centric decisions which maximize the yield from the remaining inventory. Revenue-centric decisions tend to be myopic and favor the most popular items, while inventory centric decisions tend to be conservative and favor the highly stocked items. Intuitively, the optimal policy alternates between revenue-centric and inventory-centric decisions based on the remaining inventory and time periods.

Moreover, in the distributionally-unknown setting, the firm must also trade-off between exploitation decisions which utilize the learned information to maximize the expected revenue as if it was in the distributionally-known setting, and exploration decisions which discover the demand distributions of the less certain actions, regardless of how rewarding an action is. Exploitation decisions tend to favor the more rewarding items (with respect to inventory constraints), while exploration decisions tend to favor the less discovered items. Intuitively, the optimal policy alternates between exploitation and exploration decisions based on the information learned from the realized demands.

In both settings, the optimal policy has to adjust its decisions and instantaneously switch between actions over the time horizon. However, not all retailers have the infrastructure to query the realized demand in real-time, to adjust its decisions instantaneously, or to switch between actions as freely as possible, because changing the posted prices is too costly for many retailers (Levy et al. 1998, Zbaracki et al. 2004), and frequent price changes may confuse the customers (Jørgensen et al. 2003). A common practice for many firms is that they restrict the number of price changes to be as few as possible; see Netessine (2006), Chen et al. (2015), Cheung et al. (2017), Chen and Chao (2019), Simchi-Levi and Xu (2019), Perakis and Singhvi (2019).

Motivated by this problem, we analyze the asymptotic performance of policies with limited switches on the network revenue management problem with known and unknown demand distribution. In both settings, we show there is an intrinsic difference between the expected revenue earned associated with how many switches are allowed, which further depends on the number of resources.

We consider the time horizon to consist of a discrete number of time periods. We model each product as having discrete “prices points” at which it could be sold. This captures situations where fixed price points have been pre-determined by market standards, e.g. a common menu of prices that end in $9.99: $69.99, $79.99, $99.99. There is quite a difference here from the papers that model each product as having a continuous price range; see Jasin (2014) for distributionally-known setting, and Wang et al. (2014), Chen and Shi (2019), Li and Zheng (2019) for distributionally-unknown setting. Our work requires completely different analytical techniques.

We model the business constraints of limited price changes as a hard constraint. The retailer is initially endowed with a constant number of switching budgets, to change the posted price (or price vector) from one to another. For example, if there is only one product, a sequence of prices ($79.99,$89.99,$79.99,$89.99) uses two distinct prices, and makes three price changes.

We will consider the following two demand models one by one, where the design and analysis of effective policies to the second model are based on understandings of the first model.

  1. 0.

    Distributionally known demand (Section id1): In each time period, a stochastic, independent and time homogeneous demand is triggered from the prices we post.

  2. 0.

    Distributionally unknown demand (Section id1): We only know that the demand is independent and time homogeneous, but we do not have any prior knowledge about the distributions. We have to learn the distributions over time.

For both settings, we can generalize our results to the “Bandits with Knapsacks” model (Badanidiyuru et al. 2013), which allows the per-unit revenue to be random, and further allows for any correlated distributions of the consumption quantity and the revenue.

We focus on the asymptotic performance of policies. We introduce our results in both settings using linear scaling as our asymptotic regime, which is omnipresent in the literature. In the distributionally-known case, see Gallego and Van Ryzin (1997), Liu and Van Ryzin (2008), Jasin (2014), Bumpensanti and Wang (2018); in the distributionally-unknown case, see Besbes and Zeevi (2012), Ferreira et al. (2018), Chen and Shi (2019). Specifically, we assume both the time horizon and all the initial inventory levels are scaled by the same factor κ, while demand distributions in all periods remain the same.

There is a second asymptotic regime in the literature that does not require linear scaling. Instead, it only requires the minimum quantity of any resource (and time periods) to go to infinity. In the distributionally-known case, see Alaei et al. (2012), Ma et al. (2018); in the distributionally-unknown case, see Badanidiyuru et al. (2013).

We adopt the revenue loss as our optimality notion. In the distributionally-known setting, revenue loss is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal policy with infinite switching budgets. The optimal policy with infinite switching budgets could be solved using a dynamic program, yet due to curse of dimensionality people deem it hard to solve (Gallego and Van Ryzin 1997). Revenue loss is one common metric to quantify the optimality gap, among the rich literature of different metrics; see Adelman (2007), Jasin (2014), Chen and Goldberg (2018).

In the distributionally-unknown setting, revenue loss is often referred to as “regret”. We study two regret notions in this setting: the first one, referred to as the s-switch regret, is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal clairvoyant policy with the same switching budget; the second one, referred to as the overall regret, is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal clairvoyant policy with infinite switching budgets. (Here, a clairvoyant policy is endowed with perfect knowledge of the distributions, but not the exact realizations.) While the overall regret is clearly larger than the s-switch regret, interestingly, we show that asymptotically they are of the same order.

We provide full characterization of the revenue loss under two demand models, endowed with different levels of switching budgets.

In the distributionally-known setting (Section id1), there exists a critical switching budget above which we show the revenue loss is in the order of Θ(T); and below which the loss is in the order of Θ(T). Specifically, we show that for any problem instance, there exists an instance dependent quantity 𝖣𝖲𝖫 (“Dimension of the Sparsest LP-solution”, see the precise definition in Section id1), such that any policy that makes at most 𝖣𝖲𝖫-2 switches must incur an Ω(T) revenue loss, while a class of static control policies that make at most 𝖣𝖲𝖫-1 switches can guarantee an O(T) revenue loss. Combining the above results, we show that there is an intrinsic gap on expected revenue loss, if one more critical price change is allowed. See Figure 1.

Figure 1: The intrinsic gap on expected revenue loss in the distributionally-known setting. The vertical axis only shows the order, not the constants.

In the distributionally-unknown setting (Section id1), we show matching upper and lower bounds on the optimal (worst-case) regret in the order of Θ~(T12-2-ν(s,m)), where ν(m,s)=s-m-1K-1. Here s stands for the switching budget, m for the number of resources, and K for the number of feasible price vectors, see Figure 2 for an illustration. Specifically, we propose an S-Switch Balanced Exploration with LP-based Exploitation algorithm that achieves an O~(T12-2-ν(s,m)) regret; and we show a lower bound that any policy must incur an Ω~(T12-2-ν(s,m)) regret in the worst case distributions. Our result generalizes the bounds in Simchi-Levi and Xu (2019), which is a special case when m=0, i.e., no inventory constraints.

Figure 2: Regret as a function of switching budgets in the distributionally-unknown setting.

In the distributionally-unknown setting, our results reveal an interesting separation between the (stochastic) online learning problems with inventory constraints and without inventory constraints. Curiosity has been aroused around this separation for long, that the regret bounds in both the classical multi-armed bandit problems and the bandits with knapsacks problems are in the order of O~(T). Yet people believe that the problem with inventory constraints is “harder” than the unconstrained counterparts. Our work shows that the constrained and unconstrained problems are indeed different, and characterize the dependency on the number of inventory constraints. If we fix all the other problem primitives unchanged and only add in one more inventory constraint, then the regret bounds is going to be larger or equal to before, illustrating that inventory constraints indeed increase the revenue loss – they make the problem “harder”.

We establish equivalence results of limited switching budgets and limited adaptivity, in the context of network revenue management problems. Limited switching budgets, as we have introduced in Section id1, is the hard constraint that one cannot change prices more than a fixed number of times. Limited adaptivity, as originally introduced in Dean et al. (2005, 2008) for stochastic packing and stochastic knapsack problems, is the hard constraint that one cannot collect feedbacks and adapt to the feedbacks more than a fixed number of times. We can think of full adaptivity as the ability to collect immediate feedbacks, or to query the status of the system; and limited adaptivity corresponds to collecting batched and delayed feedbacks. An algorithm with limited switching budgets can keep track of the demands in each period, yet constrained on price changes; while an algorithm with limited adaptivity can prescribe a trajectory of different prices with unlimited changes, yet without knowing the status of the system. When either ability (switching budgets or adaptivity) is constrained, we do not need the other ability more than necessary.

Finally, we introduce new techniques in the design and analysis of asymptotically optimal algorithms, which we will discuss in the following sections when we introduce the proof sketches.

There are two streams of network revenue management problems, the quantity-based problem where customers first arrive and then seller makes accept / reject decisions, and the price-based problem where seller first announces a price and then customers makes purchase decisions. For more discussion, see Talluri and Van Ryzin (2006), Gallego et al. (2018). Quantity-based and price-based problems are very closely related. For example, when there is only one single resource, these two problems are equivalent (Maglaras and Meissner 2006).

There is a rich literature in re-optimization techniques in network revenue management, featuring in frequent or infrequent adjustments in the action space (depending on if the problem is quantity- or price- based). Re-optimization techniques choose careful moments in time to make adjustments (e.g. to change prices in the price-based context), to better respond to the randomness, and to achieve a smaller expected revenue loss. Jasin (2014), in the context of price-based network revenue management, requires O(logT) number of price changes to achieve a O(logT) order of expected revenue loss, compared to the DLP upper bound. For more discussion of the re-optimization techniques, as well as the selection of different upper bounds, see Reiman and Wang (2008), Secomandi (2008), Chen and Homem-de Mello (2010), Ciocan and Farias (2012), Jasin and Kumar (2012), Bumpensanti and Wang (2018) for quantity-based re-optimization methods; see Jasin (2014), Chen et al. (2015) for price-based re-optimization methods; see Chen and Farias (2013), Vera et al. (2019) for re-optimization methods on single-resource dynamic pricing problems; see Arlotto and Xie (2018), Arlotto and Gurvich (2019), Vera and Banerjee (2019) for re-optimization methods on multi-secretary problems and multiple constraint knapsack problems, which are directly related to the network revenue management problem.

There is a rich literature on stochastic multi-armed bandit (MAB) problems with switching costs Agrawal et al. (1988, 1990), Cesa-Bianchi et al. (2013). See Jun (2004) for a survey. Most paper from this stream of research penalizes switching costs into the objective function of the reward calculation. That is, the objective is to minimize a convex combination of the regret incurred, and the switching costs. By contrast, our paper treats the switching budgets as hard constraint that can never be violated.

A closely related paper is Simchi-Levi and Xu (2019), which discusses the stochastic multi-arm bandit problem with the same hard constraint of switching budgets as us. Our work generalize their results by incorporating additional knapsack constraints Badanidiyuru et al. (2013). This extension gives us more modeling flexibility in network revenue management, as well as medical trials, and online advertising, etc. Moreover, our work develops significant new techniques to address new challenges due to the knapsack constraints. Even in the distributionally-known case, a revenue maximizing policy has to switch between prices to avoid consuming some inventory too quickly; by contrast, without knapsack constraints, the best policy always sticks to one single price.

There are several ways to model the business constraint of limited switching budgets. Literature have studied quite a few models of infrequent price changes, which are similar to the switching budgets we consider in our paper. In the distributionally-known setting, Netessine (2006) considers a single-product deterministic demand model, and focuses on the impact of a limited number of price changes. Chen et al. (2015) considers the same network revenue management model as we do, and focuses on asynchronously changing the prices of a small subset of products. In the distributionally unknown setting, Cheung et al. (2017) considers a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy makes limited number of price changes. Chen and Chao (2019) studies a multi-period stochastic joint inventory replenishment and pricing problem with unknown demand and limited price changes. Simchi-Levi and Xu (2019) considers the stochastic multi-armed bandit problem with a general switching constraint. All of the above three papers adopt the idea of switching budgets as this paper does. Yet none of the above paper considers the existence of non-replenishable inventory constraints.

To the best of our knowledge, this paper is the first to characterize the exact dependency of expected revenue loss on limited switching budgets, in the price-based network revenue management setup, with and without the knowledge of demand distributions. In particularly, results in this paper generalize and unify several results appearing in the following papers: Gallego and Van Ryzin (1994), Besbes and Zeevi (2012), Badanidiyuru et al. (2013), Ferreira et al. (2018), Ma et al. (2018), Simchi-Levi and Xu (2019).

Throughout this paper, let be the set of real numbers, and + the set of non-negative real numbers. Let be the set of positive integers. N, let [N]={1,2,,N} denote the set of integers that are no more than N. We use bold font letters for vectors, where we do not explicitly indicate how large the dimension is. Let 𝒑𝖳 be the transpose of any vector 𝒑. For any vector 𝒙K, let 𝒙0=k[K]𝟙xk0 be the L0 norm of 𝒙, i.e. the number of non-zero elements in vector 𝒙. For any real number x, let x+=max{x,0} be the non-negative part of x. For any positive real number x+, let x be the largest integer that is smaller or equal to x; for any non-positive real number x-+, let x=0. We use big O,Ω,Θ notations to hide constant factors, and use O~,Ω~,Θ~ notations to hide both constant factors and logarithmic factors.

We study the classical network revenue management problem. Let there be discrete, finite time horizon with T periods. Time starts from period 1 and ends in period T. Let there be n different products generated by m different resources, each resource endowed with finite initial inventory Bi,i[m]. Let A=[aij]i[m],j[n] be the consumption matrix. Each entry aij+ stands for the amount of inventory i[m] used, if one unit of product j[n] is sold. Let a𝗆𝖺𝗑=maxi,jaij. Each column Aj stands for the consumption vector of each resource. Each column contains at least one nonzero entry. Let Ai denote the i-th row of A.

In each period t, the firm can offer a price for each product from a finite set of K price vectors, which we denote using {𝒑1,𝒑2,,𝒑K}. Here 𝒑k=(p1,k,p2,k,,pn,k) are the prices for products 1,,n, under price vector kK. All the price vectors are fixed and given. Let p𝗆𝖺𝗑=maxj,kpj,k be the maximum price. Given price 𝒑k, the demand for each product j[n] is a given Bernoulli random variable Qj,k:=Qj(𝒑k){0,1}. Denote qj,k:=𝔼[Qj,k] as the expected demand. Our analysis can be directly generalized to general bounded distributions Qj,k[0,1], similar to most prophet inequality and revenue management literature Kleinberg and Weinberg (2012), Alaei et al. (2012), Dütting et al. (2017), Liu and Van Ryzin (2008), Jasin (2014), Ma et al. (2018). For the ease of exposition we describe our results in the Bernoulli setting, yet our proof only relies on the fact that demands are bounded between [0,1].

For each unit of demand generated for product j[n] under price vector 𝒑k,k[K], the firm generates pj,k units of revenue by depleting aij units of each inventory i[m]. If no demand is generated, all the remaining inventory is carried over into the next period. The selling process stops immediately when the total cumulative demand of any resource exceeds its initial inventory.

This modeling framework gives us the flexibility to have two identical consumption vectors sold at two different prices. We simply treat them as different products, i.e. in the airline context, main cabin vs. basic economy.

In Section id1 we have addressed the business constraint that a firm is often prevented from changing the price vectors too many times. Throughout the horizon, we can change the posted price vectors for no more than S times. We treat S as a fixed constant. Intuitively, when S is small we are more constrained, and when S is large we are close to the classical network revenue management problem studied in the literature. Our objective is to maximize the total expected revenue in the face of limited switches.

We adopt the general notation π:n×[T][n] to denote any policy with the full information about stochastic distributions, that suggests which price vector to use given the remaining periods T and remaining inventory 𝑩. For any s, let Π[s] be the set of policies that changes prices for no more than s times. For any s,s such that ss, Π[s]Π[s]. Let Π:=lims+Π[s] be the set of policies with infinite switching budgets (which is the set of all feasible policies when there is no switching constraint). Let 𝖱𝖾𝗏(π) be the expected revenue that policy π generates in expectation. Let π*[s]argmaxπΠ[s]𝖱𝖾𝗏(π) be one of the optimal dynamic policies with switching budget s.

We adopt linear scaling as our asymptotic regime. We assume all the problem parameters =(m,n,K,𝒑,𝒒,A) are fixed, where 𝒑:=(𝒑k)k[K] and 𝒒:=(qj,k)j[n],k[K]. We assume T and Bj,j[m] are at the same comparable order, and larger than either one of m,n,K.

We summarize our results, as well as some closely related results from literature, in Figure 3.

Figure 3: Summary of results in the distributionally-known setting. Results in the literature are on the left side; our results are on the right side.

For any problem instance =(m,n,K,𝒑,𝒒,A), litarature have studied the following deterministic linear program (DLP).

𝖩𝖣𝖫𝖯=max{xk}k[K]k[K]j[n]pj,kqj,kxk (1)
s.t.k[K]j[n]aijqj,kxk Bi i[m] (2)
k[K]xk T (3)
xk 0 k[K] (4)

It is well known that the above DLP serves as an upper bound to any policy πΠ (“DP OPT” as in Figure 3), even an optimal policy with infinite switching budgets. The worst-case gap between the best policy and the DLP upper bound is in the order of Θ(T)11 1 We give some comments on the Ω(T) lower bound for this gap. It is a worst-case lower bound from Bumpensanti and Wang (2018), Ma et al. (2018). Both these results from literature require the feasible price set to have only one price (vector). Since there is only one price, the optimal dynamic programming policy is essentially a fixed, zero-switch policy, which is a special case when 𝖣𝖲𝖫=1. It remains an interesting open question how small could the gap be, between the DLP upper bound and any policy endowed with more than (𝖣𝖲𝖫-1) switching budgets., as shown in Figure 3.

Our results contribute to the literature by first characterizing how a policy’s switching budget affects its revenue loss. The main implication of our results is that for each problem instance, there is an intrinsic gap between the best achievable performance of policies with Λ-2 switches and policies with Λ-1 switches, as shown in Figure 3.

We start with a closer look of the DLP optimal solutions. Let the set of optimal solutions from the DLP be X*=argmax𝒙K{(the-at-equationgroup-at-ID)|(the-at-equationgroup-at-ID),(the-at-equationgroup-at-ID),(the-at-equationgroup-at-ID) are satisfied}. Let 𝖣𝖲𝖫=min{𝒙0|𝒙X*} be the least number of non-zero variables of any optimal solution, namely, the Dimension of Sparsest LP–solutions. Let 𝒟𝒮=argmin{𝒙0|𝒙X*} be the set of such solutions. For any 𝒙*𝒟𝒮, let 𝒦(𝒙*)={k[K]|𝒙*0}[K] be the subset of dimensions that are non-zero in 𝒙*.

Theorem 1

For any problem instance I=(m,n,K,𝐩,𝐪,A), there is an associated DSL number. Any policy πΠ[DSL-2] earns an expected revenue

𝖱𝖾𝗏(π)𝖩𝖣𝖫𝖯-Ω(T)

We outline three key steps here and defer the details of our proof to Appendix id1. We first identify a clean event, such that the realized demands are close to the expected demands that the LP suggests. This clean event happens with high probability (1-2T3). In the second step, conditioning on such event, the maximum amount of revenue we generate is no more than O(T) compared to what the LP suggests; and the minimum amount of inventory demanded is no less than O(T) compared to what the LP suggests, resulting in no more than O(T) of realized revenue. In the third step, we show that the revenue loss from insufficient price changes is scaled in the order of O(T), which dominates the O(T) amount revenue due to randomness. Such clean event analysis, originating from the online learning literature (Badanidiyuru et al. 2013, Lattimore and Szepesvári 2018, Slivkins 2019), is new to the stochastic control literature.

We highlight that the lower bound established in Theorem 1 is a per-instance lower bound, as it holds for every single problem instance. Such bound is much stronger than the worst-case type lower bounds that are omnipresent in the revenue management literature, which only holds for a specific instance, see the following example in our setting.

Proposition 1 (Proposition 4, Bumpensanti and Wang (2018); Lemma 5, Ma et al. (2018))

There exists a problem instance I0=(m,n,K,𝐩,𝐪,A), such that any policy πΠ earns an expected revenue

𝖱𝖾𝗏(π)𝖩𝖣𝖫𝖯-Ω(T)

The static control policy is a well known policy in the literature that has an asymptotic guarantee of O(T) revenue loss. See Gallego and Van Ryzin (1997), Cooper (2002), Maglaras and Meissner (2006), Liu and Van Ryzin (2008). In this section, we modify the well-established static control policy to prove a similar result in our setting, when the selling horizon stops immediately when the total cumulative demand of any resource exceeds its initial inventory. Our setting requires different techniques, because the papers listed above on the static control policy do not explicitly address when the cumulative demand would exceed the initial inventory. When this does happen half-way, the static control policy loses chances to gain revenue in the remaining periods, even if there are still other remaining inventory.

Nonetheless, we tweak the static control policy, so that with high probability the selling horizon never stops earlier than the last period T. This is achieved by selecting the value of γ in the first step of Algorithm 1. Our policy has an asymptotic guarantee of O(logTT)𝖩𝖣𝖫𝖯 revenue loss, which is slightly worse off by a logT factor than the revenue loss in the classical setting. The logT term comes from the Chernoff bound using concentration inequalities. Similar ideas have been used in Hajiaghayi et al. (2007), Ma et al. (2018), Balseiro et al. (2019) to prove asymptotic results in similar but different settings.

Algorithm 1 Tweaked LP policy

Input: =(m,n,K,𝒑,𝒒,A).

Policy:

\[email protected]@algorithmic

[1] \STATEDefine γ=1-2a𝗆𝖺𝗑nTlogTBmin2. \STATESolve the DLP as defined by (the-at-equationgroup-at-ID), (the-at-equationgroup-at-ID), (the-at-equationgroup-at-ID), and (the-at-equationgroup-at-ID). Find an optimal solution with the least number of non-zero variables, 𝒙*𝒟𝒮. \STATEArbitrarily choose any permutation σ:[𝖣𝖲𝖫]𝒦(𝒙*) from all (𝖣𝖲𝖫)! possibilities. \STATESet the price vector to be 𝒑σ(1) for the first γxσ(1)* periods, then 𝒑σ(2) for the next γxσ(2)* periods, …, and finally 𝒑σ(𝖣𝖲𝖫) for the last T-γl=1𝖣𝖲𝖫-1xσ(l)* periods (we assume that xk,k[K] are integers, because rounding issues incur a revenue loss of at most (mmaxk𝒑k𝖳𝒒k), which is negligible in an asymptotic regime).

We explain the third step permutation. Suppose 𝒦(𝒙*)={1,3,4}. In this case, 𝖣𝖲𝖫=3 and there are 6 permutations. There are 6 possible policies as suggested in Definition 1. Some of these policies have better performance than others. But in an asymptotic regime, they all have sublinear regret, compared to the DLP upper bound.

Theorem 2

Any policy π as defined in Algorithm 1 earns an expected revenue

𝖱𝖾𝗏(π)𝖩𝖣𝖫𝖯(1-O(logTT))=𝖩𝖣𝖫𝖯-O~(T).

We outline two key steps here and defer the details of our proof to Appendix id1. In the first step, we show that with high probability, the selling horizon never stops earlier than the last period T. Second, conditioning on this high probability event, the expected revenue is at least γ fraction of the LP objective. Putting two steps together we know the total revenue loss is in the order of 1-γ.

Combining Theorem 2 with Theorem 1, we identify an intrinsic performance gap when the switching budget is around a critical threshold Λ-1. See Figure 3. Combining Theorem 2 with Proposition 1, we know that the Tweaked LP policy is asymptotically optimal (in the sense that it achieves the optimal revenue loss rate relative to the DLP upper bound), using only a few switches.

We keep studying the model described in Section id1, in the distributionally-unknown setting. Given price 𝒑k, the demand for each product j[n] is an unknown Bernoulli22 2 All our results can be directly extended to general unknown bounded random variables without further steps. random variable, Qj,k:=Qj(𝒑k){0,1}. That is, the expected value qj,k:=𝔼[Qj,k] is unknown and has to be sequentially learned over time. Our analysis can be directly generalized to general bounded distributions Qj,k[0,1], similar to most online learning literature (Lattimore and Szepesvári 2018, Slivkins 2019). The main trade-off we would like to address in this setting is between exploration and exploitation. Let 𝒒=(qj,k)j[n],k[K] denote the true “mean demand matrix” that is unknown to the seller. We also call 𝑸 an “environment” The seller wants to learn the true environment while concurrently maximizing the expected total revenue over T periods.

We adopt the following asymptotic regime that is also considered in Besbes and Zeevi (2012) and Ferreira et al. (2018): (1) the problem parameters (m,n,K,𝒑,A) are fixed, and (2) T and Bj,j[m] are at the same comparable order, and much larger than either one of (m,n,K,𝒑,A). In other words, Bi=Θ(T) for all i[m], while m,n,K,𝒑,A=O(1). Without loss of generality, we assume that m+1K (otherwise there are always redundant resource constraints that can be eliminated).

Let ϕ denote any non-anticipating learning policy, and ϕt[k] denote the price vector chosen by policy ϕ at period t[T]. More formally, ϕt establishes a probability kernel acting from the space of historical actions and observations to the space of actions at period t. For any s, let Φ[s] be the set of learning policies that change price vectors for no more than s times almost surely for all 𝑸. For any s,s such that ss, Φ[s]Φ[s]. Let Φ:=lims+Φ[s] be the set of learning policies with infinite switching budgets. Let 𝖱𝖾𝗏𝑸(ϕ) be the expected revenue that a learning policy ϕ generates under environment 𝑸.

Let π denote any clairvoyant policy with the full knowledge of the true environment 𝑸. For any s, let Π𝑸[s] be the set of clairvoyant policies that change price vectors for no more than s times under the true environment 𝑸. For any s,s such that ss, Π𝑸[s]Π𝑸[s]. Let Π𝑸:=lims+Π𝑸[s] be the set of clairvoyant policies with infinite switching budgets. Let 𝖱𝖾𝗏𝑸(π) be the expected revenue that a clairvoyant policy πΠ𝑸 generates under environment 𝑸. Let π𝑸*[s]=argsupπΠ𝑸[s]𝖱𝖾𝗏(π) be the optimal clairvoyant policy with an switching budget of s.

The performance of an s-switch learning policy ϕΦ[s] is measured against the performance of the optimal s-switch clairvoyant policy π𝑸*[s]. Specifically, we define the s-switch regret of a learning policy ϕΦ[s] as the worst-case difference between the expected revenue of the optimal s-switch clairvoyant policy π𝑸*[s] and the expected revenue of policy ϕ:

Rsϕ(T)=sup𝑸{𝖱𝖾𝗏𝑸(π𝑸*[s])-𝖱𝖾𝗏𝑸(ϕ)}.

We also measure the performance of policy ϕ against the performance of the optimal unlimited-switch clairvoyant policy π𝑸*:=π𝑸*[]. Specifically, we define the overall regret of a learning policy ϕΦ[s] as the worst-case difference between the expected revenue of the optimal unlimited-switch clairvoyant policy π𝑸* and the expected revenue of policy ϕ:

Rϕ(T)=sup𝑸{𝖱𝖾𝗏𝑸(π𝑸*)-𝖱𝖾𝗏𝑸(ϕ)}.

Intuitively, the s-switch regret Rsϕ(T) measures the “informational revenue loss” due to not knowing the demand distributions, while the overall regret Rϕ(T) measures the “overall revenue loss” due to not knowing the demand distributions and not being able to switch freely. Clearly, the overall regret Rϕ(T) is always larger than the s-switch regret Rsϕ(T). In addition, according to the results in Section id1, we know that for all sm,

Rϕ(T) =sup𝑸{𝖱𝖾𝗏𝑸(π𝑸*)-𝖱𝖾𝗏𝑸(ϕ)}
sup𝑸{𝖱𝖾𝗏𝑸(π𝑸*[s])-𝖱𝖾𝗏𝑸(ϕ)}+sup𝒒{𝖱𝖾𝗏𝒒(π𝑸*)-𝖱𝖾𝗏𝒒(π𝑸*[s])}
=Rsϕ(T)+O(T);

and for all s<m, there exist A,𝒑 such that

Rϕ(T) =sup𝒒{𝖱𝖾𝗏𝒒(π𝑸*)-𝖱𝖾𝗏𝒒(ϕ)}
=sup𝒒{𝖱𝖾𝗏𝒒(π𝑸*[s])-𝖱𝖾𝗏𝒒(ϕ)+𝖱𝖾𝗏𝒒(π𝑸*)-𝖱𝖾𝗏𝒒(π𝑸*[s])}
=Rsϕ(T)+Ω(T);

Consider the DLP defined by (the-at-equationgroup-at-ID), (the-at-equationgroup-at-ID), (the-at-equationgroup-at-ID), and (the-at-equationgroup-at-ID). Let DLP𝑸 denote the DLP when its environment is 𝑸. Let 𝖩𝑸DLP(𝒙) denote the objective value of a feasible solution 𝒙 to DLP𝑸. Let 𝖩𝑸DLP denote the optimal objective value of DLP𝑸.

We propose a policy that provides an upper bound on both the s-switch regret and the overall regret. Our policy, called the S-Switch Balanced Exploration with LP-based Exploitation (SS-BELE) policy, is described in Algorithm 2. The policy presents a novel combination of the ingredients from the SS-SE policy proposed in Simchi-Levi and Xu (2019), the Balanced Exploration policy proposed in Badanidiyuru et al. (2013), and the Tweaked LP policy defined in Algorithm 1.

Algorithm 2 S-Switch Balanced Exploration with LP-based Exploitation (SS-BELE)

Input: K,T,𝑩,s,m,n,A,𝒑.

Initialization: Calculate ν(s,m)=s-m-1K-1. Define t0=0 and

tj=K1-2-2-(i-1)2-2-ν(s,m)T2-2-(i-1)2-2-ν(s,m),j=1,,ν(s,m)+1.

Define γ=BminBmin+5amaxnKlog[mKT]logTt1. Let I1={𝑸0qj,k1,j[n],k[K]}. Let Tl denote the ending period of epoch l (which will be determined by the algorithm). Let T0=0. Let zt denote the algorithm’s action at period t. Let z0 be a random action in {𝒑1,𝒑2,,𝒑K}.
Policy: \[email protected]@algorithmic[1] \FORepoch l=1,,ν(s,m) \STATE(Skip this step if l=1) Let nk(Tl-1) be the total number of periods that price vector 𝒑k is chosen up to period Tl-1, and q¯j,k(Tl-1) be the empirical mean demand of product j sold at price vector 𝒑k up to period Tl-1. Calculate rk(Tl-1)=log[(m+1)KT]nk(Tl-1). Update

Il=Il-1{𝑸|j[n]pj,kqj,k[Lkrev(Tl-1),Ukrev(Tl-1)],j[n]aijqj,k[Li,kcost(Tl-1),Ui,kcost(Tl-1)]},

where

{Ukrev(Tl-1)=j[n]pj,kq¯j,k(Tl-1)+||𝒑k||2rk(Tl-1),Lkrev(Tl-1)=j[n]pj,kq¯j,k(Tl-1)-||𝒑k||2rk(Tl-1),k[K],
{Ui,kcost(Tl-1)=j[n]aijq¯j,k(Tl-1)+||Ai||2rk(Tl-1),Li,kcost(Tl-1)=j[n]aijq¯j,k(Tl-1)-||Ai||2rk(Tl-1),i[m],k[K].
\STATE

Compute the set Δl={𝒙𝑸Il such that 𝒙 is an optimal solution to DLP𝑸}.  \STATEFor each i[K], pick any solution 𝒙l,iΔl such that (𝒙l,i)i12max𝒙Δl(𝒙)i. Let Nkl=i=1Ktl-tl-1KT(𝒙l,i)k for all k[K]. \STATELet zTl-1+1=zTl-1. Starting from this action, choose each price vector 𝒑k for γNkl consecutive periods, k[K] (we overlook the rounding issues here, which are easy to fix in regret analysis). Stop the algorithm once time horizon is met or one of the resources is exhausted. \STATEEnd epoch l. Mark the last chosen action in epoch l as zTl. \ENDFOR\STATEFor epoch ν(s,m)+1, calculate 𝒒~=(q¯j,k(tν(s,m)))j[n],k[K]. Solve DLP𝑸~ and find an optimal solution to DLP𝑸~ with the least number of non-zero variables, 𝒙𝑸~*𝒟𝒮𝑸~. Let Nkν(s,m)+1=(T-tν(s,m))T(𝒙𝑸~*)k for all k[K]. Let zTν(s,m)+1=zTν(s,m). Starting from this action, choose each price vector 𝒑k for γNkν(s,m)+1 consecutive periods, k[K] (we overlook the rounding issues here, which are easy to fix in regret analysis). Stop the algorithm once time horizon is met or one of the resources is exhausted. End epoch ν(s,m)+1.

We show that the SS-BELE policy is indeed an s-switch policy and establish the following upper bound on its regret.

Theorem 3

Let ϕ be the SS-BELE policy, then ϕΦ[s] and

Rsϕ(T)Rϕ(T)=O~(T12-2-ν(s,m)),

where ν(s,m)=s-m-1K-1.

\@trivlist

Let l~ be the last epoch in the execution of policy ϕ, and let τ be the last period before the policy ϕ stops. We know that τ+1 is a stopping time and we have Tl~-1<τtl~T. Since ϕ makes at most (K-1)(l~-1) switches before Tl~-1 and makes at most m+1 switches after Tl~-1, its total number of switches is always upper bounded by (K-1)ν(s,m)+(m+1)s.

We use a coupling argument for the regret analysis. Consider a virtual policy ϕ𝗏 that runs under exactly the same demand realization process and acts exactly the same as ϕ until period τ, but keeps running until the end of epoch ν(s,m)+1 regardless of the inventory constraints. Without conflicts to the previously defined notations in Algorithm 2, let Tl denote the last period of epoch l under policy ϕ𝗏 (l[ν(s,m)+1]), let nk(t) be the total number of periods that price vector 𝒑k is chosen by ϕ𝗏 during period 1 to t (k[K],t[Tν(s,m)+1]), and let q¯j,k(t) be the average realized demand of product j sold at price vector 𝒑k during period 1 to t (j[n],k[K],t[T]). Define the confidence radius as

rk(t)=log[(m+1)KT]nk(t),k[K],t[T].

Define the clean event as

{i[m],k[K],t[T],{|j[n]aijq¯j,k(t)-j[n]aijqj,k|||Ai||2rk(t),|j[n]pj,kq¯j,k(t)-j[n]pj,kqj,k|||𝒑k||2rk(t).}.

By the Hoeffding’s inequality for general bounded random variables and arguments analogous to Lemma 1.5 in Slivkins (2019), we have Pr()1-2(m+1)KT under environment 𝑸 and policy ϕ𝗏. Since the clean event happens with very high probability, we can just focus on a clean execution of policy ϕ𝗏: an execution in which the clean event holds.

Conditional on the clean event, for all i[m] and l[ν(s,m)+1], we have

k[K]j[n]aijq¯j,k(Tl)nk(Tl)
k[K]j[n]aijqj,knk(Tl)+k[K]||Ai||2rk(Tl)nk(Tl)
= k[K]j[n]aijqj,knk(Tl)+||Ai||2log[(m+1)KT]k[K]nk(Tl)
k[K]j[n]aijqj,knk(Tl)+||Ai||2log[(m+1)KT]KTl, (5)

where the last inequality is by Jensen’s inequality. For notational convenience, define 𝒙ν(s,m)+1,i=𝒙𝑸~* for all i[K]. Then we have nk(Tl)=r=1lγNkr=γr=1lo=1Ktr-tr-1KT(𝒙r,o)k for all l[ν(s,m)+1]. For all l[ν(s,m)+1],o[K], let 𝑸r,oIl denote an environment such that 𝒙r,o is an optimal solution to DLP𝑸r,o. Then for all i[m] and l[ν(s,m)+1], we have

k[K]j[n]aijqj,knk(Tl)
= γr[l](tr-tr-1)k[K]j[n]aijqj,ko[K]1KT(𝒙r,o)k
γr[l](tr-tr-1)k[K](j[n]aijqj,kr,o+2||Ai||2rk(Tr-1))o[K]1KT(𝒙r,o)k
γr[l](tr-tr-1)o[K]BiKT+γr[l](tr-tr-1)o[K]2||Ai||2KTk[K]rk(Tr-1)(𝒙r,o)k
= γtlBiT+γr[l](tr-tr-1)o[K]2||Ai||2KTk[K]rk(Tr-1)(𝒙r,o)k, (6)

where the first inequality is by the definition of Ir and the clean execution of policy ϕ𝗏, and the second inequality is by the feasibility of 𝒙l,o to DLP𝑸r,o. We now bound k[K]rk(Tr-1)(𝒙r,o)k. Since IrIr-1I1, we have ΔrΔr-1Δ1. By the specification of Algorithm 2, we have (𝒙i,k)k12(𝒙r,o)k for all i<r,o[K],k[K]. Hence

nk(Tr-1)=i=1r-1γNkiγi=1r-1ti-ti-1KT(𝒙i,k)kγi=1r-1ti-ti-12KT(𝒙r,o)k=γtr-12KT(𝒙r,o)k.

Thus

k[K]rk(Tr-1)(𝒙r,o)k=k[K]log[(m+1)KT]nk(Tr-1)(𝒙r,o)k
k[K](𝒙r,o)k2KTlog[(m+1)KT]γtr-1KT2log[(m+1)KT]γtr-1, (7)

where the last inequality is by (the-at-equationgroup-at-ID) and Jensen’s inequality. Combining (theequation-at-IDv) and (theequation-at-IDab), we have

k[K]j[n]aijqj,knk(Tl)γBiTtl+2||Ai||2K2γlog[(m+1)KT]r[l]tr-tr-1tr-1
γBiTtl+(4||Ai||2Klog[mKT]logT)t1γBiTtl+(4amaxnKlog[mKT]logT)t1. (8)

Combining (theequation-at-IDr) and (theequation-at-IDad), we know that for all i[m] and l=ν(s,m)+1,

k[K]j[n]aijq¯j,k(Tν(s,m)+1)nk(Tν(s,m)+1)γBiTtν(s,m)+1+(5||Ai||2Klog[mKT]logT)t1=Bi.

This indicates that conditional on the clean event, policy ϕ𝗏 will not violate any inventory constraint. By the coupling relationship between ϕ and ϕ𝗏, we know that τ=Tν(s,m)+1 conditional on the clean execution of policy ϕ. Thus conditonal on the clean event, the total revenue collected by policy ϕ is

k[K]j[n]pj,kq¯j,k(Tν(s,m)+1)nk(Tν(s,m)+1)
k[K]j[n]pj,kqj,knk(Tν(s,m)+1)-k[K]||𝒑k||2rk(Tν(s,m)+1)nk(Tν(s,m)+1)
k[K]j[n]pj,kqj,knk(Tν(s,m)+1)-pmaxnlog[(m+1)KT]KTν(s,m)+1. (9)

We now bound k[K]j[n]pj,kqj,knk(Tν(s,m)+1) conditional on the clean event. We have

k[K]j[n]pj,kqj,knk(Tν(s,m)+1)
= γr[ν(s,m)+1](tr-tr-1)k[K]j[n]pj,kqj,ko[K]1KT(𝒙r,o)k
γr[ν(s,m)+1](tr-tr-1)k[K](j[n]pj,kqj,kr,o-2||𝒑k||2rk(Tr-1))o[K]1KT(𝒙r,o)k
= γr[ν(s,m)+1](tr-tr-1)o[K]𝖩𝑸r,o𝖣𝖫𝖯KT-γr[l](tr-tr-1)o[K]2pmaxnKTk[K]rk(Tr-1)(𝒙r,o)k
γ𝖩𝒒𝖣𝖫𝖯-γr[ν(s,m)+1](tr-tr-1)o[K]𝖩𝒒𝖣𝖫𝖯-𝖩𝒒r,o𝖣𝖫𝖯KT-(4pmaxnKlog[mKT]logT)t1, (10)

where the last inequality is by (theequation-at-IDab) and

γr[l](tr-tr-1)o[K]2pmaxnKTk[K]rk(Tr-1)(𝒙r,o)k(4pmaxnKlog[mKT]logT)t1.

Let 𝒙𝒒* denote an optimal solution to DLP𝒒, we have

𝖩𝑸𝖣𝖫𝖯-𝖩𝑸r,o𝖣𝖫𝖯(𝖩𝑸𝖣𝖫𝖯+k[K]||𝒑k||2rk(Tr-1)(𝒙𝒒*)k)(k[K]amaxnrk(Tr-1)(𝒙𝒒*)kBmin+k[K]amaxnrk(Tr-1)(𝒙𝒒*)k). (11)

By arguments analogous to (theequation-at-IDab), we have

k[K]rk(Tr-1)(𝒙𝒒*)kKT2log[(m+1)KT]γtr-1. (12)

Combining (theequation-at-IDag), (theequation-at-IDaj), (11), (12), we know that the total revenue collected by the policy ϕ is lower bounded by

γ𝖩𝒒𝖣𝖫𝖯-(1+𝖩𝒒𝖣𝖫𝖯Bmin)5max{amax,pmax}nKlog[mKT]logTt1.

Thus both Rsϕ(T) and Rϕ(T) are upper bounded by

CnKlog[mKT]logTt1=CnKlog[mKT]logTK1-12-2-ν(s,m)T12-2-ν(s,m),

where C is a constant that only depends on amax and pmax.  \@endparenv

We prove a matching lower bound for both the s-switch regret and the overall regret.

Theorem 4

For any T1,K1,m0,s0, there exists a distrubitionally-unknown problem instance I={(m,n,K,𝐩,A)} such that for all ϕΦ[s],

Rϕ(T)Rsϕ(T)=Ω~(T12-2-ν(s,m)),

where ν(m,s)=s-m-1K-1.

Proof Idea. We construct a problem instance as follows. Let there be n=m+1 resources. Let

A=(100011)

so that each product only consumes one resource: product i[m-1] consumes resource i, and products m and (m+1) consume resource m. For some ϵ>0, let

pj,k={1+ϵ,if 1j=km,1-ϵ,if j=k=m+1,0,otherwise.

Define β to be the unique solution over (0,+) to:

i=1m-1Biβ+ϵ+Bmβ-T=0.

Let

qj,k={β+μj,if 1j=km+10,otherwise,

where μ1,,μm+1 are small gaps that we will assign different values in the lower bound proof. The main property of the above problem instance is that as long as μ1,,μm+1 are all smaller than a constant threshold, no matter how we adjust μ1,,μm+1, the resulting DLP𝒒 is always non-degenerate, i.e., the resulting Λ is m+1. Then by Theorem 1, even for a clairvoyant policy that knows 𝒒, it needs to make at least m switches to guarantee a sublinear regret. The rest of the proof is built on the “tracking the covering time” argument, originally proposed in Simchi-Levi and Xu (2019). However, our proof require more delicate analysis for the values of different LP, and require us to integrate Theorem 1 into the analysis of the last two stopping times.

Combining Theorem 3 and Theorem 4, we can see that for any switching budget s, the optimal (worst-case) regret of the distributionally-unknown network revenue management problem is in the order of Θ~(T12-2-s-m-1K-1). This suggests that given a fixed switching budget s, an increase in the number of resources m may result in an increase in the order of the optimal regret. To the best our knowledge, this is the first result of such kind that explicitly characterizes how the dimension of the inventory constraints makes a stochastic online learning problem “harder”. In other words, an increase in the number of resources increases the required number of switches to achieve a same regret bound. While both the classical multi-armed bandit problem and the bandits with knapsacks problem (Badanidiyuru et al. 2013) / distributionally-unknown network revenue management problem (Besbes and Zeevi 2012, Ferreira et al. 2018) studied in previous literature essentially exhibit the same optimal regret rate in the order of Θ~(T), the order of T is not affected by the dimension of the inventory constraints. Our results indicate an intrinsic separation in the optimal regret bounds, due to the switching budget constraint.

Limited adaptivity, as originally introduced in Dean et al. (2005, 2008) for stochastic packing and stochastic knapsack problems and in Perchet et al. (2016) for stochastic bandit problems, is the hard constraint that one cannot collect feedbacks and adapt to the feedbacks more than a fixed number of times. As a concrete example of limited-adapativity problems, the M-batched multi-armed bandit problem is a variant of the classical MAB problem where the decision maker must split her learning process into M batches and is only able to observe data from a given batch after the entire batch is completed, see Perchet et al. (2016), Gao et al. (2019).

Our SS-BELE policy is not only a limited-switch policy, but also a limited-adaptivity policy — throughout the entire T periods, it only queries the collected data for at most ν(s,m) times. Indeed, the policy runs in at most ν(s,m)+1 epochs, and within each epoch the policy’s actions can be pre-determined in the beginning of the epoch. Our policy can thus be treated as an M-batched policy, providing O~(T11-21-M) regret for the M-batched network revenue management / bandits with knapsacks problems, which generalizes the previous upper bound results for unconstrained batched bandit problems (Perchet et al. 2016, Gao et al. 2019). Moreover, since a regret lower bound for limited-switch policies is quite strong and always implies a regret lower bound for limited-adaptivity policies (as noted in Simchi-Levi and Xu 2019), our lower bound in Theorem 4 directly provides a matching regret lower bound for the M-batched network revenue management / bandits with knapsacks problems. We thus state the following results:

Proposition 2

For a distributionally-unknown network revenue management problem parameterized by fixed (m, n, 𝐁, T, K, 𝐩), there exists a mapping between the family of limited adaptivity problems and the family of limited switching budget problems, such that following each mapping, the two corresponding problems have the same optimal regret (up to logarithmic factors).

Note that Simchi-Levi and Xu (2019) establishes a similar “regret equivalence” result between limited switches and limited adaptivity for the classical MAB problem without inventory constraints. Our regret equivalence result is a generalization of their result to inventory-constrained settings.

References

  • Adelman (2007) Adelman D (2007) Dynamic bid prices in revenue management. Operations Research 55(4):647–661.
  • Agrawal et al. (1988) Agrawal R, Hedge M, Teneketzis D (1988) Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control 33(10):899–906.
  • Agrawal et al. (1990) Agrawal R, Hegde M, Teneketzis D (1990) Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic reports 29(4):437–459.
  • Alaei et al. (2012) Alaei S, Hajiaghayi M, Liaghat V (2012) Online prophet-inequality matching with applications to ad allocation. Proceedings of the 13th ACM Conference on Electronic Commerce, 18–35 (ACM).
  • Araman and Caldentey (2009) Araman VF, Caldentey R (2009) Dynamic pricing for nonperishable products with demand learning. Operations research 57(5):1169–1188.
  • Arlotto and Gurvich (2019) Arlotto A, Gurvich I (2019) Uniformly bounded regret in the multisecretary problem. Stochastic Systems 9(3):231–260.
  • Arlotto and Xie (2018) Arlotto A, Xie X (2018) Logarithmic regret in the dynamic and stochastic knapsack problem. arXiv preprint arXiv:1809.02016 .
  • Badanidiyuru et al. (2013) Badanidiyuru A, Kleinberg R, Slivkins A (2013) Bandits with knapsacks. 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 207–216 (IEEE).
  • Ball and Queyranne (2009) Ball MO, Queyranne M (2009) Toward robust revenue management: Competitive analysis of online booking. Operations Research 57(4):950–963.
  • Balseiro et al. (2019) Balseiro SR, Brown DB, Chen C (2019) Dynamic pricing of relocating resources in large networks. Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, 29–30 (ACM).
  • Besbes and Zeevi (2012) Besbes O, Zeevi A (2012) Blind network revenue management. Operations research 60(6):1537–1550.
  • Bumpensanti and Wang (2018) Bumpensanti P, Wang H (2018) A re-solving heuristic with uniformly bounded loss for network revenue management. arXiv preprint arXiv:1802.06192 .
  • Cesa-Bianchi et al. (2013) Cesa-Bianchi N, Dekel O, Shamir O (2013) Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 1160–1168.
  • Chen and Chao (2019) Chen B, Chao X (2019) Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions 1–9.
  • Chen and Homem-de Mello (2010) Chen L, Homem-de Mello T (2010) Re-solving stochastic programming models for airline revenue management. Annals of Operations Research 177(1):91–114.
  • Chen et al. (2015) Chen Q, Jasin S, Duenyas I (2015) Real-time dynamic pricing with minimal and flexible price adjustment. Management Science 62(8):2437–2455.
  • Chen and Farias (2013) Chen Y, Farias VF (2013) Simple policies for dynamic pricing with imperfect forecasts. Operations Research 61(3):612–624.
  • Chen and Goldberg (2018) Chen Y, Goldberg D (2018) Beating the curse of dimensionality in options pricing and optimal stopping. ArXiv e-prints .
  • Chen and Shi (2019) Chen Y, Shi C (2019) Network revenue management with online inverse batch gradient descent method. Available at SSRN .
  • Cheung et al. (2017) Cheung WC, Simchi-Levi D, Wang H (2017) Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6):1722–1731.
  • Ciocan and Farias (2012) Ciocan DF, Farias V (2012) Model predictive control for dynamic resource allocation. Mathematics of Operations Research 37(3):501–525.
  • Cooper (2002) Cooper WL (2002) Asymptotic behavior of an allocation policy for revenue management. Operations Research 50(4):720–727.
  • Dean et al. (2005) Dean BC, Goemans MX, Vondrák J (2005) Adaptivity and approximation for stochastic packing problems. Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, 395–404 (Society for Industrial and Applied Mathematics).
  • Dean et al. (2008) Dean BC, Goemans MX, Vondrák J (2008) Approximating the stochastic knapsack problem: The benefit of adaptivity. Mathematics of Operations Research 33(4):945–964.
  • Dütting et al. (2017) Dütting P, Feldman M, Kesselheim T, Lucier B (2017) Prophet inequalities made easy: Stochastic optimization by pricing non-stochastic inputs. 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 540–551 (IEEE).
  • Erdelyi and Topaloglu (2011) Erdelyi A, Topaloglu H (2011) Using decomposition methods to solve pricing problems in network revenue management. Journal of Revenue and Pricing Management 10(4):325–343.
  • Ferreira et al. (2018) Ferreira KJ, Simchi-Levi D, Wang H (2018) Online network revenue management using thompson sampling. Operations research 66(6):1586–1602.
  • Gallego et al. (2018) Gallego G, Topaloglu H, et al. (2018) Revenue management and pricing analytics (Springer).
  • Gallego and Van Ryzin (1994) Gallego G, Van Ryzin G (1994) Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management science 40(8):999–1020.
  • Gallego and Van Ryzin (1997) Gallego G, Van Ryzin G (1997) A multiproduct dynamic pricing problem and its applications to network yield management. Operations research 45(1):24–41.
  • Gao et al. (2019) Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 501–511.
  • Hajiaghayi et al. (2007) Hajiaghayi MT, Kleinberg R, Sandholm T (2007) Automated online mechanism design and prophet inequalities. AAAI, volume 7, 58–65.
  • Jasin (2014) Jasin S (2014) Reoptimization and self-adjusting price control for network revenue management. Operations Research 62(5):1168–1178.
  • Jasin and Kumar (2012) Jasin S, Kumar S (2012) A re-solving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research 37(2):313–345.
  • Jørgensen et al. (2003) Jørgensen S, Taboubi S, Zaccour G (2003) Retail promotions with negative brand image effects: Is cooperation possible? European Journal of Operational Research 150(2):395–405.
  • Jun (2004) Jun T (2004) A survey on the bandit problem with switching costs. De Economist 152(4):513–541.
  • Kleinberg and Weinberg (2012) Kleinberg R, Weinberg SM (2012) Matroid prophet inequalities. Proceedings of the forty-fourth annual ACM symposium on Theory of computing, 123–136.
  • Lattimore and Szepesvári (2018) Lattimore T, Szepesvári C (2018) Bandit algorithms. preprint 28.
  • Levy et al. (1998) Levy D, Dutta S, Bergen M, Venable R (1998) Price adjustment at multiproduct retailers. Managerial and decision economics 19(2):81–120.
  • Li and Zheng (2019) Li X, Zheng Z (2019) Dynamic pricing with external information and inventory constraint. Available at SSRN .
  • Liu and Van Ryzin (2008) Liu Q, Van Ryzin G (2008) On the choice-based linear programming model for network revenue management. Manufacturing & Service Operations Management 10(2):288–310.
  • Ma et al. (2018) Ma W, Simchi-Levi D, Zhao J (2018) Dynamic pricing under a static calendar. Available at SSRN 3251015 .
  • Maglaras and Meissner (2006) Maglaras C, Meissner J (2006) Dynamic pricing strategies for multiproduct revenue management problems. Manufacturing & Service Operations Management 8(2):136–148.
  • Mehta et al. (2005) Mehta A, Saberi A, Vazirani U, Vazirani V (2005) Adwords and generalized on-line matching. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), 264–273 (IEEE).
  • Netessine (2006) Netessine S (2006) Dynamic pricing of inventory/capacity with infrequent price changes. European Journal of Operational Research 174(1):553–580.
  • Perakis and Singhvi (2019) Perakis G, Singhvi D (2019) Dynamic pricing with unknown non-parametric demand and limited price changes. Available at SSRN 3336949 .
  • Perchet et al. (2016) Perchet V, Rigollet P, Chassang S, Snowberg E, et al. (2016) Batched bandit problems. The Annals of Statistics 44(2):660–681.
  • Reiman and Wang (2008) Reiman MI, Wang Q (2008) An asymptotically optimal policy for a quantity-based network revenue management problem. Mathematics of Operations Research 33(2):257–282.
  • Secomandi (2008) Secomandi N (2008) An analysis of the control-algorithm re-solving issue in inventory and revenue management. Manufacturing & Service Operations Management 10(3):468–483.
  • Simchi-Levi and Xu (2019) Simchi-Levi D, Xu Y (2019) Phase transitions and cyclic phenomena in bandits with switching constraints. Advances in Neural Information Processing Systems, 7521–7530.
  • Slivkins (2019) Slivkins A (2019) Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272 .
  • Talluri and Van Ryzin (2006) Talluri KT, Van Ryzin GJ (2006) The theory and practice of revenue management, volume 68 (Springer Science & Business Media).
  • Topaloglu (2009) Topaloglu H (2009) Using lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Operations Research 57(3):637–649.
  • Truong and Wang (2019) Truong VA, Wang X (2019) Prophet inequality with correlated arrival probabilities, with application to two sided matchings. arXiv preprint arXiv:1901.02552 .
  • Vera and Banerjee (2019) Vera A, Banerjee S (2019) The bayesian prophet: A low-regret framework for online decision making. Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, 81–82 (ACM).
  • Vera et al. (2019) Vera A, Banerjee S, Gurvich I (2019) Online allocation and pricing: Constant regret via bellman inequalities. arXiv preprint arXiv:1906.06361 .
  • Wang et al. (2014) Wang Z, Deng S, Ye Y (2014) Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research 62(2):318–331.
  • Zbaracki et al. (2004) Zbaracki MJ, Ritson M, Levy D, Dutta S, Bergen M (2004) Managerial and customer costs of price adjustment: direct evidence from industrial markets. Review of Economics and statistics 86(2):514–533.
\@trivlist

For any problem instance =(m,n,K,𝒑,𝒒,A). Any policy πΠ[𝖣𝖲𝖫-2] only selects no more than (𝖣𝖲𝖫-1) many price vectors. l[𝖣𝖲𝖫-1], let 𝒑al{𝒑1,𝒑2,,𝒑K} be the lth price that policy π selects; let τal be the total number of periods that price 𝒑al is offered, under policy π. Notice that both al and τal may be random, i.e. which price vector to offer and for how long each price vector is offered they both depend on the random trajectory.

Now denote Yj,al as the random amount of product j sold, during the periods that price vector al is offered. We know 𝔼[Yj,al|τal]=τalqj,al. Here τal is a random amount, so we cannot directly use concentration inequalities. But we can adapt the trick from Slivkins (2019). Suppose there was a tape of length T for each product j and each price vector al, with each cell independently sampled from the distribution of Qj,al. This tape serves as a coupling of the random reward: in each period t if price vector al is offered, we simply generate a demand from the tth cell of the tape. Now we can use Hoeffding inequality:

al,j,τal,Pr(|Yj,al-τalqj,al|3τallogT)1-2T6.

Denote the following event E:

al,j,τal,|Yj,al-τalqj,al|3τallogT. (13)

Using a union bound we have:

Pr(al,j,τal,|Yj,al-τalqj,al|3τallogT)1-2T3

because n,m,K are all less than T. The happening of such event means that the realized demands are close to the expected demands, suggesting that LP is a good proxy of any policy πΠ[𝖣𝖲𝖫-2].

Now if we focus on the usage of any price vector indexed by al[K], the total expected revenue is j[n]Yj,alpj,al. Then the total expected revenue generated during the entire horizon can be upper bounded by

l[𝖣𝖲𝖫-1]j[n]Yj,alpj,al l[𝖣𝖲𝖫-1]j[n](qj,alτal+3TlogT)pj,al
(l[𝖣𝖲𝖫-1]j[n]qj,alτalpj,al)+n2p𝗆𝖺𝗑3TlogT,

where the last inequality is because 𝖣𝖲𝖫(n+1). On the other hand, the consumption of inventory i must not violate the inventory constraint.

l[𝖣𝖲𝖫-1]j[n]Yj,alaijBi.

Lower bounding Yj,al by qj,alτal-3TlogT we have

l[𝖣𝖲𝖫-1]j[n]Yj,alaijBi+l[𝖣𝖲𝖫-1]j[n]3TlogTBi+n23TlogT.

Now construct the following LP:

𝖩𝖯𝖾𝗋𝗍𝗎𝗋𝖻𝖾𝖽=max{xk}k[K]k[K]j[n]pj,kqj,kxk
s.t.k[K]j[n]aijqj,kxk Bi+n23TlogT i[m]
k[K]xk T
xk 0 k[K].

If we let T and 𝑩 to scale linearly by κ, then the perturbed term n23TlogT is negligible, and so the optimal solution is scaled by κ. This means that if we restrict ourselves to a solution that uses no more than (𝖣𝖲𝖫-1) non-zero variables, we incur a linear loss, i.e. (l[𝖣𝖲𝖫-1]j[n]qj,alτalpj,al)=(1-Ω(1))𝖩𝖯𝖾𝗋𝗍𝗎𝗋𝖻𝖾𝖽.

Since from each solution 𝒙* of this perturbed LP, we can find a corresponding discounted solution 𝒙*(1+n23TlogTBmin) that is feasible to the DLP. This suggests that 𝖩𝖯𝖾𝗋𝗍𝗎𝗋𝖻𝖾𝖽𝖩𝖣𝖫𝖯(1+n23TlogTBmin), because DLP is a maximization problem.

Putting all together, and conditional on event E that happens with probability at least 1-2T3,

𝔼[𝖱𝖾𝗏(π)|E] (1-Ω(1))𝖩𝖯𝖾𝗋𝗍𝗎𝗋𝖻𝖾𝖽+n23TlogTp𝗆𝖺𝗑
(1-Ω(1))𝖩𝖣𝖫𝖯(1+n23TlogTBmin)+n23TlogTp𝗆𝖺𝗑
𝖩𝖣𝖫𝖯-Ω(T),

which suggests that 𝖱𝖾𝗏(π)𝖩𝖣𝖫𝖯-Ω(T). \@endparenv

\@trivlist

Let π be any policy suggested in Algorithm 1. Let 𝒙* be the associated optimal solution. We prove Theorem 2 by comparing the expected revenue earned by Algorithm 1 against a virtual policy π𝗏. This virtual policy π𝗏 mimics Algorithm 1 in steps 1–3. But in step 4, it sets the price vector to be 𝒑σ(1) for the first γxσ(1)* periods, then 𝒑σ(2) for the next γxσ(2)* periods, …, 𝒑σ(𝖣𝖲𝖫) for the next γxσ(𝖣𝖲𝖫)* periods, and finally 𝒑 for the last (T-γl=1𝖣𝖲𝖫xσ(l)*) periods. Here 𝒑 is a shut-off price, under which Qj(𝒑)=0,j[n].

Policy π𝗏 is virtual because it requires a shut-off price 𝒑 that may or may not be available. Moreover, it requires 𝖣𝖲𝖫 many price changes, which is more than (𝖣𝖲𝖫-1) many changes as suggested in Algorithm 1.

Policy π𝗏 serves to bridge our analysis. It breaks our Theorem 2 into two inequalities that we will prove separately.

𝖱𝖾𝗏(π)𝖱𝖾𝗏(π𝗏)(1-2a𝗆𝖺𝗑nTlogTBmin2-mT2)𝖩𝖣𝖫𝖯

For any policy π as defined in Algorithm 1 and its associated virtual policy π𝗏, they both solve the same DLP and have the same optimal solution. To prove the first inequality, note that both π and π𝗏 commit to the same prices in the first T~:=γl=1𝖣𝖲𝖫xσ(l)* time periods, and earns the same revenue following each trajectory of random demand. At the end of period T~, policy π still commits to 𝒑σ(𝖣𝖲𝖫), while policy π𝗏 makes one change and sets 𝒑. At the end of period T~, if the selling horizon has ended due to inventory stock-outs, then either policy earns zero revenue, so π and π𝗏 makes no difference. If the selling horizon has not ended and there is remaining inventory for any resource, then policy π earns non-negative revenue, while π𝗏 earns zero by setting a shut-off price. Following each trajectory of random demand, policy π earns more revenue than π𝗏. As a result, 𝖱𝖾𝗏(π)𝖱𝖾𝗏(π𝗏).

To prove the second inequality, we introduce the following notations. Let 𝟙t,k,k[K],t[T] be an indicator of whether or not policy π𝗏 offers price 𝒑k in period t.

𝟙t,k={1,if k=σ(1),tγxσ(1)*;1,if 1<l0𝖣𝖲𝖫,s.t.σ(l0)=k,γl=1l0-1xσ(l)*<tγl=1l0xσ(l)*;0,otherwise

𝟙t,k is deterministic once policy π𝗏 is determined.

Under policy π𝗏, following each trajectory of random demand, we define a the length of the effective selling horizon τ as function of a stopping time t0:

τ=T~min{t0-1|i,s.t.t=1t0k[K]𝟙t,kj[n]Qj,kaij>Bi}

The effective selling horizon is the minimum between (i) last period before the cumulative demand of any resource exceeds its initial inventory, and (ii) the last period before policy π𝗏 switches to the shut-off price.

Let Ct,i be the remaining inventory of resource i at the end of period t. Under this notation, C0,i=Bi. Note that Ct,i are random variables, and during the effective selling horizon, inventory updates in the following fashion

t[τ],Ct,i=Ct-1,i-k[K]𝟙t,kj[n]Qj,kaij0 (14)

Now we calculate the expected revenue.

𝖱𝖾𝗏(π𝗏) 𝔼Qj,k(ξt)[t=1T~𝟙{i,Ct-1,ina𝗆𝖺𝗑}k[K]𝟙t,kj[n]pj,kQj,k]
=t=1T~Pr(i,Ct-1,ina𝗆𝖺𝗑)k[K]𝟙t,kj[n]pj,kqj,k
Pr(i,CT~,ina𝗆𝖺𝗑)t=1T~k[K]𝟙t,kj[n]pj,kqj,k (15)

We explain the inequalities. The first inequality is because we only focus on the revenue earned if event {i,Ct-1,ina𝗆𝖺𝗑} happens, while ignoring the revenue earned if event {i,Ct-1,ina𝗆𝖺𝗑} does not happen; and when event {i,Ct-1,ina𝗆𝖺𝗑} does happen, the maximum amount of any inventory i demanded in one single period cannot exceed na𝗆𝖺𝗑. The first equality is expanding the expectations, where we use the fact that 𝟙{i,Ct-1,ina𝗆𝖺𝗑} and k[K]𝟙t,kj[n]pj,kQj,k are independent, because the indicator is a random event happening up to period t-1, while the summation term is a random amount happening in period t. The third inequality is due to (14), Ct,i is decreasing in t, so Ct-1,iCT~,i.

In this block of inequalities (15), the summation term

t=1T~k[K]𝟙t,kj[n]pj,kqj,k=l=1𝖣𝖲𝖫k{σ(l)=k}γxk*j[n]pj,kqj,k=γ𝖩𝖣𝖫𝖯,

since the indicators 𝟙t,k locate which k counts into this summation. So the next thing we do is to lower bound the probability.

Note that

𝔼[t=1T~k[K]𝟙t,kj[n]Qj,kaij] =t=1T~k[K]𝟙t,kj[n]qj,kaij
=l=1𝖣𝖲𝖫k{σ(l)=k}γxk*j[n]qj,kaij
γBi
<Bi-na𝗆𝖺𝗑

where the last (strict) inequality is because we plug in γ=1-2a𝗆𝖺𝗑nTlogTBmin2.

This above inequality suggests that for any i[m], the expected cumulative demand generated up till period T~ is strictly less than Bi-na𝗆𝖺𝗑. So we can use concentration inequalities.

Pr(i,CT~,ina𝗆𝖺𝗑) =1-Pr(i,s.t.t=1T~k[K]𝟙t,kj[n]Qj,kaijBi-na𝗆𝖺𝗑)
1-i[m]Pr(t=1T~k[K]𝟙t,kj[n]Qj,kaij-γBi(1-γ)Bi-na𝗆𝖺𝗑)
1-i[m]exp(-2((1-γ)Bi-na𝗆𝖺𝗑)2a𝗆𝖺𝗑2nT)
1-mexp(-2((1-γ)Bmin-na𝗆𝖺𝗑)2a𝗆𝖺𝗑2nT)
1-mT2 (16)

where the first inequality is due to union bound; the second inequality is due to Hoeffding inequality, Qj,kaij is bounded by a𝗆𝖺𝗑, and there are no more than nT such terms; the third inequality is because we lower bound each Bi by Bmin; the last inequality is when we plug in 1-γ=2a𝗆𝖺𝗑nTlogTBmin2, and we know that T>n.

Putting (16) into (15) we finish the proof. \@endparenv