This work is motivated by a practical concern from our retail partner. Whilethey respect the advantages of dynamic pricing, they must limit the number ofprice changes to be within some constant. We study the classical price-basednetwork revenue management problem, where a retailer has finite initialinventory of multiple resources to sell over a finite time horizon. We considerboth known and unknown distribution settings, and derive policies that have thebest-possible asymptotic performance in both settings. Our results suggest anintrinsic difference between the expected revenue associated with how manyswitches are allowed, which further depends on the number of resources. Ourresults are also the first to show a separation between the regret boundsassociated with different number of resources.
Quick Read (beta)
section§#2#1#3 \crefformatsubsection§#2#1#3 \crefformatsubsubsection§#2#1#3
Network Revenue Management with Limited Switches: Known and Unknown Demand Distributions
David Simchi-Levi, Yunzong Xu, Jinglong Zhao
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139
Our work is motivated by a common business constraint in the retail industry. While retailers respect the advantages of dynamic pricing and price experimentation, they must limit the number of price changes to be within some range due to various practical reasons.
We study the classical price-based network revenue management problem, where a retailer has finite initial inventory of multiple resources to sell over a finite time horizon. We consider both known and unknown distribution settings, and derive policies that have the best-possible asymptotic performance in both settings. Our results suggest an intrinsic difference between the expected revenue associated with how many switches are allowed: in the distributionally-known setting, a resource-dependent amount of switching budgets is required to achieve sublinear regret; and in the distributionally-unknown setting, a resource-dependent piecewise-constant function that characterize the degree of sublinear regret.
To the best of our knowledge, our results are the first to exhibit a separation in regrets between inventory constrained and unconstrained stochastic online learning problems. Practically, retailers may benefit from our study in designing their budgets of price switches.
We consider the classical network revenue management problem. A firm has finite inventory of multiple resources to sell over a finite time horizon. The starting inventory is unreplenishable and exogenously given. The firm can control its sales through sequential decisions on the offered prices, which come from a discrete set of candidates. Its objective is to maximize the cumulated revenue.
We consider two settings when demand is either distributionally-known, or distributionally-unknown; in both cases we assume demand to be stochastic, independent and time homogeneous (stochastic IID demand) over the time horizon. Such settings are well studied in the literature. In the distributionally-known case, see Gallego and Van Ryzin (1997), Jasin (2014); in the distributionally-unknown case, see Besbes and Zeevi (2012), Ferreira et al. (2018). The literature has also studied other settings assuming different demand models, where demand could be adversarial (Mehta et al. 2005, Ball and Queyranne 2009), could be correlated over time (Araman and Caldentey 2009, Truong and Wang 2019), and could be time non-homogeneous (Adelman 2007, Topaloglu 2009, Erdelyi and Topaloglu 2011). But all are beyond the scope of this paper.
In the distributionally-known setting, the firm must trade-off between revenue-centric decisions which maximize immediate expected revenue irrespective of inventory constraints, and inventory-centric decisions which maximize the yield from the remaining inventory. Revenue-centric decisions tend to be myopic and favor the most popular items, while inventory centric decisions tend to be conservative and favor the highly stocked items. Intuitively, the optimal policy alternates between revenue-centric and inventory-centric decisions based on the remaining inventory and time periods.
Moreover, in the distributionally-unknown setting, the firm must also trade-off between exploitation decisions which utilize the learned information to maximize the expected revenue as if it was in the distributionally-known setting, and exploration decisions which discover the demand distributions of the less certain actions, regardless of how rewarding an action is. Exploitation decisions tend to favor the more rewarding items (with respect to inventory constraints), while exploration decisions tend to favor the less discovered items. Intuitively, the optimal policy alternates between exploitation and exploration decisions based on the information learned from the realized demands.
In both settings, the optimal policy has to adjust its decisions and instantaneously switch between actions over the time horizon. However, not all retailers have the infrastructure to query the realized demand in real-time, to adjust its decisions instantaneously, or to switch between actions as freely as possible, because changing the posted prices is too costly for many retailers (Levy et al. 1998, Zbaracki et al. 2004), and frequent price changes may confuse the customers (Jørgensen et al. 2003). A common practice for many firms is that they restrict the number of price changes to be as few as possible; see Netessine (2006), Chen et al. (2015), Cheung et al. (2017), Chen and Chao (2019), Simchi-Levi and Xu (2019), Perakis and Singhvi (2019).
Motivated by this problem, we analyze the asymptotic performance of policies with limited switches on the network revenue management problem with known and unknown demand distribution. In both settings, we show there is an intrinsic difference between the expected revenue earned associated with how many switches are allowed, which further depends on the number of resources.
We consider the time horizon to consist of a discrete number of time periods. We model each product as having discrete “prices points” at which it could be sold. This captures situations where fixed price points have been pre-determined by market standards, e.g. a common menu of prices that end in $9.99: $69.99, $79.99, $99.99. There is quite a difference here from the papers that model each product as having a continuous price range; see Jasin (2014) for distributionally-known setting, and Wang et al. (2014), Chen and Shi (2019), Li and Zheng (2019) for distributionally-unknown setting. Our work requires completely different analytical techniques.
We model the business constraints of limited price changes as a hard constraint. The retailer is initially endowed with a constant number of switching budgets, to change the posted price (or price vector) from one to another. For example, if there is only one product, a sequence of prices uses two distinct prices, and makes three price changes.
We will consider the following two demand models one by one, where the design and analysis of effective policies to the second model are based on understandings of the first model.
Distributionally known demand (Section id1): In each time period, a stochastic, independent and time homogeneous demand is triggered from the prices we post.
Distributionally unknown demand (Section id1): We only know that the demand is independent and time homogeneous, but we do not have any prior knowledge about the distributions. We have to learn the distributions over time.
For both settings, we can generalize our results to the “Bandits with Knapsacks” model (Badanidiyuru et al. 2013), which allows the per-unit revenue to be random, and further allows for any correlated distributions of the consumption quantity and the revenue.
We focus on the asymptotic performance of policies. We introduce our results in both settings using linear scaling as our asymptotic regime, which is omnipresent in the literature. In the distributionally-known case, see Gallego and Van Ryzin (1997), Liu and Van Ryzin (2008), Jasin (2014), Bumpensanti and Wang (2018); in the distributionally-unknown case, see Besbes and Zeevi (2012), Ferreira et al. (2018), Chen and Shi (2019). Specifically, we assume both the time horizon and all the initial inventory levels are scaled by the same factor , while demand distributions in all periods remain the same.
There is a second asymptotic regime in the literature that does not require linear scaling. Instead, it only requires the minimum quantity of any resource (and time periods) to go to infinity. In the distributionally-known case, see Alaei et al. (2012), Ma et al. (2018); in the distributionally-unknown case, see Badanidiyuru et al. (2013).
We adopt the revenue loss as our optimality notion. In the distributionally-known setting, revenue loss is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal policy with infinite switching budgets. The optimal policy with infinite switching budgets could be solved using a dynamic program, yet due to curse of dimensionality people deem it hard to solve (Gallego and Van Ryzin 1997). Revenue loss is one common metric to quantify the optimality gap, among the rich literature of different metrics; see Adelman (2007), Jasin (2014), Chen and Goldberg (2018).
In the distributionally-unknown setting, revenue loss is often referred to as “regret”. We study two regret notions in this setting: the first one, referred to as the -switch regret, is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal clairvoyant policy with the same switching budget; the second one, referred to as the overall regret, is the gap of expected revenue between our proposed policy (with a finite switching budget) and the optimal clairvoyant policy with infinite switching budgets. (Here, a clairvoyant policy is endowed with perfect knowledge of the distributions, but not the exact realizations.) While the overall regret is clearly larger than the -switch regret, interestingly, we show that asymptotically they are of the same order.
We provide full characterization of the revenue loss under two demand models, endowed with different levels of switching budgets.
In the distributionally-known setting (Section id1), there exists a critical switching budget above which we show the revenue loss is in the order of ; and below which the loss is in the order of . Specifically, we show that for any problem instance, there exists an instance dependent quantity (“Dimension of the Sparsest LP-solution”, see the precise definition in Section id1), such that any policy that makes at most switches must incur an revenue loss, while a class of static control policies that make at most switches can guarantee an revenue loss. Combining the above results, we show that there is an intrinsic gap on expected revenue loss, if one more critical price change is allowed. See Figure 1.
In the distributionally-unknown setting (Section id1), we show matching upper and lower bounds on the optimal (worst-case) regret in the order of , where . Here stands for the switching budget, for the number of resources, and for the number of feasible price vectors, see Figure 2 for an illustration. Specifically, we propose an S-Switch Balanced Exploration with LP-based Exploitation algorithm that achieves an regret; and we show a lower bound that any policy must incur an regret in the worst case distributions. Our result generalizes the bounds in Simchi-Levi and Xu (2019), which is a special case when , i.e., no inventory constraints.
In the distributionally-unknown setting, our results reveal an interesting separation between the (stochastic) online learning problems with inventory constraints and without inventory constraints. Curiosity has been aroused around this separation for long, that the regret bounds in both the classical multi-armed bandit problems and the bandits with knapsacks problems are in the order of . Yet people believe that the problem with inventory constraints is “harder” than the unconstrained counterparts. Our work shows that the constrained and unconstrained problems are indeed different, and characterize the dependency on the number of inventory constraints. If we fix all the other problem primitives unchanged and only add in one more inventory constraint, then the regret bounds is going to be larger or equal to before, illustrating that inventory constraints indeed increase the revenue loss – they make the problem “harder”.
We establish equivalence results of limited switching budgets and limited adaptivity, in the context of network revenue management problems. Limited switching budgets, as we have introduced in Section id1, is the hard constraint that one cannot change prices more than a fixed number of times. Limited adaptivity, as originally introduced in Dean et al. (2005, 2008) for stochastic packing and stochastic knapsack problems, is the hard constraint that one cannot collect feedbacks and adapt to the feedbacks more than a fixed number of times. We can think of full adaptivity as the ability to collect immediate feedbacks, or to query the status of the system; and limited adaptivity corresponds to collecting batched and delayed feedbacks. An algorithm with limited switching budgets can keep track of the demands in each period, yet constrained on price changes; while an algorithm with limited adaptivity can prescribe a trajectory of different prices with unlimited changes, yet without knowing the status of the system. When either ability (switching budgets or adaptivity) is constrained, we do not need the other ability more than necessary.
Finally, we introduce new techniques in the design and analysis of asymptotically optimal algorithms, which we will discuss in the following sections when we introduce the proof sketches.
There are two streams of network revenue management problems, the quantity-based problem where customers first arrive and then seller makes accept / reject decisions, and the price-based problem where seller first announces a price and then customers makes purchase decisions. For more discussion, see Talluri and Van Ryzin (2006), Gallego et al. (2018). Quantity-based and price-based problems are very closely related. For example, when there is only one single resource, these two problems are equivalent (Maglaras and Meissner 2006).
There is a rich literature in re-optimization techniques in network revenue management, featuring in frequent or infrequent adjustments in the action space (depending on if the problem is quantity- or price- based). Re-optimization techniques choose careful moments in time to make adjustments (e.g. to change prices in the price-based context), to better respond to the randomness, and to achieve a smaller expected revenue loss. Jasin (2014), in the context of price-based network revenue management, requires number of price changes to achieve a order of expected revenue loss, compared to the DLP upper bound. For more discussion of the re-optimization techniques, as well as the selection of different upper bounds, see Reiman and Wang (2008), Secomandi (2008), Chen and Homem-de Mello (2010), Ciocan and Farias (2012), Jasin and Kumar (2012), Bumpensanti and Wang (2018) for quantity-based re-optimization methods; see Jasin (2014), Chen et al. (2015) for price-based re-optimization methods; see Chen and Farias (2013), Vera et al. (2019) for re-optimization methods on single-resource dynamic pricing problems; see Arlotto and Xie (2018), Arlotto and Gurvich (2019), Vera and Banerjee (2019) for re-optimization methods on multi-secretary problems and multiple constraint knapsack problems, which are directly related to the network revenue management problem.
There is a rich literature on stochastic multi-armed bandit (MAB) problems with switching costs Agrawal et al. (1988, 1990), Cesa-Bianchi et al. (2013). See Jun (2004) for a survey. Most paper from this stream of research penalizes switching costs into the objective function of the reward calculation. That is, the objective is to minimize a convex combination of the regret incurred, and the switching costs. By contrast, our paper treats the switching budgets as hard constraint that can never be violated.
A closely related paper is Simchi-Levi and Xu (2019), which discusses the stochastic multi-arm bandit problem with the same hard constraint of switching budgets as us. Our work generalize their results by incorporating additional knapsack constraints Badanidiyuru et al. (2013). This extension gives us more modeling flexibility in network revenue management, as well as medical trials, and online advertising, etc. Moreover, our work develops significant new techniques to address new challenges due to the knapsack constraints. Even in the distributionally-known case, a revenue maximizing policy has to switch between prices to avoid consuming some inventory too quickly; by contrast, without knapsack constraints, the best policy always sticks to one single price.
There are several ways to model the business constraint of limited switching budgets. Literature have studied quite a few models of infrequent price changes, which are similar to the switching budgets we consider in our paper. In the distributionally-known setting, Netessine (2006) considers a single-product deterministic demand model, and focuses on the impact of a limited number of price changes. Chen et al. (2015) considers the same network revenue management model as we do, and focuses on asynchronously changing the prices of a small subset of products. In the distributionally unknown setting, Cheung et al. (2017) considers a dynamic pricing model where the demand function is unknown but belongs to a known finite set, and a pricing policy makes limited number of price changes. Chen and Chao (2019) studies a multi-period stochastic joint inventory replenishment and pricing problem with unknown demand and limited price changes. Simchi-Levi and Xu (2019) considers the stochastic multi-armed bandit problem with a general switching constraint. All of the above three papers adopt the idea of switching budgets as this paper does. Yet none of the above paper considers the existence of non-replenishable inventory constraints.
To the best of our knowledge, this paper is the first to characterize the exact dependency of expected revenue loss on limited switching budgets, in the price-based network revenue management setup, with and without the knowledge of demand distributions. In particularly, results in this paper generalize and unify several results appearing in the following papers: Gallego and Van Ryzin (1994), Besbes and Zeevi (2012), Badanidiyuru et al. (2013), Ferreira et al. (2018), Ma et al. (2018), Simchi-Levi and Xu (2019).
Throughout this paper, let be the set of real numbers, and the set of non-negative real numbers. Let be the set of positive integers. , let denote the set of integers that are no more than . We use bold font letters for vectors, where we do not explicitly indicate how large the dimension is. Let be the transpose of any vector . For any vector , let be the norm of , i.e. the number of non-zero elements in vector . For any real number , let be the non-negative part of . For any positive real number , let be the largest integer that is smaller or equal to ; for any non-positive real number , let . We use big notations to hide constant factors, and use notations to hide both constant factors and logarithmic factors.
We study the classical network revenue management problem. Let there be discrete, finite time horizon with periods. Time starts from period and ends in period . Let there be different products generated by different resources, each resource endowed with finite initial inventory . Let be the consumption matrix. Each entry stands for the amount of inventory used, if one unit of product is sold. Let . Each column stands for the consumption vector of each resource. Each column contains at least one nonzero entry. Let denote the -th row of .
In each period , the firm can offer a price for each product from a finite set of price vectors, which we denote using . Here are the prices for products , under price vector . All the price vectors are fixed and given. Let be the maximum price. Given price , the demand for each product is a given Bernoulli random variable . Denote as the expected demand. Our analysis can be directly generalized to general bounded distributions , similar to most prophet inequality and revenue management literature Kleinberg and Weinberg (2012), Alaei et al. (2012), Dütting et al. (2017), Liu and Van Ryzin (2008), Jasin (2014), Ma et al. (2018). For the ease of exposition we describe our results in the Bernoulli setting, yet our proof only relies on the fact that demands are bounded between .
For each unit of demand generated for product under price vector , the firm generates units of revenue by depleting units of each inventory . If no demand is generated, all the remaining inventory is carried over into the next period. The selling process stops immediately when the total cumulative demand of any resource exceeds its initial inventory.
This modeling framework gives us the flexibility to have two identical consumption vectors sold at two different prices. We simply treat them as different products, i.e. in the airline context, main cabin vs. basic economy.
In Section id1 we have addressed the business constraint that a firm is often prevented from changing the price vectors too many times. Throughout the horizon, we can change the posted price vectors for no more than times. We treat as a fixed constant. Intuitively, when is small we are more constrained, and when is large we are close to the classical network revenue management problem studied in the literature. Our objective is to maximize the total expected revenue in the face of limited switches.
We adopt the general notation to denote any policy with the full information about stochastic distributions, that suggests which price vector to use given the remaining periods and remaining inventory . For any , let be the set of policies that changes prices for no more than times. For any such that , . Let be the set of policies with infinite switching budgets (which is the set of all feasible policies when there is no switching constraint). Let be the expected revenue that policy generates in expectation. Let be one of the optimal dynamic policies with switching budget .
We adopt linear scaling as our asymptotic regime. We assume all the problem parameters are fixed, where and . We assume and are at the same comparable order, and larger than either one of .
We summarize our results, as well as some closely related results from literature, in Figure 3.
For any problem instance , litarature have studied the following deterministic linear program (DLP).
It is well known that the above DLP serves as an upper bound to any policy (“DP OPT” as in Figure 3), even an optimal policy with infinite switching budgets. The worst-case gap between the best policy and the DLP upper bound is in the order of 11 1 We give some comments on the lower bound for this gap. It is a worst-case lower bound from Bumpensanti and Wang (2018), Ma et al. (2018). Both these results from literature require the feasible price set to have only one price (vector). Since there is only one price, the optimal dynamic programming policy is essentially a fixed, zero-switch policy, which is a special case when . It remains an interesting open question how small could the gap be, between the DLP upper bound and any policy endowed with more than switching budgets., as shown in Figure 3.
Our results contribute to the literature by first characterizing how a policy’s switching budget affects its revenue loss. The main implication of our results is that for each problem instance, there is an intrinsic gap between the best achievable performance of policies with switches and policies with switches, as shown in Figure 3.
We start with a closer look of the DLP optimal solutions. Let the set of optimal solutions from the DLP be . Let be the least number of non-zero variables of any optimal solution, namely, the Dimension of Sparsest LP–solutions. Let be the set of such solutions. For any , let be the subset of dimensions that are non-zero in .
For any problem instance , there is an associated number. Any policy earns an expected revenue
We outline three key steps here and defer the details of our proof to Appendix id1. We first identify a clean event, such that the realized demands are close to the expected demands that the LP suggests. This clean event happens with high probability . In the second step, conditioning on such event, the maximum amount of revenue we generate is no more than compared to what the LP suggests; and the minimum amount of inventory demanded is no less than compared to what the LP suggests, resulting in no more than of realized revenue. In the third step, we show that the revenue loss from insufficient price changes is scaled in the order of , which dominates the amount revenue due to randomness. Such clean event analysis, originating from the online learning literature (Badanidiyuru et al. 2013, Lattimore and Szepesvári 2018, Slivkins 2019), is new to the stochastic control literature.
We highlight that the lower bound established in Theorem 1 is a per-instance lower bound, as it holds for every single problem instance. Such bound is much stronger than the worst-case type lower bounds that are omnipresent in the revenue management literature, which only holds for a specific instance, see the following example in our setting.
There exists a problem instance , such that any policy earns an expected revenue
The static control policy is a well known policy in the literature that has an asymptotic guarantee of revenue loss. See Gallego and Van Ryzin (1997), Cooper (2002), Maglaras and Meissner (2006), Liu and Van Ryzin (2008). In this section, we modify the well-established static control policy to prove a similar result in our setting, when the selling horizon stops immediately when the total cumulative demand of any resource exceeds its initial inventory. Our setting requires different techniques, because the papers listed above on the static control policy do not explicitly address when the cumulative demand would exceed the initial inventory. When this does happen half-way, the static control policy loses chances to gain revenue in the remaining periods, even if there are still other remaining inventory.
Nonetheless, we tweak the static control policy, so that with high probability the selling horizon never stops earlier than the last period . This is achieved by selecting the value of in the first step of Algorithm 1. Our policy has an asymptotic guarantee of revenue loss, which is slightly worse off by a factor than the revenue loss in the classical setting. The term comes from the Chernoff bound using concentration inequalities. Similar ideas have been used in Hajiaghayi et al. (2007), Ma et al. (2018), Balseiro et al. (2019) to prove asymptotic results in similar but different settings.
We explain the third step permutation. Suppose . In this case, and there are permutations. There are possible policies as suggested in Definition 1. Some of these policies have better performance than others. But in an asymptotic regime, they all have sublinear regret, compared to the DLP upper bound.
Any policy as defined in Algorithm 1 earns an expected revenue
We outline two key steps here and defer the details of our proof to Appendix id1. In the first step, we show that with high probability, the selling horizon never stops earlier than the last period . Second, conditioning on this high probability event, the expected revenue is at least fraction of the LP objective. Putting two steps together we know the total revenue loss is in the order of .
Combining Theorem 2 with Theorem 1, we identify an intrinsic performance gap when the switching budget is around a critical threshold . See Figure 3. Combining Theorem 2 with Proposition 1, we know that the Tweaked LP policy is asymptotically optimal (in the sense that it achieves the optimal revenue loss rate relative to the DLP upper bound), using only a few switches.
We keep studying the model described in Section id1, in the distributionally-unknown setting. Given price , the demand for each product is an unknown Bernoulli22 2 All our results can be directly extended to general unknown bounded random variables without further steps. random variable, . That is, the expected value is unknown and has to be sequentially learned over time. Our analysis can be directly generalized to general bounded distributions , similar to most online learning literature (Lattimore and Szepesvári 2018, Slivkins 2019). The main trade-off we would like to address in this setting is between exploration and exploitation. Let denote the true “mean demand matrix” that is unknown to the seller. We also call an “environment” The seller wants to learn the true environment while concurrently maximizing the expected total revenue over periods.
We adopt the following asymptotic regime that is also considered in Besbes and Zeevi (2012) and Ferreira et al. (2018): (1) the problem parameters are fixed, and (2) and are at the same comparable order, and much larger than either one of . In other words, for all , while . Without loss of generality, we assume that (otherwise there are always redundant resource constraints that can be eliminated).
Let denote any non-anticipating learning policy, and denote the price vector chosen by policy at period . More formally, establishes a probability kernel acting from the space of historical actions and observations to the space of actions at period . For any , let be the set of learning policies that change price vectors for no more than times almost surely for all . For any such that , . Let be the set of learning policies with infinite switching budgets. Let be the expected revenue that a learning policy generates under environment .
Let denote any clairvoyant policy with the full knowledge of the true environment . For any , let be the set of clairvoyant policies that change price vectors for no more than times under the true environment . For any such that , . Let be the set of clairvoyant policies with infinite switching budgets. Let be the expected revenue that a clairvoyant policy generates under environment . Let be the optimal clairvoyant policy with an switching budget of .
The performance of an -switch learning policy is measured against the performance of the optimal -switch clairvoyant policy . Specifically, we define the -switch regret of a learning policy as the worst-case difference between the expected revenue of the optimal -switch clairvoyant policy and the expected revenue of policy :
We also measure the performance of policy against the performance of the optimal unlimited-switch clairvoyant policy . Specifically, we define the overall regret of a learning policy as the worst-case difference between the expected revenue of the optimal unlimited-switch clairvoyant policy and the expected revenue of policy :
Intuitively, the -switch regret measures the “informational revenue loss” due to not knowing the demand distributions, while the overall regret measures the “overall revenue loss” due to not knowing the demand distributions and not being able to switch freely. Clearly, the overall regret is always larger than the -switch regret . In addition, according to the results in Section id1, we know that for all ,
and for all , there exist such that
Consider the DLP defined by (the-at-equationgroup-at-ID), (the-at-equationgroup-at-ID), (the-at-equationgroup-at-ID), and (the-at-equationgroup-at-ID). Let denote the DLP when its environment is . Let denote the objective value of a feasible solution to . Let denote the optimal objective value of .
We propose a policy that provides an upper bound on both the -switch regret and the overall regret. Our policy, called the -Switch Balanced Exploration with LP-based Exploitation (SS-BELE) policy, is described in Algorithm 2. The policy presents a novel combination of the ingredients from the SS-SE policy proposed in Simchi-Levi and Xu (2019), the Balanced Exploration policy proposed in Badanidiyuru et al. (2013), and the Tweaked LP policy defined in Algorithm 1.
Initialization: Calculate . Define and
Let denote the ending period of epoch (which will be determined by the algorithm). Let .
Let denote the algorithm’s action at period . Let be a random action in .
Policy: \[email protected]@algorithmic \FORepoch \STATE(Skip this step if ) Let be the total number of periods that price vector is chosen up to period , and be the empirical mean demand of product sold at price vector up to period . Calculate . Update
Compute the set . \STATEFor each , pick any solution such that . Let for all . \STATELet . Starting from this action, choose each price vector for consecutive periods, (we overlook the rounding issues here, which are easy to fix in regret analysis). Stop the algorithm once time horizon is met or one of the resources is exhausted. \STATEEnd epoch . Mark the last chosen action in epoch as . \ENDFOR\STATEFor epoch , calculate . Solve and find an optimal solution to with the least number of non-zero variables, . Let for all . Let . Starting from this action, choose each price vector for consecutive periods, (we overlook the rounding issues here, which are easy to fix in regret analysis). Stop the algorithm once time horizon is met or one of the resources is exhausted. End epoch .
We show that the SS-BELE policy is indeed an -switch policy and establish the following upper bound on its regret.
Let be the SS-BELE policy, then and
Let be the last epoch in the execution of policy , and let be the last period before the policy stops. We know that is a stopping time and we have . Since makes at most switches before and makes at most switches after , its total number of switches is always upper bounded by .
We use a coupling argument for the regret analysis. Consider a virtual policy that runs under exactly the same demand realization process and acts exactly the same as until period , but keeps running until the end of epoch regardless of the inventory constraints. Without conflicts to the previously defined notations in Algorithm 2, let denote the last period of epoch under policy (), let be the total number of periods that price vector is chosen by during period 1 to (), and let be the average realized demand of product sold at price vector during period 1 to (). Define the confidence radius as
Define the clean event as
By the Hoeffding’s inequality for general bounded random variables and arguments analogous to Lemma 1.5 in Slivkins (2019), we have under environment and policy . Since the clean event happens with very high probability, we can just focus on a clean execution of policy : an execution in which the clean event holds.
Conditional on the clean event, for all and , we have
where the last inequality is by Jensen’s inequality. For notational convenience, define for all . Then we have for all . For all , let denote an environment such that is an optimal solution to . Then for all and , we have
where the first inequality is by the definition of and the clean execution of policy , and the second inequality is by the feasibility of to . We now bound . Since , we have . By the specification of Algorithm 2, we have for all . Hence
This indicates that conditional on the clean event, policy will not violate any inventory constraint. By the coupling relationship between and , we know that conditional on the clean execution of policy . Thus conditonal on the clean event, the total revenue collected by policy is
We now bound conditional on the clean event. We have
where the last inequality is by (theequation-at-IDab) and
Let denote an optimal solution to , we have
By arguments analogous to (theequation-at-IDab), we have
Thus both and are upper bounded by
where is a constant that only depends on and . \@endparenv
We prove a matching lower bound for both the -switch regret and the overall regret.
For any , there exists a distrubitionally-unknown problem instance such that for all ,
Proof Idea. We construct a problem instance as follows. Let there be resources. Let
so that each product only consumes one resource: product consumes resource , and products and consume resource . For some , let
Define to be the unique solution over to:
where are small gaps that we will assign different values in the lower bound proof. The main property of the above problem instance is that as long as are all smaller than a constant threshold, no matter how we adjust , the resulting is always non-degenerate, i.e., the resulting is . Then by Theorem 1, even for a clairvoyant policy that knows , it needs to make at least switches to guarantee a sublinear regret. The rest of the proof is built on the “tracking the covering time” argument, originally proposed in Simchi-Levi and Xu (2019). However, our proof require more delicate analysis for the values of different LP, and require us to integrate Theorem 1 into the analysis of the last two stopping times.
Combining Theorem 3 and Theorem 4, we can see that for any switching budget , the optimal (worst-case) regret of the distributionally-unknown network revenue management problem is in the order of . This suggests that given a fixed switching budget , an increase in the number of resources may result in an increase in the order of the optimal regret. To the best our knowledge, this is the first result of such kind that explicitly characterizes how the dimension of the inventory constraints makes a stochastic online learning problem “harder”. In other words, an increase in the number of resources increases the required number of switches to achieve a same regret bound. While both the classical multi-armed bandit problem and the bandits with knapsacks problem (Badanidiyuru et al. 2013) / distributionally-unknown network revenue management problem (Besbes and Zeevi 2012, Ferreira et al. 2018) studied in previous literature essentially exhibit the same optimal regret rate in the order of , the order of is not affected by the dimension of the inventory constraints. Our results indicate an intrinsic separation in the optimal regret bounds, due to the switching budget constraint.
Limited adaptivity, as originally introduced in Dean et al. (2005, 2008) for stochastic packing and stochastic knapsack problems and in Perchet et al. (2016) for stochastic bandit problems, is the hard constraint that one cannot collect feedbacks and adapt to the feedbacks more than a fixed number of times. As a concrete example of limited-adapativity problems, the -batched multi-armed bandit problem is a variant of the classical MAB problem where the decision maker must split her learning process into batches and is only able to observe data from a given batch after the entire batch is completed, see Perchet et al. (2016), Gao et al. (2019).
Our SS-BELE policy is not only a limited-switch policy, but also a limited-adaptivity policy — throughout the entire periods, it only queries the collected data for at most times. Indeed, the policy runs in at most epochs, and within each epoch the policy’s actions can be pre-determined in the beginning of the epoch. Our policy can thus be treated as an -batched policy, providing regret for the -batched network revenue management / bandits with knapsacks problems, which generalizes the previous upper bound results for unconstrained batched bandit problems (Perchet et al. 2016, Gao et al. 2019). Moreover, since a regret lower bound for limited-switch policies is quite strong and always implies a regret lower bound for limited-adaptivity policies (as noted in Simchi-Levi and Xu 2019), our lower bound in Theorem 4 directly provides a matching regret lower bound for the -batched network revenue management / bandits with knapsacks problems. We thus state the following results:
For a distributionally-unknown network revenue management problem parameterized by fixed (, , , , , ), there exists a mapping between the family of limited adaptivity problems and the family of limited switching budget problems, such that following each mapping, the two corresponding problems have the same optimal regret (up to logarithmic factors).
Note that Simchi-Levi and Xu (2019) establishes a similar “regret equivalence” result between limited switches and limited adaptivity for the classical MAB problem without inventory constraints. Our regret equivalence result is a generalization of their result to inventory-constrained settings.
- Adelman (2007) Adelman D (2007) Dynamic bid prices in revenue management. Operations Research 55(4):647–661.
- Agrawal et al. (1988) Agrawal R, Hedge M, Teneketzis D (1988) Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control 33(10):899–906.
- Agrawal et al. (1990) Agrawal R, Hegde M, Teneketzis D (1990) Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic reports 29(4):437–459.
- Alaei et al. (2012) Alaei S, Hajiaghayi M, Liaghat V (2012) Online prophet-inequality matching with applications to ad allocation. Proceedings of the 13th ACM Conference on Electronic Commerce, 18–35 (ACM).
- Araman and Caldentey (2009) Araman VF, Caldentey R (2009) Dynamic pricing for nonperishable products with demand learning. Operations research 57(5):1169–1188.
- Arlotto and Gurvich (2019) Arlotto A, Gurvich I (2019) Uniformly bounded regret in the multisecretary problem. Stochastic Systems 9(3):231–260.
- Arlotto and Xie (2018) Arlotto A, Xie X (2018) Logarithmic regret in the dynamic and stochastic knapsack problem. arXiv preprint arXiv:1809.02016 .
- Badanidiyuru et al. (2013) Badanidiyuru A, Kleinberg R, Slivkins A (2013) Bandits with knapsacks. 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 207–216 (IEEE).
- Ball and Queyranne (2009) Ball MO, Queyranne M (2009) Toward robust revenue management: Competitive analysis of online booking. Operations Research 57(4):950–963.
- Balseiro et al. (2019) Balseiro SR, Brown DB, Chen C (2019) Dynamic pricing of relocating resources in large networks. Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, 29–30 (ACM).
- Besbes and Zeevi (2012) Besbes O, Zeevi A (2012) Blind network revenue management. Operations research 60(6):1537–1550.
- Bumpensanti and Wang (2018) Bumpensanti P, Wang H (2018) A re-solving heuristic with uniformly bounded loss for network revenue management. arXiv preprint arXiv:1802.06192 .
- Cesa-Bianchi et al. (2013) Cesa-Bianchi N, Dekel O, Shamir O (2013) Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 1160–1168.
- Chen and Chao (2019) Chen B, Chao X (2019) Parametric demand learning with limited price explorations in a backlog stochastic inventory system. IISE Transactions 1–9.
- Chen and Homem-de Mello (2010) Chen L, Homem-de Mello T (2010) Re-solving stochastic programming models for airline revenue management. Annals of Operations Research 177(1):91–114.
- Chen et al. (2015) Chen Q, Jasin S, Duenyas I (2015) Real-time dynamic pricing with minimal and flexible price adjustment. Management Science 62(8):2437–2455.
- Chen and Farias (2013) Chen Y, Farias VF (2013) Simple policies for dynamic pricing with imperfect forecasts. Operations Research 61(3):612–624.
- Chen and Goldberg (2018) Chen Y, Goldberg D (2018) Beating the curse of dimensionality in options pricing and optimal stopping. ArXiv e-prints .
- Chen and Shi (2019) Chen Y, Shi C (2019) Network revenue management with online inverse batch gradient descent method. Available at SSRN .
- Cheung et al. (2017) Cheung WC, Simchi-Levi D, Wang H (2017) Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6):1722–1731.
- Ciocan and Farias (2012) Ciocan DF, Farias V (2012) Model predictive control for dynamic resource allocation. Mathematics of Operations Research 37(3):501–525.
- Cooper (2002) Cooper WL (2002) Asymptotic behavior of an allocation policy for revenue management. Operations Research 50(4):720–727.
- Dean et al. (2005) Dean BC, Goemans MX, Vondrák J (2005) Adaptivity and approximation for stochastic packing problems. Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, 395–404 (Society for Industrial and Applied Mathematics).
- Dean et al. (2008) Dean BC, Goemans MX, Vondrák J (2008) Approximating the stochastic knapsack problem: The benefit of adaptivity. Mathematics of Operations Research 33(4):945–964.
- Dütting et al. (2017) Dütting P, Feldman M, Kesselheim T, Lucier B (2017) Prophet inequalities made easy: Stochastic optimization by pricing non-stochastic inputs. 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 540–551 (IEEE).
- Erdelyi and Topaloglu (2011) Erdelyi A, Topaloglu H (2011) Using decomposition methods to solve pricing problems in network revenue management. Journal of Revenue and Pricing Management 10(4):325–343.
- Ferreira et al. (2018) Ferreira KJ, Simchi-Levi D, Wang H (2018) Online network revenue management using thompson sampling. Operations research 66(6):1586–1602.
- Gallego et al. (2018) Gallego G, Topaloglu H, et al. (2018) Revenue management and pricing analytics (Springer).
- Gallego and Van Ryzin (1994) Gallego G, Van Ryzin G (1994) Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management science 40(8):999–1020.
- Gallego and Van Ryzin (1997) Gallego G, Van Ryzin G (1997) A multiproduct dynamic pricing problem and its applications to network yield management. Operations research 45(1):24–41.
- Gao et al. (2019) Gao Z, Han Y, Ren Z, Zhou Z (2019) Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 501–511.
- Hajiaghayi et al. (2007) Hajiaghayi MT, Kleinberg R, Sandholm T (2007) Automated online mechanism design and prophet inequalities. AAAI, volume 7, 58–65.
- Jasin (2014) Jasin S (2014) Reoptimization and self-adjusting price control for network revenue management. Operations Research 62(5):1168–1178.
- Jasin and Kumar (2012) Jasin S, Kumar S (2012) A re-solving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research 37(2):313–345.
- Jørgensen et al. (2003) Jørgensen S, Taboubi S, Zaccour G (2003) Retail promotions with negative brand image effects: Is cooperation possible? European Journal of Operational Research 150(2):395–405.
- Jun (2004) Jun T (2004) A survey on the bandit problem with switching costs. De Economist 152(4):513–541.
- Kleinberg and Weinberg (2012) Kleinberg R, Weinberg SM (2012) Matroid prophet inequalities. Proceedings of the forty-fourth annual ACM symposium on Theory of computing, 123–136.
- Lattimore and Szepesvári (2018) Lattimore T, Szepesvári C (2018) Bandit algorithms. preprint 28.
- Levy et al. (1998) Levy D, Dutta S, Bergen M, Venable R (1998) Price adjustment at multiproduct retailers. Managerial and decision economics 19(2):81–120.
- Li and Zheng (2019) Li X, Zheng Z (2019) Dynamic pricing with external information and inventory constraint. Available at SSRN .
- Liu and Van Ryzin (2008) Liu Q, Van Ryzin G (2008) On the choice-based linear programming model for network revenue management. Manufacturing & Service Operations Management 10(2):288–310.
- Ma et al. (2018) Ma W, Simchi-Levi D, Zhao J (2018) Dynamic pricing under a static calendar. Available at SSRN 3251015 .
- Maglaras and Meissner (2006) Maglaras C, Meissner J (2006) Dynamic pricing strategies for multiproduct revenue management problems. Manufacturing & Service Operations Management 8(2):136–148.
- Mehta et al. (2005) Mehta A, Saberi A, Vazirani U, Vazirani V (2005) Adwords and generalized on-line matching. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), 264–273 (IEEE).
- Netessine (2006) Netessine S (2006) Dynamic pricing of inventory/capacity with infrequent price changes. European Journal of Operational Research 174(1):553–580.
- Perakis and Singhvi (2019) Perakis G, Singhvi D (2019) Dynamic pricing with unknown non-parametric demand and limited price changes. Available at SSRN 3336949 .
- Perchet et al. (2016) Perchet V, Rigollet P, Chassang S, Snowberg E, et al. (2016) Batched bandit problems. The Annals of Statistics 44(2):660–681.
- Reiman and Wang (2008) Reiman MI, Wang Q (2008) An asymptotically optimal policy for a quantity-based network revenue management problem. Mathematics of Operations Research 33(2):257–282.
- Secomandi (2008) Secomandi N (2008) An analysis of the control-algorithm re-solving issue in inventory and revenue management. Manufacturing & Service Operations Management 10(3):468–483.
- Simchi-Levi and Xu (2019) Simchi-Levi D, Xu Y (2019) Phase transitions and cyclic phenomena in bandits with switching constraints. Advances in Neural Information Processing Systems, 7521–7530.
- Slivkins (2019) Slivkins A (2019) Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272 .
- Talluri and Van Ryzin (2006) Talluri KT, Van Ryzin GJ (2006) The theory and practice of revenue management, volume 68 (Springer Science & Business Media).
- Topaloglu (2009) Topaloglu H (2009) Using lagrangian relaxation to compute capacity-dependent bid prices in network revenue management. Operations Research 57(3):637–649.
- Truong and Wang (2019) Truong VA, Wang X (2019) Prophet inequality with correlated arrival probabilities, with application to two sided matchings. arXiv preprint arXiv:1901.02552 .
- Vera and Banerjee (2019) Vera A, Banerjee S (2019) The bayesian prophet: A low-regret framework for online decision making. Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, 81–82 (ACM).
- Vera et al. (2019) Vera A, Banerjee S, Gurvich I (2019) Online allocation and pricing: Constant regret via bellman inequalities. arXiv preprint arXiv:1906.06361 .
- Wang et al. (2014) Wang Z, Deng S, Ye Y (2014) Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research 62(2):318–331.
- Zbaracki et al. (2004) Zbaracki MJ, Ritson M, Levy D, Dutta S, Bergen M (2004) Managerial and customer costs of price adjustment: direct evidence from industrial markets. Review of Economics and statistics 86(2):514–533.
For any problem instance . Any policy only selects no more than many price vectors. , let be the price that policy selects; let be the total number of periods that price is offered, under policy . Notice that both and may be random, i.e. which price vector to offer and for how long each price vector is offered they both depend on the random trajectory.
Now denote as the random amount of product sold, during the periods that price vector is offered. We know . Here is a random amount, so we cannot directly use concentration inequalities. But we can adapt the trick from Slivkins (2019). Suppose there was a tape of length for each product and each price vector , with each cell independently sampled from the distribution of . This tape serves as a coupling of the random reward: in each period if price vector is offered, we simply generate a demand from the cell of the tape. Now we can use Hoeffding inequality:
Denote the following event :
Using a union bound we have:
because are all less than . The happening of such event means that the realized demands are close to the expected demands, suggesting that LP is a good proxy of any policy .
Now if we focus on the usage of any price vector indexed by , the total expected revenue is . Then the total expected revenue generated during the entire horizon can be upper bounded by
where the last inequality is because . On the other hand, the consumption of inventory must not violate the inventory constraint.
Lower bounding by we have
Now construct the following LP:
If we let and to scale linearly by , then the perturbed term is negligible, and so the optimal solution is scaled by . This means that if we restrict ourselves to a solution that uses no more than non-zero variables, we incur a linear loss, i.e. .
Since from each solution of this perturbed LP, we can find a corresponding discounted solution that is feasible to the DLP. This suggests that , because DLP is a maximization problem.
Putting all together, and conditional on event that happens with probability at least ,
which suggests that . \@endparenv
Let be any policy suggested in Algorithm 1. Let be the associated optimal solution. We prove Theorem 2 by comparing the expected revenue earned by Algorithm 1 against a virtual policy . This virtual policy mimics Algorithm 1 in steps 1–3. But in step 4, it sets the price vector to be for the first periods, then for the next periods, …, for the next periods, and finally for the last periods. Here is a shut-off price, under which .
Policy is virtual because it requires a shut-off price that may or may not be available. Moreover, it requires many price changes, which is more than many changes as suggested in Algorithm 1.
Policy serves to bridge our analysis. It breaks our Theorem 2 into two inequalities that we will prove separately.
For any policy as defined in Algorithm 1 and its associated virtual policy , they both solve the same DLP and have the same optimal solution. To prove the first inequality, note that both and commit to the same prices in the first time periods, and earns the same revenue following each trajectory of random demand. At the end of period , policy still commits to , while policy makes one change and sets . At the end of period , if the selling horizon has ended due to inventory stock-outs, then either policy earns zero revenue, so and makes no difference. If the selling horizon has not ended and there is remaining inventory for any resource, then policy earns non-negative revenue, while earns zero by setting a shut-off price. Following each trajectory of random demand, policy earns more revenue than . As a result, .
To prove the second inequality, we introduce the following notations. Let be an indicator of whether or not policy offers price in period .
is deterministic once policy is determined.
Under policy , following each trajectory of random demand, we define a the length of the effective selling horizon as function of a stopping time :
The effective selling horizon is the minimum between (i) last period before the cumulative demand of any resource exceeds its initial inventory, and (ii) the last period before policy switches to the shut-off price.
Let be the remaining inventory of resource at the end of period . Under this notation, . Note that are random variables, and during the effective selling horizon, inventory updates in the following fashion
Now we calculate the expected revenue.
We explain the inequalities. The first inequality is because we only focus on the revenue earned if event happens, while ignoring the revenue earned if event does not happen; and when event does happen, the maximum amount of any inventory demanded in one single period cannot exceed . The first equality is expanding the expectations, where we use the fact that and are independent, because the indicator is a random event happening up to period , while the summation term is a random amount happening in period . The third inequality is due to (14), is decreasing in , so .
In this block of inequalities (15), the summation term
since the indicators locate which counts into this summation. So the next thing we do is to lower bound the probability.
where the last (strict) inequality is because we plug in .
This above inequality suggests that for any , the expected cumulative demand generated up till period is strictly less than . So we can use concentration inequalities.
where the first inequality is due to union bound; the second inequality is due to Hoeffding inequality, is bounded by , and there are no more than such terms; the third inequality is because we lower bound each by ; the last inequality is when we plug in , and we know that .