Abstract
Reinforcement Learning (RL) has achieved stateoftheart results in domainssuch as robotics and games. We build on this previous work by applying RLalgorithms to a selection of canonical online stochastic optimization problemswith a range of practical applications: Bin Packing, Newsvendor, and VehicleRouting. While there is a nascent literature that applies RL to these problems,there are no commonly accepted benchmarks which can be used to compare proposedapproaches rigorously in terms of performance, scale, or generalizability. Thispaper aims to fill that gap. For each problem we apply both standard approachesas well as newer RL algorithms and analyze results. In each case, theperformance of the trained RL policy is competitive with or superior to thecorresponding baselines, while not requiring much in the way of domainknowledge. This highlights the potential of RL in realworld dynamic resourceallocation problems.
Quick Read (beta)
ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems
Abstract
Reinforcement Learning (RL) has achieved stateoftheart results in domains such as robotics and games. We build on this previous work by applying RL algorithms to a selection of canonical online stochastic optimization problems with a range of practical applications: Bin Packing, Newsvendor, and Vehicle Routing. While there is a nascent literature that applies RL to these problems, there are no commonly accepted benchmarks which can be used to compare proposed approaches rigorously in terms of performance, scale, or generalizability. This paper aims to fill that gap. For each problem we apply both standard approaches as well as newer RL algorithms and analyze results. In each case, the performance of the trained RL policy is competitive with or superior to the corresponding baselines, while not requiring much in the way of domain knowledge. This highlights the potential of RL in realworld dynamic resource allocation problems.
ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems
Bharathan Balaji, Jordan BellMasterson, Enes Bilgin, Andreas Damianou, Pablo Moreno Garcia, Arpit Jain, Runfei Luo, Alvaro Maggiar, Balakrishnan Narayanaswamy, Chun Ye Amazon
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 Introduction
Reinforcement learning (RL) has achieved state of the art results in gaming (?), robotics (?) and others. Our work relates to the growing literature of applying RL to optimization problems. (?) show RL techniques produce near optimal solutions for the traveling salesman (TSP) and knapsack problems. (?) use RL to solve TSP and its variants: vehicle routing, orienteering, and a stochastic variant of prizecollecting TSP. (?) solve both static and online versions of the vehicle routing problem. (?) apply RL to the dual sourcing inventory replenishment problem, and further demonstrate results on a real dataset. (?) apply RL to online versions of the knapsack, secretary and adwords problems. (?) apply RL to the beer game problem. (?) use RL for fleet management of taxis on a real life dataset.
Our contribution is to extend the existing RL literature to a set of dynamic resource allocation problems which parallel realworld problems. In particular, we present benchmarks for three classic problems: Bin Packing, Newsvendor and Vehicle Routing. In each case, we show that trained policies from outofthebox RL algorithms with simple 2 layer neural networks are competitive with or superior to established approaches. We open source our code^{1}^{1} 1 https://github.com/awslabs/orrlbenchmarks and parameterize the complexity of the problems in order to encourage fair comparisons of algorithmic contributions. Each environment is implemented with the OpenAI Gym interface (?) and integrated with the RLlib (?) library so researchers can replicate our results, test algorithms and tune hyperparameters.
2 Bin Packing
In the classic version of the bin packing problem, we are given items of different sizes and need to pack them into as few bins as possible. In the online stochastic version, items arrive one at a time with item sizes drawn from a fixed but unknown distribution. Many resource allocation problems in Operations Research and Computer Science face uncertain supply, and can be cast as variants of the online bin packing problem. In warehouse and transportation operations, variants of bin packing can be seen in: the order assignment problem (where we assign orders to fulfillment resources), the tote packing problem (where we fill items as they arrive into totes for shipment), and the trailer truck packing problem. In computing, bin packing problems arise in cloud computing scenarios, where virtual machines with varying memory and cpu requirements are allocated to servers with fixed capacity.
2.1 Problem Formulation
We use a formulation of the bin packing problem similar to ? (?). In the online stochastic bin packing problem, items arrive online, one in each time period $t$, with $t\in \{1,\mathrm{\dots},T\}$. Items can be of different types $j\in \{1,\mathrm{\dots},J\}$. The size of type $j$ is ${s}_{j}$ and the probability that an item is of type $j$ is ${p}_{j}$. Without loss of generality, we assume item types are indexed in the increasing order of their size: $$. Upon arrival, the item needs to be packed into one of the bins, each with size $B$ (we assume that $$). A packing is considered feasible if the total size of the items packed in each bin does not exceed the bin size. The task is to find a feasible packing that minimizes the number of bins used to pack all of the items that arrive within the time horizon. We assume the item sizes ${s}_{j}$ and bin size $B$ are integers. We assume the number of bins one can open is unlimited and denote the sum of item sizes in a bin $k$ as level ${h}_{k}$. After $t$ items have been packed, we denote the number of bins at some level $h$ as ${N}_{h}(t)$, where $h\in \{1,\mathrm{\dots},B\}$.
It can be shown that minimizing the number of nonempty bins is equivalent to minimizing the total waste (i.e. empty space) in the partially filled bins. In real applications (e.g. packing trucks, or virtual machines), there is a dollarvalue cost associated with the consumption of these resources, so at any time horizon our objective is to minimize total waste ${\sum}_{t=0}^{T}W(t)$, where
$$W(t)\triangleq \sum _{h=1}^{B1}{N}_{h}(t)(Bh).$$  (1) 
We use ${W}_{F}^{A}(t)$ to denote the total waste after step $t$ of algorithm $A$ when the input samples come from distribution $F$. For RL, we define the cumulative reward up to time step $t$ to be ${W}_{F}^{RL}(t)$. ? (?) showed that any discrete distribution falls into one of three categories based on expected distribution $E[{W}_{F}^{OPT}(t)]$, where OPT is an offline optimal policy.

1.
Linear waste (LW): $E[{W}_{F}^{OPT}(t)]=\mathrm{\Theta}(t)$, e.g. $B=9$, two item types of size $\{2,3\}$ with probability $\{0.8,0.2\}$ respectively.

2.
Perfectly Packable (PP): $E[{W}_{F}^{OPT}(t)]=\mathrm{\Theta}(\sqrt{t})$, e.g. $B=9$, two item types of size $\{2,3\}$ with probability $\{0.75,0.25\}$ respectively.

3.
PP with bounded waste (BW): $E[{W}_{F}^{OPT}(t)]=\mathrm{\Theta}(1)$, e.g. $B=9$, two item types of size $\{2,3\}$ with probability $\{0.5,0.5\}$ respectively.
We will train an RL policy for each of the three distribution types and compare our policy to the appropriate baseline.
We formulate the bin packing problem as an MDP, where the state ${S}_{t}\in \mathcal{S}$ is the current item with size ${s}_{j}$ and the number of bins at each level is ${N}_{h}(t)$, where $h\in \{1,\mathrm{\dots},B\}$. The action $a$ is to pick a bin level which can fit the item. Thus, the number of actions possible is $B$ with one action for each level and action 0 corresponds to opening a new bin. An episode defines the start and end of simulation. Initially, all the bins are empty. The reward ${R}_{t}$ is the negative of incremental waste as each item is put into a bin ${s}_{j}$. If the item is put into an existing bin, the incremental waste will reduce by item size. If the item is put into a new bin, the waste increases by the empty space left in the new bin. $T$ items need to be placed in the bins, after which the episode ends. We leave varying number of time steps for future work. We impose action masking for infeasible actions, such as picking a level for which bins do not exist yet, by outputting a probability for every action (regardless of feasibility) but multiplying the infeasible action probabilities by zero so they are never executed.
2.2 Related Work
Bin packing is a wellstudied problem in the operations research and computer science literature. The problem is already NPhard in its basic form. As a result, many of the classical approaches to bin packing analyze the performance of approximation algorithms. We refer the readers to the survey (?) for algorithmic approaches to classical bin packing and its generalizations.
For online bin packing, a simple heuristic – Best Fit – is known to use at most $1.7$ times the optimal number of bins in the worst case (?). Best Fit places an item in a bin where, if the item were to be packed, would leave the least amount of space. Another competitive heuristic is Sum of Squares (SS) heuristic (?). In particular, SS is proven to be asymtotically optimal (up to constants) as the episode length grows.
The simple heuristics described above are distribution agnostic. More sophisticated algorithms learn an empirical estimate of the item size distribution, leverage such distribution to solve a linear program, and use its dual to guide the online policy (?). This approach has been used to solve online packing and covering problems (?; ?).
2.3 Baseline Algorithms
We use the Sum of Squares (SS) heuristic and Best Fit (BF) as our baseline algorithms. When the $t$th item of size $s$ arrives, SS picks a bin of level ${h}^{*}$ that minimizes the value of the following sumofsquares potential:
$$\sum _{h=1}^{B1}{({N}_{h}(t))}^{2}.$$  (2) 
It can be shown that minimizing (2) is equivalent to:
$${h}^{*}=\underset{h:{N}_{h}(t1)>0\text{and}h+s\le B}{\mathrm{arg}\mathrm{min}}[{N}_{h+s}(t1){N}_{h}(t1)],.$$  (3) 
where, ${N}_{0}={N}_{B}=0$. Intuitively, SS tries to equalize the number of bins at each level. Due its simplicity, we implemented (3) version of SS.
BF selects a bin at the highest level that can fit the item:
$${h}^{*}=\underset{h:{N}_{h}(t1)>0\text{and}h+s\le B}{\mathrm{arg}\mathrm{max}}h$$  (4) 
2.4 Reinforcement Learning Algorithm
We use the Proximal Policy Optimization (PPO) algorithm (?). We use a twolayer neural network with 256 hidden nodes each for both the actor and the critic. The input to both actor and critic network is the state, the output of the actor network is a vector giving the probabilities of taking any action in the action space, and the output of the critic network is predicted cumulative discounted reward from that state. During training, the agent explores the state space by sampling from the probability distribution of the actions generated by the policy network. We mask actions by reducing the probability of invalid actions to 0. We use a single machine with 4 GPUs and 32 CPUs for our experiments. We list all the hyperparameters used in Appendix A.
2.5 Results
For each sample item size distribution (BW, PP, LW), we train the RL algorithm (PPO) and compare to the baseline algorithms (SS and BF). We consider two variations, bin size of 9, with 100 items and distributions listed in section 2.1, and bin size of 100, with 1000 items and the following item size distribution:

1.
item sizes: $[1,2,3,4,5,6,7,8,9]$

2.
item probabilities for BW:
$[0.14,0.10,0.06,0.13,0.11,0.13,0.03,0.11,0.19]$ 
3.
item probabilities for PP:
$[0.06,0.11,0.11,0.22,0,0.11,0.06,0,0.33]$ 
4.
item probabilities for LW: $[0,0,0,1/3,0,0,0,0,2/3]$.
Figure 1 plots the reward earned by the RL policy in training (blue) vs the Best Fit baseline (red) for bin size 100 and different item size distributions (BW, PP, and LW) as a function of training time (measured in minutes). The solid lines represent the mean reward of each policy, and the shaded bands represent the min/max rewards. By the end of training, RL either matches or outperforms the baseline policy for all three item size distributions. In particular, the reward gap between RL and baseline is the largest for LW distribution (which is expected, as both BF and SS are known to be suboptimal for LW distribution).
In Table 1, we inspect numerically the trained RL policy vs. baseline for bin size 100. Supporting what we observed in the initial figures, this table shows the final RL policy outperforms or matches the baseline for each distribution.
We test generalization of the RL policy by evaluating the trained policy with a different item distribution than the one it was trained on. For PP and BW distributions, the trained policies translate well. Both the PP and BW policies perform as well as the baseline solutions for the LW distribution. The policy trained on the LW distribution generalizes reasonably well but does not do as well as the baseline solutions in the BW and PP distributions. We did observe overfitting if we pick model iterations from much later in training. We leave the study of overfitting and generalization across distributions as future work. A note on scaling: the training time for bin size 100 is about 3x, 4x and 10x more than bin size 9 for PP, BW and LW respectively. The bin size 9 results can be found in the supplementary material.
Algorithm  Perfect Pack  Bounded Waste  Linear Waste  

$\mu $  $\sigma $  $\mu $  $\sigma $  $\mu $  $\sigma $  
RL with PP  49.0  29.5  48.0  29.5  1358  44.2 
RL with BW  47.6  29.3  53.9  26.4  1368  48.0 
RL with LW  258.6  69.3  143.9  84.9  880.2  43 
SS  56.54  28.9  56.61  30.2  2091  92 
Best Fit  52.01  29.5  51.4  28.9  1314  53 
Finally, we inspect the relative structure of the policies to ensure that RL is learning a sensible solution. In particular, we plot the state variable values as a function of the number of steps in an episode. Intuitively, the integral of these plots represents the waste, which we want to minimize. An optimal policy should show a (relatively) flat surface. We use bin size of 9 for this analysis for ease of manual inspection and study the linear waste distribution that highlights the difference between the Sum of Squares baseline and RL distinctly. From Figure 2, we see that the baseline policy leaves more open bins at a lower fullness, whereas RL only leaves open bins at level 8 (which cannot be closed once they reach that level). This indicates that the learned RL polocy is reasonable. For other distributions, the graphs for both the baseline and RL policy look similar to each other.
3 MultiPeriod Newsvendor with Lead Times
The Newsvendor problem (see e.g. ? (?)) is a seminal problem in inventory management wherein we must decide on an ordering decision (how much of an item to purchase from a supplier) to cover a single period of uncertain demand. The objective is to tradeoff the various costs incurred and revenues achieved during the period, usually consisting of sales revenue, purchasing and holding costs, loss of goodwill in the case of missed sales, and the terminal salvage value of unsold items.
In practice, decisions are rarely isolated to a single period, and they are repeatedly and periodically taken and thus have a downstream impact. This makes the problem nontrivial and has no known optimal solution, as compared to the singleperiod Newsvendor which has a known optimal solution when the demand distribution is known. Additionally, purchased units do not, in general, arrive quasiinstantaneously, but rather after a few periods of transit from the vendor to their final destination, known as the lead time. The presence of lead times further complicates the problem. Solving the multiperiod newsvendor problem with lead times and lost sales is a notoriously difficult problem (?). It requires keeping track of orders placed in different periods, leading to what is known as the curse of dimensionality, rendering any exact solution impractical even for small lead times of 2 and 3 periods, and outright infeasible at higher dimensions. As a result, the problem forms a good testbed for RL algorithms given that the observation of rewards is delayed by the lead time and that it can be formulated as a Markov Decision Problem. A number of heuristics have been developed for the lost sales problem, often based on orderupto level policies for the equivalent model with backlogged demand. Comparisons in the performance of these two policies have been studied (?), and it has been shown that orderupto policies are asymptotically optimal (?), thus making for good benchmark policies.
3.1 Problem formulation
We consider the stationary, singleproduct, multiperiod dynamic inventory management problem with vendor lead time (VLT) and stochastic demand. Here, the VLT $l$ refers to the number of time steps between the placement and receipt of an order. The demand $D$ is assumed to be stationary and Poisson distributed with mean $\mu $. Items are purchased at a cost $c$ and sold at a price $p$, and incur a penalty for lost sales $k$ for each unit of unmet demand while any unit left over at the end of a period incurs a holding cost $h$. Items do not perish. A discount factor $\gamma $ is used. No terminal value is awarded for the inventory state at end of episode.
The problem is formulated as a Markov Decision Process:
 State:

The state $S$ of the problem is given by
$S=(p,c,h,k,\mu ,{x}_{0},\mathrm{\dots},{x}_{l1})$ where ${x}_{0}$ is the onhand inventory, ${x}_{1}$ the units to be received one period hence, and so on.
 Action:

In each period the state of the system is observed and an action $A=q$ is taken, consisting of the size of the order placed and to arrive $l$ time periods later.
 Reward:

We first incur the purchasing cost corresponding to the procured units given the action $a$. A realization $d$ of the demand $D$ (Poisson distributed with mean $\mu $) is then observed, and demand is satisfied as much as is possible given onhand levels. Missed sales incur a loss of goodwill $k$ per unit, while leftover units incur a holding cost $h$:
$R$ $=p\mathrm{min}({x}_{0},d)cah{({x}_{0}d)}^{+}k{(d{x}_{0})}^{+}.$ where ${(x)}^{+}=\mathrm{max}(x,0)$.
 Transition:

The state of the system $S$ is then updated to ${S}_{+}$ by moving all pipeline units downstream and incorporating the newly purchased units:
${S}_{+}$ $=(p,c,h,k,\mu ,{({x}_{0}d)}^{+}+{x}_{1},{x}_{2},\mathrm{\dots},{x}_{l1},a).$ We do not impose action masking because there is no infeasible action in the prespecified, positive continuous action space. We convert the continuous buy quantity to integer by postprocess rounding.
3.2 Related work
Datacentric approaches (?) and reinforcement learning approachse (?) have recently been suggested for the newsvendor problem. These have so far still remained focused on the single period problem and often trying to learn some of the inputs, such as demand. A few other papers have considered Reinforcement Learning in the context of inventory management, such as (?), where a dual sourcing problem is tackled using RL.
3.3 Baseline Algorithm
As noted in the beginning of this section, it is impractical or even infeasible to solve the multistage newvendor problem exactly. However, it is possible to use heuristics that provide good approximations to the optimal solution. In particular, a way to tackle the problem is to approximate it by its backlogging counterpart, where orders are not lost if unsatisfied, for which a closed form solution of the optimal policy exists in the form of an orderupto policy characterized by the following critical ratio:
$CR={\displaystyle \frac{p\gamma c+k}{p\gamma c+k+h}}.$ 
As a result, letting ${z}^{*}={F}_{l}^{1}(CR)$, where ${F}_{l}$ is the cumulative distribution function of the $l$ period demand, the policy is given by:
$a$  $={\left({z}^{*}{\displaystyle \sum _{i=0}^{l1}}{x}_{i}\right)}^{+}.$ 
3.4 Reinforcement Learning Algorithm
We use Trust Region Policy Optimization (TRPO) (?) as implemented in the RLLab package (?), where the policy is represented by a neural network. We use a single machine with 4 GPUs and 32 CPUs for our experiments. We use a neural network of size (64,32) and the hyperparameters presented in Appendix C.
3.5 Results
We present the results obtained using a VLT of 5 and time horizon of 40. The economic parameters were chosen so that $p,c\in [0,100]$, $h\in [0,5]$ and $k\in [0,10]$, while the demand mean $\mu $ was such that $\mu \in [0,200]$.
We sampled problem parameters as follows: $p\sim U[0,100]$, $c\sim U[0,p]$, $h\sim U[0,\mathrm{min}(c,5)]$, $k\sim U[0,10]$ and $\mu \sim U[0,200]$ for the economic and demand parameters; where $U[a,b]$ denotes a uniformly random variable between $a$ and $b$. The initial state was simply set to be $\mathrm{\U0001d7ce}$. Figure 3 compares the results obtained by the RL algorithm to the baseline. The RL solution eventually beats the benchmark.
While solving this problem numerically is intractable, the optimal inventory policy structures are well known. It is thus of interest to check whether their properties are being learned by the RL algorithm. Given the dimension of the problem, we cannot observe the entire policy, but can investigate slices thereof. We thus fix price, cost, holding cost, penalty for lost sale and mean demand to 50, 25, 0.5, 5 and 100, respectively, and plot the optimal policy in the space $(0,0,0,{x}_{3},{x}_{4})$ in Figure 4, where ${x}_{3}$ and ${x}_{4}$ stand for the inventory to arrive in three and four periods respectively. The figure shows contour curves of the buying quantity as a function of the inventory state. Intuitively, a good policy will buy less if we already have inventory onhand (or in the pipeline). Visually, this should look like a smooth, decreasing frontier. We observe that the algorithm is learning this desired policy structure over the training period and we can start to observe monotonicity of the policy along most directions.
4 Vehicle Routing Problem
One of the most widely studied problems in combinatorial optimization is the traveling salesman problem (TSP), which involves finding the shortest route that visits each node in a graph exactly once and returns to the starting node. TSP is an NPhard problem and has a variety of practical applications from logistics to DNA sequencing. The vehicle routing problem (VRP) is a generalization of TSP where one or more vehicles are expected to visit the nodes in a graph, for example to satisfy customer demand. VRP is also a wellstudied topic and has several applications, especially in supply chain and logistics. An important extension of VRP is where some of the information about the graph is revealed over time, such as demand at each node and travel time. This class of VRP is called dynamic VRP (DVRP, also known as realtime or online VRP). Stochastic VRP (SVRP) is where one or more problem parameters are stochastic with some known probability distributions (as opposed to arbitrary or adversarial distributions). In many reallife applications, the relevant VRP is both stochastic and dynamic (SDVRP), which is also focus of this work. We formulate a variant of SVRP and compare solution approaches from the Operations Research (OR) and Reinforcement Learning (RL) literature.
4.1 Problem Formulation
We consider a VRP variant that is of an ondemand delivery driver. Orders arrive on the phone app of the driver in a dynamic manner throughout the problem horizon. Each order has a delivery charge (reward) known to the driver at the time of order creation, and it is assigned to a pickup location (e.g. restaurant) in the city. “City” here refers to the Euclidean space in a grid map where the VRP environment is created. The city consists of mutually exclusive zones that generate orders at different rates. At each time step, an order generated with a constant probability and assigned to a zone (i.e. ${p}_{1}=0.5,{p}_{2}=0.3,{p}_{3}=0.1,{p}_{4}=0.1$ for zones 1,…,4). Orders have rewards that come from zonespecific truncated normal distribution with different ranges (i.e. with minimum and maximum dollar values of [8,12], [5,8], [2,5] and [1,3] for each zone, respectively). Orders have delivery time windows, which is within 60 minutes from the creation of the order. The driver has to accept an order and pick up the package from a given location prior to delivery. Orders that are not accepted disappear probabilistically (i.e. with a timeout probability of 0.15 per time step) and assumed to be taken by some other competitor driver in the city. The vehicle has a capacity limit of 4 orders on the vehicle, but the driver can accept unlimited orders and plan the route accordingly. The driver incurs a cost per time step and unit distance traveled (0.1 for both), representing the money value of time and travel costs. The driver’s goal is to maximize the total net reward over an episode of 1000 time steps. This version of VRP is known as stochastic and dynamic capacitated vehicle routing problem with pickup and delivery, time windows and service guarantee. We choose this particular variant, which is less studied in the literature, because it more closely resembles realworld instances of the problem and gives us higher confidence that RL can generalize beyond our toy setup.
 State:

We include pickup location ${p}_{t}$, driver info, and order info. Driver info contains the driver’s position ${h}_{t}$ and the capacity left ${c}_{t}$. Order info contains the orders’ location ${l}_{t}$, status ${w}_{t}$ (open, accepted, picked up or delivered/inactive), the time elapsed since each order’s generation ${e}_{t}$ and the corresponding dollar value ${v}_{t}$. Thus, the state is ${S}_{t}=({p}_{t},{h}_{t},{c}_{t},{l}_{t},{w}_{t},{e}_{t},{v}_{t})$.
 Action

The agent chooses an action ${A}_{t}$ from five options – accept the open order $i\in P$, pick up the accepted order $i\in A$, pick up the accepted order $i\in A$, the pickup location $j\in R$, or wait and stay unmoved.
 Reward:

The reward ${R}_{t}$ is the total value of all delivered orders ${f}_{t}$ minus the cost ${q}_{t}$. ${f}_{t}$ is divided into $3$ equal parts for reward shaping: when the order gets accepted, picked up, and delivered respectively. Thus we have:
$${R}_{t}=\frac{1}{3}\left({\mathrm{\U0001d7d9}}_{accepted}+{\mathrm{\U0001d7d9}}_{pickedup}+{\mathrm{\U0001d7d9}}_{delievered}\right){f}_{t}{q}_{t},$$ where ${q}_{t}=({q}_{time}+{q}_{move}+{q}_{failure})$. ${q}_{time}$ is the time cost, ${q}_{move}$ is the moving cost (per time step). ${q}_{failure}$ is a large penalty ($50$) if the agent accepts an order but fails to deliver within the promised time.
The vehicle’s capacity remains unchanged if an order is accepted but not picked up. In effect, this grants the agent the flexibility to accept more orders than available capacity, which can be picked up later when space allows. The action of heading to a specific pickup location enables the agent to learn to stay near popular pick up locations. We impose action masking during the policy training. The agent cannot perform the following invalid actions: (i) pick up an order when its remaining capacity is $0$; (ii) pick up an order that is not yet accepted; (iii) deliver an order that is not in transit.
4.2 Related Work
There is a substantial literature on VRP (?). The closest VRP variant to the problem considered in this paper is the Pickup and Delivery Problem with Time Windows (PDPTW) (?), which has some additional complexities over vanilla VRP. Due to such complexities, there are fewer exact solution approaches (?; ?), and the majority of the literature focuses on heuristics. When the problem is also stochastic and dynamic, exact solution methods become intractable except for very specific problem settings. In such cases, anticipatory algorithms that simulate sample future scenarios and merge solutions to those samples are a common choice (?).
Reinforcement Learning (RL) methods have been successfully used for solving the Traveling Salesman Probelm (TSP). ? (?) employ a pointer network to optimize the policy, and train an actorcritic algorithm with the negative tour length as the reward signal. ? (?) develop a single model based on graph embeddings. They use the DQN algorithm to train a greedy policy and graph embedding network simultaneously. For VRP, ? (?) utilize the transformer neural network architecture to develop a model fully based on attention layers. Their proposed model is trained by policy gradients with a greedy baseline, and evaluated on both standard Capacitated VRP (CVRP) and Split Deliverry VRP (SDVRP). ? (?) further improve the algorithm using embedded inputs and allow the customers and their demands to be stochastic.
4.3 Baseline Algorithm
We modify the classical threeindex Mixed Integer Programming (MIP) formulation (?; ?). This deterministic MIP is solved for the available orders in the environment. It is further resolved when a new order arrives, if one of the existing orders expires, or when all of the actions are executed. When we solve the MIP, the orders that had been already accepted or were in transit are modeled as starting conditions. The details of our MIP model is in Appendix D. We leave anticipatory models to future work.
4.4 Reinforcement Learning Algorithm
To train the policy, we apply the APEX DQN algorithm (?) due to its ability to scale by generating more experience replays and picking from them in a distributed prioritized fashion^{2}^{2} 2 We also applied PPO with default hyperparameters provided in RLLib. The reward increased much slower than that of APEXDQN and was not able to beat the baseline after 1 day of training.. We use a twolayer neural network with 512 hidden units each. We list all hyperparameters in Appendix E. We use a machine with 1 GPU and 8 CPUs for our experiments.
4.5 Results
For multiple problem scales determined by map size ($\{5\times 5,8\times 8\}$), maximum number of orders ($order\in \{5,10\}$) and number of pickup locations in the map ($n\in \{2,3\}$), we conduct experiments to compare the behavior of RL and the MIP baseline solutions. We examine the trained RL policy’s ability to generalize to different order distributions. The hyperparameters used for algorithm training are taken from RLLib robotics examples without finetuning. Overall, the RL approach outperforms the baseline across different instance sizes, and generalizes well for unseen order patterns.
Figure 4(a)4(b) compares the episodic rewards for the RL policy and the baseline algorithm during training. The shaded band around the mean line shows the minimum and maximum rewards. For readability, the graphs are clipped to skip the initial $3.5$ hours of training as the rewards are highly negative and skew the Yaxis scale. With larger map size or higher order number, the training time required for the agent to achieve rewards equivalent to baseline is higher. This is expected as both the observation and action space increase, the agent requires more exploration to converge to a reasonable policy. Even after three days of training, the rewards for larger instances keep growing gradually. The agent slowly learns to fully utilize the vehicle capacity and to not accept orders which are likely to incur penalty.
As the agent is trained longer, there is potential for the policy to overfit. In order to test generalizability, we train another policy with a shifted hot orderzone distribution $(0.1,0.5,0.3,0.1)$, and evaluate against the baseline results both using the original orderzone distribution $(0.5,0.3,0.1,0.1)$. Table 2 summarizes the evaluation results. It is observed that this policy is able to outperform the baseline consistently during evaluation phase.
We also present the rewards with and without the order miss penalty ${q}_{failure}$ for the same trained policy to further understand the agent’s behavior about order delivery misses. The reward values are close for problems with fewer number of pickup locations and fewer orders. As the number of pickup locations becomes larger, the gap between the rewards increases. One explanation is the agent struggles with the increased complexity of order deliveries from different pickup locations, and its action often changes in the middle of a delivery, so the likelihood of missing the order delivery increases. This behavior is also seen if the number of orders is higher. Even though the RL agent reward is better than the baseline, there is still scope for improvement by reducing the number of order delivery misses.
Problem Instance  RL Evaluation Reward  MIP Reward  

Without ${q}_{failure}$  With ${q}_{failure}$  



595.91  



642.62  



640.01  



410.58  



246.25 
5 Conclusion and Future Work
In this paper, we have established Deep Reinforcement Learning (DRL) benchmarks for three canonical dynamic resource allocation problems: Bin Packing, Newsvendor, and Vehicle Routing. We formulated a Markov Decision Process for each problem, and compared established algorithms with vanilla RL techniques. In each case, RL policy either outperforms or is competitive with the baseline. While we do not overcome the NPhardness of the problems, as wallclock training time scales with problem size, we find that DRL is a good tool for these problems. These results illustrate the potential value of RL for a wide range of realworld industrial online stochastic problems, from order assignment, to retail buying, to realtime routing. Our experiments indicate the following issues as important for making RL solutions more practical in the future: building effective simulators, learning from historical data, initialization of the RL model, overfitting to a particular distribution and enforcement of constraints (e.g. via action masking). We used outofthebox RL algorithms, with almost no problemspecific tweaking and simple 2layer neural networks. Further research can add value by testing various RL algorithms, neural network structures, etc. and seeing their relative value in each problem especially as problem complexity scales up (i.e., solving realworld instances of these problems will likely require innovation on the RL side). In this paper we only looked at canonical, theoretical models. Further research should endeavor to apply these RL techniques to realworld industrial problems.
References
Appendix A Bin Packing  HyperParameters
Discount factor  0.995  KL coefficient  1.0 
Experience Buffer  320000  Learning rate  0.0001 
SGD Minibatch  32768  Epochs  10 
Entropy coefficient  0  # Workers  31 
Episode length  10000  clip param  0.3 
Appendix B Bin Packing Results  Bin Size of 9
Algorithm  Perfect Pack  Bounded Waste  Linear Waste  

$\mu $  $\sigma $  $\mu $  $\sigma $  $\mu $  $\sigma $  
RL with PP  39.4  15.1  29.0  5.7  106.8  70.8 
RL with BW  57K  24K  5.06  2.77  79K  8.5K 
RL with LW  47.1  7.7  41.3  11.8  71.8  10 
SS  50.2  28.6  17.27  3.21  212.2  68.7 
Best Fit  123.7  8.3  127.49  9.6  130.6  7.7 
Appendix C Newsvendor  HyperParameters
Learning rate  0.00001 

SGD MiniBatch Size  32768 
Train Batch Size  320000 
Episode length  40 
Discount factor  1 
Epochs  5 
Neural Network  Fully connected, 64x32 
Action Space  Normalized between 0 to 1 
Appendix D VRP baseline MIP formulation
Sets
$V$$:$  Current vehicle location, $V=\{0\}$ 

$P$$:$  Pickup nodes (copies of the restaurant nodes, associated with the orders that are not in transit) 
$D$$:$  Delivery nodes representing the orders that are not in transit, $D=\{jj=i+n,i\in P,n=P\}$ 
$A$$:$  Nodes representing the orders that are accepted by the driver; $A\subset D$ 
$T$$:$  Delivery nodes representing the orders that are in transit 
$R$$:$  Nodes representing the restaurants, used for final return) 
$N$$:$  Set of all nodes in the graph, $N=V\cup P\cup D\cup T\cup R$ 
$E$$:$  Set of all edges, $E=\{(i,j),\forall i,j\in N\}$ 
Decision variables
${x}_{ij}$$:$  Binary variable, 1 if the vehicle uses the arc from node $i$ to $j$, 0 otherwise; $i,j\in N$ 

${y}_{i}$$:$  Binary variable, 1 if the order $i$ is accepted, 0 otherwise; $i\in P$ 
${Q}_{i}$$:$  Auxiliary variable to track the capacity usage as of node $i$; $i\in N$ 
${B}_{i}$$:$  Auxiliary variable to track the time as of node $i$; $i\in N$ 
Parameters
$n$$:$  Number of orders available to pick up, $n=P$ 

${c}_{ij}$$:$  Symmetric Manhattan distance (in miles) matrix between node $i$ and $j$, $(i,j)\in E$ 
${q}_{i}$$:$  Supply (demand) at node $i$, ${q}_{0}=T;{q}_{i}=1,\forall i\in P;{q}_{i}=1,\forall i\in D\cup T;{q}_{i}=0\in R$ 
${l}_{i}$$:$  Remaining time to deliver order $i$, $i\in D\cup T$ 
$m$$:$  Travel cost per mile 
${r}_{i}$$:$  Revenue for order associated with pick up node $i$, $i\in P$ 
$U$$:$  Vehicle capacity 
$M$$:$  A very big number 
$t$$:$  Time to travel one mile 
$d$$:$  A constant positive service time spent on accept, pickup, delivery 
Model
$$\begin{array}{cccccc}& \hfill \underset{x,y,Q,B}{\mathrm{max}}& \multicolumn{3}{c}{\sum _{i\in P}{r}_{i}{y}_{i}m\sum _{(i,j)\in E}{c}_{ij}{x}_{ij}}& \\ & \hfill \text{s.t.}\mathit{\hspace{1em}\hspace{1em}}\sum _{j\in N}{x}_{ij}& \hfill =\hfill & {y}_{i}\hfill & \hfill \forall i\in P\hfill & \\ & \hfill \sum _{j\in N}{x}_{ij}\sum _{j\in N}{x}_{i+n,j}& \hfill =\hfill & 0\hfill & \hfill \forall i\in P\hfill & \\ & \hfill {y}_{i}& \hfill =\hfill & 1\hfill & \hfill \forall i\in A\hfill & \\ & \hfill \sum _{j\in N}{x}_{ij}& \hfill =\hfill & 1\hfill & \hfill \forall i\in V\cup T\hfill & \\ & \hfill \sum _{i\in N\setminus R}\sum _{j\in R}{x}_{ij}& \hfill =\hfill & 1\hfill & & \\ & \hfill \sum _{j\in N\setminus R}{x}_{ji}\sum _{j\in N}{x}_{ij}& \hfill =\hfill & 0\hfill & \hfill \forall i\in P\cup D\cup T\hfill & \\ & \hfill {Q}_{i}+{q}_{j}M(1{x}_{ij})& \hfill \le \hfill & {Q}_{j}\hfill & \hfill \forall i,j\in N\hfill & \\ & \hfill \mathrm{max}(0,{q}_{i})& \hfill \le \hfill & {Q}_{i}\hfill & \hfill \forall i\in N\hfill & \\ & \hfill \mathrm{min}(U,U+{q}_{i})& \hfill \ge \hfill & {Q}_{i}\hfill & \hfill \forall i\in N\hfill & \\ & \hfill {B}_{i}+d+{c}_{ij}tM(1{x}_{ij})& \hfill \le \hfill & {B}_{j}\hfill & \hfill \forall i,j\in N\hfill & \\ & \hfill {B}_{i}+{c}_{i,i+n}tM(1{y}_{i})& \hfill \le \hfill & {B}_{i+n}\hfill & \hfill \forall i\in P\hfill & \\ & \hfill d\sum _{i\in P\setminus A}{y}_{i}& \hfill =\hfill & {B}_{0}\hfill & & \\ & \hfill {B}_{i}& \hfill \le \hfill & {l}_{i}\hfill & \hfill \forall i\in D\cup T\hfill & \\ & \hfill {x}_{ij},{y}_{i}& \hfill \in \hfill & \{0,1\}\hfill & \hfill \forall i,j\in N\hfill & \end{array}$$  (5) 
Appendix E Vehicle Routing Problem  HyperParameters
Replay buffer alpha  0.5  # steps for Q  3 
Replay buffer eps  0.1  Learning rate  1e3 
Final explore eps  0.01  Adam epsilon  1.5e4 
Replay buffer size  1e6  # Workers  7 
Episode length  1000  Training Batch  512 