ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems

  • 2019-11-24 23:49:48
  • Bharathan Balaji, Jordan Bell-Masterson, Enes Bilgin, Andreas Damianou, Pablo Moreno Garcia, Arpit Jain, Runfei Luo, Alvaro Maggiar, Balakrishnan Narayanaswamy, Chun Ye
  • 1


Reinforcement Learning (RL) has achieved state-of-the-art results in domainssuch as robotics and games. We build on this previous work by applying RLalgorithms to a selection of canonical online stochastic optimization problemswith a range of practical applications: Bin Packing, Newsvendor, and VehicleRouting. While there is a nascent literature that applies RL to these problems,there are no commonly accepted benchmarks which can be used to compare proposedapproaches rigorously in terms of performance, scale, or generalizability. Thispaper aims to fill that gap. For each problem we apply both standard approachesas well as newer RL algorithms and analyze results. In each case, theperformance of the trained RL policy is competitive with or superior to thecorresponding baselines, while not requiring much in the way of domainknowledge. This highlights the potential of RL in real-world dynamic resourceallocation problems.


Quick Read (beta)

ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems

Bharathan Balaji, Jordan Bell-Masterson, Enes Bilgin, Andreas Damianou, Pablo Moreno Garcia,
Arpit Jain, Runfei Luo, Alvaro Maggiar, Balakrishnan Narayanaswamy, Chun Ye

Reinforcement Learning (RL) has achieved state-of-the-art results in domains such as robotics and games. We build on this previous work by applying RL algorithms to a selection of canonical online stochastic optimization problems with a range of practical applications: Bin Packing, Newsvendor, and Vehicle Routing. While there is a nascent literature that applies RL to these problems, there are no commonly accepted benchmarks which can be used to compare proposed approaches rigorously in terms of performance, scale, or generalizability. This paper aims to fill that gap. For each problem we apply both standard approaches as well as newer RL algorithms and analyze results. In each case, the performance of the trained RL policy is competitive with or superior to the corresponding baselines, while not requiring much in the way of domain knowledge. This highlights the potential of RL in real-world dynamic resource allocation problems.

ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems

Bharathan Balaji, Jordan Bell-Masterson, Enes Bilgin, Andreas Damianou, Pablo Moreno Garcia, Arpit Jain, Runfei Luo, Alvaro Maggiar, Balakrishnan Narayanaswamy, Chun Ye Amazon

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

Reinforcement learning (RL) has achieved state of the art results in gaming (?), robotics (?) and others. Our work relates to the growing literature of applying RL to optimization problems. (?) show RL techniques produce near optimal solutions for the traveling salesman (TSP) and knapsack problems. (?) use RL to solve TSP and its variants: vehicle routing, orienteering, and a stochastic variant of prize-collecting TSP. (?) solve both static and online versions of the vehicle routing problem. (?) apply RL to the dual sourcing inventory replenishment problem, and further demonstrate results on a real dataset. (?) apply RL to online versions of the knapsack, secretary and adwords problems. (?) apply RL to the beer game problem. (?) use RL for fleet management of taxis on a real life dataset.

Our contribution is to extend the existing RL literature to a set of dynamic resource allocation problems which parallel real-world problems. In particular, we present benchmarks for three classic problems: Bin Packing, Newsvendor and Vehicle Routing. In each case, we show that trained policies from out-of-the-box RL algorithms with simple 2 layer neural networks are competitive with or superior to established approaches. We open source our code11 1 https://github.com/awslabs/or-rl-benchmarks and parameterize the complexity of the problems in order to encourage fair comparisons of algorithmic contributions. Each environment is implemented with the OpenAI Gym interface (?) and integrated with the RLlib (?) library so researchers can replicate our results, test algorithms and tune hyperparameters.

2 Bin Packing

In the classic version of the bin packing problem, we are given items of different sizes and need to pack them into as few bins as possible. In the online stochastic version, items arrive one at a time with item sizes drawn from a fixed but unknown distribution. Many resource allocation problems in Operations Research and Computer Science face uncertain supply, and can be cast as variants of the online bin packing problem. In warehouse and transportation operations, variants of bin packing can be seen in: the order assignment problem (where we assign orders to fulfillment resources), the tote packing problem (where we fill items as they arrive into totes for shipment), and the trailer truck packing problem. In computing, bin packing problems arise in cloud computing scenarios, where virtual machines with varying memory and cpu requirements are allocated to servers with fixed capacity.

2.1 Problem Formulation

We use a formulation of the bin packing problem similar to ? (?). In the online stochastic bin packing problem, items arrive online, one in each time period t, with t{1,,T}. Items can be of different types j{1,,J}. The size of type j is sj and the probability that an item is of type j is pj. Without loss of generality, we assume item types are indexed in the increasing order of their size: s1<s2<<sJ. Upon arrival, the item needs to be packed into one of the bins, each with size B (we assume that sJ<B<). A packing is considered feasible if the total size of the items packed in each bin does not exceed the bin size. The task is to find a feasible packing that minimizes the number of bins used to pack all of the items that arrive within the time horizon. We assume the item sizes sj and bin size B are integers. We assume the number of bins one can open is unlimited and denote the sum of item sizes in a bin k as level hk. After t items have been packed, we denote the number of bins at some level h as Nh(t), where h{1,,B}.

It can be shown that minimizing the number of non-empty bins is equivalent to minimizing the total waste (i.e. empty space) in the partially filled bins. In real applications (e.g. packing trucks, or virtual machines), there is a dollar-value cost associated with the consumption of these resources, so at any time horizon our objective is to minimize total waste t=0TW(t), where

W(t)h=1B-1Nh(t)(B-h). (1)

We use WFA(t) to denote the total waste after step t of algorithm A when the input samples come from distribution F. For RL, we define the cumulative reward up to time step t to be WFRL(t). ? (?) showed that any discrete distribution falls into one of three categories based on expected distribution E[WFOPT(t)], where OPT is an offline optimal policy.

  1. 1.

    Linear waste (LW): E[WFOPT(t)]=Θ(t), e.g. B=9, two item types of size {2,3} with probability {0.8,0.2} respectively.

  2. 2.

    Perfectly Packable (PP): E[WFOPT(t)]=Θ(t), e.g. B=9, two item types of size {2,3} with probability {0.75,0.25} respectively.

  3. 3.

    PP with bounded waste (BW): E[WFOPT(t)]=Θ(1), e.g. B=9, two item types of size {2,3} with probability {0.5,0.5} respectively.

We will train an RL policy for each of the three distribution types and compare our policy to the appropriate baseline.

We formulate the bin packing problem as an MDP, where the state St𝒮 is the current item with size sj and the number of bins at each level is Nh(t), where h{1,,B}. The action a is to pick a bin level which can fit the item. Thus, the number of actions possible is B with one action for each level and action 0 corresponds to opening a new bin. An episode defines the start and end of simulation. Initially, all the bins are empty. The reward Rt is the negative of incremental waste as each item is put into a bin sj. If the item is put into an existing bin, the incremental waste will reduce by item size. If the item is put into a new bin, the waste increases by the empty space left in the new bin. T items need to be placed in the bins, after which the episode ends. We leave varying number of time steps for future work. We impose action masking for infeasible actions, such as picking a level for which bins do not exist yet, by outputting a probability for every action (regardless of feasibility) but multiplying the infeasible action probabilities by zero so they are never executed.

(a) RL vs baseline for BW distribution
(b) RL vs baseline for PP distribution
(c) RL vs baseline for LW distribution
Figure 1: Comparison of episodic rewards between RL and Best Fit baseline during training.

2.2 Related Work

Bin packing is a well-studied problem in the operations research and computer science literature. The problem is already NP-hard in its basic form. As a result, many of the classical approaches to bin packing analyze the performance of approximation algorithms. We refer the readers to the survey (?) for algorithmic approaches to classical bin packing and its generalizations.

For online bin packing, a simple heuristic – Best Fit – is known to use at most 1.7 times the optimal number of bins in the worst case (?). Best Fit places an item in a bin where, if the item were to be packed, would leave the least amount of space. Another competitive heuristic is Sum of Squares (SS) heuristic (?). In particular, SS is proven to be asymtotically optimal (up to constants) as the episode length grows.

The simple heuristics described above are distribution agnostic. More sophisticated algorithms learn an empirical estimate of the item size distribution, leverage such distribution to solve a linear program, and use its dual to guide the online policy (?). This approach has been used to solve online packing and covering problems (??).

2.3 Baseline Algorithms

We use the Sum of Squares (SS) heuristic and Best Fit (BF) as our baseline algorithms. When the tth item of size s arrives, SS picks a bin of level h* that minimizes the value of the following sum-of-squares potential:

h=1B-1(Nh(t))2. (2)

It can be shown that minimizing (2) is equivalent to:

h*=argminh:Nh(t-1)>0andh+sB[Nh+s(t-1)-Nh(t-1)],. (3)

where, N0=NB=0. Intuitively, SS tries to equalize the number of bins at each level. Due its simplicity, we implemented (3) version of SS.

BF selects a bin at the highest level that can fit the item:

h*=argmaxh:Nh(t-1)>0andh+sBh (4)

2.4 Reinforcement Learning Algorithm

We use the Proximal Policy Optimization (PPO) algorithm (?). We use a two-layer neural network with 256 hidden nodes each for both the actor and the critic. The input to both actor and critic network is the state, the output of the actor network is a vector giving the probabilities of taking any action in the action space, and the output of the critic network is predicted cumulative discounted reward from that state. During training, the agent explores the state space by sampling from the probability distribution of the actions generated by the policy network. We mask actions by reducing the probability of invalid actions to 0. We use a single machine with 4 GPUs and 32 CPUs for our experiments. We list all the hyperparameters used in Appendix A.

2.5 Results

For each sample item size distribution (BW, PP, LW), we train the RL algorithm (PPO) and compare to the baseline algorithms (SS and BF). We consider two variations, bin size of 9, with 100 items and distributions listed in section 2.1, and bin size of 100, with 1000 items and the following item size distribution:

  1. 1.

    item sizes: [1,2,3,4,5,6,7,8,9]

  2. 2.

    item probabilities for BW:

  3. 3.

    item probabilities for PP:

  4. 4.

    item probabilities for LW: [0,0,0,1/3,0,0,0,0,2/3].

Figure 1 plots the reward earned by the RL policy in training (blue) vs the Best Fit baseline (red) for bin size 100 and different item size distributions (BW, PP, and LW) as a function of training time (measured in minutes). The solid lines represent the mean reward of each policy, and the shaded bands represent the min/max rewards. By the end of training, RL either matches or outperforms the baseline policy for all three item size distributions. In particular, the reward gap between RL and baseline is the largest for LW distribution (which is expected, as both BF and SS are known to be sub-optimal for LW distribution).

In Table 1, we inspect numerically the trained RL policy vs. baseline for bin size 100. Supporting what we observed in the initial figures, this table shows the final RL policy outperforms or matches the baseline for each distribution.

We test generalization of the RL policy by evaluating the trained policy with a different item distribution than the one it was trained on. For PP and BW distributions, the trained policies translate well. Both the PP and BW policies perform as well as the baseline solutions for the LW distribution. The policy trained on the LW distribution generalizes reasonably well but does not do as well as the baseline solutions in the BW and PP distributions. We did observe overfitting if we pick model iterations from much later in training. We leave the study of overfitting and generalization across distributions as future work. A note on scaling: the training time for bin size 100 is about 3x, 4x and 10x more than bin size 9 for PP, BW and LW respectively. The bin size 9 results can be found in the supplementary material.

Algorithm Perfect Pack Bounded Waste Linear Waste
μ σ μ σ μ σ
RL with PP -49.0 29.5 -48.0 29.5 -1358 44.2
RL with BW -47.6 29.3 -53.9 26.4 -1368 48.0
RL with LW -258.6 69.3 -143.9 84.9 -880.2 43
SS -56.54 28.9 -56.61 30.2 -2091 92
Best Fit -52.01 29.5 -51.4 28.9 -1314 53
Table 1: RL and baseline solution comparison for bin packing. Mean and standard deviations are across 100 episodes.

Finally, we inspect the relative structure of the policies to ensure that RL is learning a sensible solution. In particular, we plot the state variable values as a function of the number of steps in an episode. Intuitively, the integral of these plots represents the waste, which we want to minimize. An optimal policy should show a (relatively) flat surface. We use bin size of 9 for this analysis for ease of manual inspection and study the linear waste distribution that highlights the difference between the Sum of Squares baseline and RL distinctly. From Figure 2, we see that the baseline policy leaves more open bins at a lower fullness, whereas RL only leaves open bins at level 8 (which cannot be closed once they reach that level). This indicates that the learned RL polocy is reasonable. For other distributions, the graphs for both the baseline and RL policy look similar to each other.

Figure 2: RL vs baseline solution for LW distribution

3 Multi-Period Newsvendor with Lead Times

The Newsvendor problem (see e.g. ? (?)) is a seminal problem in inventory management wherein we must decide on an ordering decision (how much of an item to purchase from a supplier) to cover a single period of uncertain demand. The objective is to trade-off the various costs incurred and revenues achieved during the period, usually consisting of sales revenue, purchasing and holding costs, loss of goodwill in the case of missed sales, and the terminal salvage value of unsold items.

In practice, decisions are rarely isolated to a single period, and they are repeatedly and periodically taken and thus have a downstream impact. This makes the problem non-trivial and has no known optimal solution, as compared to the single-period Newsvendor which has a known optimal solution when the demand distribution is known. Additionally, purchased units do not, in general, arrive quasi-instantaneously, but rather after a few periods of transit from the vendor to their final destination, known as the lead time. The presence of lead times further complicates the problem. Solving the multi-period newsvendor problem with lead times and lost sales is a notoriously difficult problem (?). It requires keeping track of orders placed in different periods, leading to what is known as the curse of dimensionality, rendering any exact solution impractical even for small lead times of 2 and 3 periods, and outright infeasible at higher dimensions. As a result, the problem forms a good test-bed for RL algorithms given that the observation of rewards is delayed by the lead time and that it can be formulated as a Markov Decision Problem. A number of heuristics have been developed for the lost sales problem, often based on order-up-to level policies for the equivalent model with backlogged demand. Comparisons in the performance of these two policies have been studied (?), and it has been shown that order-up-to policies are asymptotically optimal (?), thus making for good benchmark policies.

Figure 3: RL training reward in the multi-period newsvendor problem with Poisson demand and vendor lead period of 5.
(a) PPO Checkpoint 50
(b) PPO Checkpoint 1000
(c) PPO Checkpoint 1500
(d) PPO Checkpoint 10050
Figure 4: The Newsvendor policy graphs show the RL-learned policy for the quantity we will buy, as a function of how much inventory we have already ordered. The axes show the inventory to arrive in three and four periods respectively and the contour lines show the number of items bought by the RL policy. The agent policy improves over the training period. In the final policy, If we have already ordered a lot of inventory, this graph shows we will order less at this timestep.

3.1 Problem formulation

We consider the stationary, single-product, multi-period dynamic inventory management problem with vendor lead time (VLT) and stochastic demand. Here, the VLT l refers to the number of time steps between the placement and receipt of an order. The demand D is assumed to be stationary and Poisson distributed with mean μ. Items are purchased at a cost c and sold at a price p, and incur a penalty for lost sales k for each unit of unmet demand while any unit left over at the end of a period incurs a holding cost h. Items do not perish. A discount factor γ is used. No terminal value is awarded for the inventory state at end of episode.

The problem is formulated as a Markov Decision Process:


The state S of the problem is given by


where x0 is the on-hand inventory, x1 the units to be received one period hence, and so on.


In each period the state of the system is observed and an action A=q is taken, consisting of the size of the order placed and to arrive l time periods later.


We first incur the purchasing cost corresponding to the procured units given the action a. A realization d of the demand D (Poisson distributed with mean μ) is then observed, and demand is satisfied as much as is possible given on-hand levels. Missed sales incur a loss of goodwill k per unit, while leftover units incur a holding cost h:

R =pmin(x0,d)-ca-h(x0-d)+-k(d-x0)+.

where (x)+=max(x,0).


The state of the system S is then updated to S+ by moving all pipeline units downstream and incorporating the newly purchased units:

S+ =(p,c,h,k,μ,(x0-d)++x1,x2,,xl-1,a).

We do not impose action masking because there is no infeasible action in the pre-specified, positive continuous action space. We convert the continuous buy quantity to integer by post-process rounding.

3.2 Related work

Data-centric approaches (?) and reinforcement learning approachse (?) have recently been suggested for the newsvendor problem. These have so far still remained focused on the single period problem and often trying to learn some of the inputs, such as demand. A few other papers have considered Reinforcement Learning in the context of inventory management, such as (?), where a dual sourcing problem is tackled using RL.

3.3 Baseline Algorithm

As noted in the beginning of this section, it is impractical or even infeasible to solve the multi-stage newvendor problem exactly. However, it is possible to use heuristics that provide good approximations to the optimal solution. In particular, a way to tackle the problem is to approximate it by its backlogging counterpart, where orders are not lost if unsatisfied, for which a closed form solution of the optimal policy exists in the form of an order-up-to policy characterized by the following critical ratio:


As a result, letting z*=Fl-1(CR), where Fl is the cumulative distribution function of the l period demand, the policy is given by:

a =(z*-i=0l-1xi)+.

3.4 Reinforcement Learning Algorithm

We use Trust Region Policy Optimization (TRPO) (?) as implemented in the RLLab package (?), where the policy is represented by a neural network. We use a single machine with 4 GPUs and 32 CPUs for our experiments. We use a neural network of size (64,32) and the hyperparameters presented in Appendix C.

3.5 Results

We present the results obtained using a VLT of 5 and time horizon of 40. The economic parameters were chosen so that p,c[0,100], h[0,5] and k[0,10], while the demand mean μ was such that μ[0,200].

We sampled problem parameters as follows: pU[0,100], cU[0,p], hU[0,min(c,5)], kU[0,10] and μU[0,200] for the economic and demand parameters; where U[a,b] denotes a uniformly random variable between a and b. The initial state was simply set to be 𝟎. Figure 3 compares the results obtained by the RL algorithm to the baseline. The RL solution eventually beats the benchmark.

While solving this problem numerically is intractable, the optimal inventory policy structures are well known. It is thus of interest to check whether their properties are being learned by the RL algorithm. Given the dimension of the problem, we cannot observe the entire policy, but can investigate slices thereof. We thus fix price, cost, holding cost, penalty for lost sale and mean demand to 50, 25, 0.5, 5 and 100, respectively, and plot the optimal policy in the space (0,0,0,x3,x4) in Figure 4, where x3 and x4 stand for the inventory to arrive in three and four periods respectively. The figure shows contour curves of the buying quantity as a function of the inventory state. Intuitively, a good policy will buy less if we already have inventory on-hand (or in the pipeline). Visually, this should look like a smooth, decreasing frontier. We observe that the algorithm is learning this desired policy structure over the training period and we can start to observe monotonicity of the policy along most directions.

4 Vehicle Routing Problem

One of the most widely studied problems in combinatorial optimization is the traveling salesman problem (TSP), which involves finding the shortest route that visits each node in a graph exactly once and returns to the starting node. TSP is an NP-hard problem and has a variety of practical applications from logistics to DNA sequencing. The vehicle routing problem (VRP) is a generalization of TSP where one or more vehicles are expected to visit the nodes in a graph, for example to satisfy customer demand. VRP is also a well-studied topic and has several applications, especially in supply chain and logistics. An important extension of VRP is where some of the information about the graph is revealed over time, such as demand at each node and travel time. This class of VRP is called dynamic VRP (DVRP, also known as real-time or online VRP). Stochastic VRP (SVRP) is where one or more problem parameters are stochastic with some known probability distributions (as opposed to arbitrary or adversarial distributions). In many real-life applications, the relevant VRP is both stochastic and dynamic (SDVRP), which is also focus of this work. We formulate a variant of SVRP and compare solution approaches from the Operations Research (OR) and Reinforcement Learning (RL) literature.

4.1 Problem Formulation

We consider a VRP variant that is of an on-demand delivery driver. Orders arrive on the phone app of the driver in a dynamic manner throughout the problem horizon. Each order has a delivery charge (reward) known to the driver at the time of order creation, and it is assigned to a pickup location (e.g. restaurant) in the city. “City” here refers to the Euclidean space in a grid map where the VRP environment is created. The city consists of mutually exclusive zones that generate orders at different rates. At each time step, an order generated with a constant probability and assigned to a zone (i.e. p1=0.5,p2=0.3,p3=0.1,p4=0.1 for zones 1,…,4). Orders have rewards that come from zone-specific truncated normal distribution with different ranges (i.e. with minimum and maximum dollar values of [8,12], [5,8], [2,5] and [1,3] for each zone, respectively). Orders have delivery time windows, which is within 60 minutes from the creation of the order. The driver has to accept an order and pick up the package from a given location prior to delivery. Orders that are not accepted disappear probabilistically (i.e. with a time-out probability of 0.15 per time step) and assumed to be taken by some other competitor driver in the city. The vehicle has a capacity limit of 4 orders on the vehicle, but the driver can accept unlimited orders and plan the route accordingly. The driver incurs a cost per time step and unit distance traveled (0.1 for both), representing the money value of time and travel costs. The driver’s goal is to maximize the total net reward over an episode of 1000 time steps. This version of VRP is known as stochastic and dynamic capacitated vehicle routing problem with pickup and delivery, time windows and service guarantee. We choose this particular variant, which is less studied in the literature, because it more closely resembles real-world instances of the problem and gives us higher confidence that RL can generalize beyond our toy setup.


We include pickup location pt, driver info, and order info. Driver info contains the driver’s position ht and the capacity left ct. Order info contains the orders’ location lt, status wt (open, accepted, picked up or delivered/inactive), the time elapsed since each order’s generation et and the corresponding dollar value vt. Thus, the state is St=(pt,ht,ct,lt,wt,et,vt).


The agent chooses an action At from five options – accept the open order iP, pick up the accepted order iA, pick up the accepted order iA, the pickup location jR, or wait and stay unmoved.


The reward Rt is the total value of all delivered orders ft minus the cost qt. ft is divided into 3 equal parts for reward shaping: when the order gets accepted, picked up, and delivered respectively. Thus we have:


where qt=(qtime+qmove+qfailure). qtime is the time cost, qmove is the moving cost (per time step). qfailure is a large penalty (50) if the agent accepts an order but fails to deliver within the promised time.

The vehicle’s capacity remains unchanged if an order is accepted but not picked up. In effect, this grants the agent the flexibility to accept more orders than available capacity, which can be picked up later when space allows. The action of heading to a specific pickup location enables the agent to learn to stay near popular pick up locations. We impose action masking during the policy training. The agent cannot perform the following invalid actions: (i) pick up an order when its remaining capacity is 0; (ii) pick up an order that is not yet accepted; (iii) deliver an order that is not in transit.

4.2 Related Work

There is a substantial literature on VRP (?). The closest VRP variant to the problem considered in this paper is the Pickup and Delivery Problem with Time Windows (PDPTW) (?), which has some additional complexities over vanilla VRP. Due to such complexities, there are fewer exact solution approaches (??), and the majority of the literature focuses on heuristics. When the problem is also stochastic and dynamic, exact solution methods become intractable except for very specific problem settings. In such cases, anticipatory algorithms that simulate sample future scenarios and merge solutions to those samples are a common choice (?).

Reinforcement Learning (RL) methods have been successfully used for solving the Traveling Salesman Probelm (TSP). ? (?) employ a pointer network to optimize the policy, and train an actor-critic algorithm with the negative tour length as the reward signal. ? (?) develop a single model based on graph embeddings. They use the DQN algorithm to train a greedy policy and graph embedding network simultaneously. For VRP, ? (?) utilize the transformer neural network architecture to develop a model fully based on attention layers. Their proposed model is trained by policy gradients with a greedy baseline, and evaluated on both standard Capacitated VRP (CVRP) and Split Deliverry VRP (SDVRP). ? (?) further improve the algorithm using embedded inputs and allow the customers and their demands to be stochastic.

4.3 Baseline Algorithm

We modify the classical three-index Mixed Integer Programming (MIP) formulation (??). This deterministic MIP is solved for the available orders in the environment. It is further resolved when a new order arrives, if one of the existing orders expires, or when all of the actions are executed. When we solve the MIP, the orders that had been already accepted or were in transit are modeled as starting conditions. The details of our MIP model is in Appendix D. We leave anticipatory models to future work.

(a) RL vs baseline solution for VRP with 3 pick-up locations, 5 orders and map sizes 5×5 and 8×8
(b) RL vs baseline solution for VRP with 2 pick-up locations, map 5×5, and number of orders 5 and 10.
Figure 5: RL vs baseline during policy training process.

4.4 Reinforcement Learning Algorithm

To train the policy, we apply the APE-X DQN algorithm (?) due to its ability to scale by generating more experience replays and picking from them in a distributed prioritized fashion22 2 We also applied PPO with default hyperparameters provided in RLLib. The reward increased much slower than that of APEX-DQN and was not able to beat the baseline after 1 day of training.. We use a two-layer neural network with 512 hidden units each. We list all hyperparameters in Appendix E. We use a machine with 1 GPU and 8 CPUs for our experiments.

4.5 Results

For multiple problem scales determined by map size ({5×5,8×8}), maximum number of orders (order{5,10}) and number of pick-up locations in the map (n{2,3}), we conduct experiments to compare the behavior of RL and the MIP baseline solutions. We examine the trained RL policy’s ability to generalize to different order distributions. The hyperparameters used for algorithm training are taken from RLLib robotics examples without fine-tuning. Overall, the RL approach outperforms the baseline across different instance sizes, and generalizes well for unseen order patterns.

Figure 4(a)-4(b) compares the episodic rewards for the RL policy and the baseline algorithm during training. The shaded band around the mean line shows the minimum and maximum rewards. For readability, the graphs are clipped to skip the initial 3.5 hours of training as the rewards are highly negative and skew the Y-axis scale. With larger map size or higher order number, the training time required for the agent to achieve rewards equivalent to baseline is higher. This is expected as both the observation and action space increase, the agent requires more exploration to converge to a reasonable policy. Even after three days of training, the rewards for larger instances keep growing gradually. The agent slowly learns to fully utilize the vehicle capacity and to not accept orders which are likely to incur penalty.

As the agent is trained longer, there is potential for the policy to overfit. In order to test generalizability, we train another policy with a shifted hot order-zone distribution (0.1,0.5,0.3,0.1), and evaluate against the baseline results both using the original order-zone distribution (0.5,0.3,0.1,0.1). Table 2 summarizes the evaluation results. It is observed that this policy is able to outperform the baseline consistently during evaluation phase.

We also present the rewards with and without the order miss penalty qfailure for the same trained policy to further understand the agent’s behavior about order delivery misses. The reward values are close for problems with fewer number of pick-up locations and fewer orders. As the number of pick-up locations becomes larger, the gap between the rewards increases. One explanation is the agent struggles with the increased complexity of order deliveries from different pick-up locations, and its action often changes in the middle of a delivery, so the likelihood of missing the order delivery increases. This behavior is also seen if the number of orders is higher. Even though the RL agent reward is better than the baseline, there is still scope for improvement by reducing the number of order delivery misses.

Problem Instance RL Evaluation Reward MIP Reward
Without qfailure With qfailure
5 by 5 map, 5 orders
2 pick-up locations
5 by 5 map, 5 orders
3 pick-up locations
5 by 5 map, 10 orders
2 pick-up locations
8 by 8 map, 5 orders
2 pick-up locations
8 by 8 map, 5 orders
3 pick-up locations
Table 2: RL and baseline solution comparison for VRP. Values in the brackets are standard deviations and mean reward is calculated using 50 episodes.

5 Conclusion and Future Work

In this paper, we have established Deep Reinforcement Learning (DRL) benchmarks for three canonical dynamic resource allocation problems: Bin Packing, Newsvendor, and Vehicle Routing. We formulated a Markov Decision Process for each problem, and compared established algorithms with vanilla RL techniques. In each case, RL policy either outperforms or is competitive with the baseline. While we do not overcome the NP-hardness of the problems, as wall-clock training time scales with problem size, we find that DRL is a good tool for these problems. These results illustrate the potential value of RL for a wide range of real-world industrial online stochastic problems, from order assignment, to retail buying, to real-time routing. Our experiments indicate the following issues as important for making RL solutions more practical in the future: building effective simulators, learning from historical data, initialization of the RL model, overfitting to a particular distribution and enforcement of constraints (e.g. via action masking). We used out-of-the-box RL algorithms, with almost no problem-specific tweaking and simple 2-layer neural networks. Further research can add value by testing various RL algorithms, neural network structures, etc. and seeing their relative value in each problem especially as problem complexity scales up (i.e., solving real-world instances of these problems will likely require innovation on the RL side). In this paper we only looked at canonical, theoretical models. Further research should endeavor to apply these RL techniques to real-world industrial problems.



Appendix A Bin Packing - HyperParameters

Discount factor 0.995 KL coefficient 1.0
Experience Buffer 320000 Learning rate 0.0001
SGD Mini-batch 32768 Epochs 10
Entropy coefficient 0 # Workers 31
Episode length 10000 clip param 0.3
Table 3: Hyperparameters used in PPO for Bin Packing

Appendix B Bin Packing Results - Bin Size of 9

Algorithm Perfect Pack Bounded Waste Linear Waste
μ σ μ σ μ σ
RL with PP -39.4 15.1 -29.0 5.7 -106.8 70.8
RL with BW -57K 24K -5.06 2.77 -79K 8.5K
RL with LW -47.1 7.7 -41.3 11.8 -71.8 10
SS -50.2 28.6 -17.27 3.21 -212.2 68.7
Best Fit -123.7 8.3 -127.49 9.6 -130.6 7.7
Table 4: RL and baseline solution comparison for bin packing with bin size of 9 and 1000 items. Mean and standard deviations are calculated across 100 episodes. Note that RL policy trained with BW distribution does not generalize well to PP and LW distribution, giving very negative rewards. Also note that SS outperforms BF for the bin size 9 test case compared to bin size 100.

Appendix C Newsvendor - HyperParameters

Learning rate 0.00001
SGD Mini-Batch Size 32768
Train Batch Size 320000
Episode length 40
Discount factor 1
Epochs 5
Neural Network Fully connected, 64x32
Action Space Normalized between 0 to 1
Table 5: Hyperparameters used in PPO for Newsvendor

Appendix D VRP baseline MIP formulation


V: Current vehicle location, V={0}
P: Pickup nodes (copies of the restaurant nodes, associated with the orders that are not in transit)
D: Delivery nodes representing the orders that are not in transit, D={j|j=i+n,iP,n=|P|}
A: Nodes representing the orders that are accepted by the driver; AD
T: Delivery nodes representing the orders that are in transit
R: Nodes representing the restaurants, used for final return)
N: Set of all nodes in the graph, N=VPDTR
E: Set of all edges, E={(i,j),i,jN}

Decision variables

xij: Binary variable, 1 if the vehicle uses the arc from node i to j, 0 otherwise; i,jN
yi: Binary variable, 1 if the order i is accepted, 0 otherwise; iP
Qi: Auxiliary variable to track the capacity usage as of node i; iN
Bi: Auxiliary variable to track the time as of node i; iN


n: Number of orders available to pick up, n=|P|
cij: Symmetric Manhattan distance (in miles) matrix between node i and j, (i,j)E
qi: Supply (demand) at node i, q0=|T|;qi=1,iP;qi=-1,iDT;qi=0R
li: Remaining time to deliver order i, iDT
m: Travel cost per mile
ri: Revenue for order associated with pick up node i, iP
U: Vehicle capacity
M: A very big number
t: Time to travel one mile
d: A constant positive service time spent on accept, pickup, delivery


maxx,y,Q,BiPriyi-m(i,j)Ecijxijs.t.  jNxij=yiiPjNxij-jNxi+n,j=0iPyi=1iAjNxij=1iVTiNRjRxij=1jNRxji-jNxij=0iPDTQi+qj-M(1-xij)Qji,jNmax(0,qi)QiiNmin(U,U+qi)QiiNBi+d+cijt-M(1-xij)Bji,jNBi+ci,i+nt-M(1-yi)Bi+niPdiPAyi=B0BiliiDTxij,yi{0,1}i,jN (5)

Appendix E Vehicle Routing Problem - HyperParameters

Replay buffer alpha 0.5 # steps for Q 3
Replay buffer eps 0.1 Learning rate 1e-3
Final explore eps 0.01 Adam epsilon 1.5e-4
Replay buffer size 1e6 # Workers 7
Episode length 1000 Training Batch 512
Table 6: Hyperparameters used in APEX-DQN for VRP, taken directly from the atari APEX example provided in RLLib. No hand-tuning was performed.