Deep Reinforcement Learning in System Optimization

  • 2019-08-04 05:55:56
  • Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Joseph Gonzalez, Krste Asanovic, Ion Stoica
  • 21

Abstract

The recent advancements in deep reinforcement learning have opened newhorizons and opportunities to tackle various problems in system optimization.Such problems are generally tailored to delayed, aggregated, and sequentialrewards, which is an inherent behavior in the reinforcement learning setting,where an agent collects rewards while exploring and exploiting the environmentto maximize the long term reward. However, in some cases, it is not clear whydeep reinforcement learning is a good fit for the problem. Sometimes, it doesnot perform better than the state-of-the-art solutions. And in other cases,random search or greedy algorithms could outperform deep reinforcementlearning. In this paper, we review, discuss, and evaluate the recent trends ofusing deep reinforcement learning in system optimization. We propose a set ofessential metrics to guide future works in evaluating the efficacy of usingdeep reinforcement learning in system optimization. Our evaluation includeschallenges, the types of problems, their formulation in the deep reinforcementlearning setting, embedding, the model used, efficiency, and robustness. Weconclude with a discussion on open challenges and potential directions forpushing further the integration of reinforcement learning in systemoptimization.

 

Quick Read (beta)

Abstract

The recent advancements in deep reinforcement learning have opened new horizons and opportunities to tackle various problems in system optimization. Such problems are generally tailored to delayed, aggregated, and sequential rewards, which is an inherent behavior in the reinforcement learning setting, where an agent collects rewards while exploring and exploiting the environment to maximize the long term reward. However, in some cases, it is not clear why deep reinforcement learning is a good fit for the problem. Sometimes, it does not perform better than the state-of-the-art solutions. And in other cases, random search or greedy algorithms could outperform deep reinforcement learning. In this paper, we review, discuss, and evaluate the recent trends of using deep reinforcement learning in system optimization. We propose a set of essential metrics to guide future works in evaluating the efficacy of using deep reinforcement learning in system optimization. Our evaluation includes challenges, the types of problems, their formulation in the deep reinforcement learning setting, embedding, the model used, efficiency, and robustness. We conclude with a discussion on open challenges and potential directions for pushing further the integration of reinforcement learning in system optimization.

oddsidemargin has been altered.
marginparsep has been altered.
topmargin has been altered.
marginparwidth has been altered.
marginparpush has been altered.
paperheight has been altered.
The page layout violates the ICML style. Please do not change the page layout, or include packages like geometry, savetrees, or fullpage, which change it for you. We’re not able to reliably undo arbitrary changes to the style. Please remove the offending package(s), or layout-changing commands and try again.

 

Deep Reinforcement Learning in System Optimization

 

Ameer Haj-Ali0 0  Nesreen K. Ahmed0  Ted Willke0  Joseph Gonzalez0  Krste Asanovic0  Ion Stoica0 


footnotetext: \forloop@affilnum1¡ 0University of California, Berkeley. Correspondence to: Ameer Haj-Ali <[email protected]>. \@xsect

Reinforcement learning (RL) is a class of stochastic optimization techniques for Markov Decision Processes (MDP) (bellman1957), when the MDP is not known. In RL, an agent continually interacts with the environment (kaelbling1996reinforcement; sutton2018reinforcement). In particular, the agent observes the state of the environment, and based on this observation takes an action. The goal of the RL agent is then to compute a policy–a mapping between the environment states and actions–that maximizes a long term reward. There are multiple ways to extrapolate the policy. Non-approximation methods usually fail to predict good actions in states that were not visited in the past, and require storing all the action-reward pairs for every visited state, a task that would incur a huge memory overhead and complex computations. Instead, approximation methods have been proposed. Among the most successful ones is using a neural network in junction with RL, also known as deep RL. Deep models allow RL algorithms to solve complex problems in an end-to-end fashion, handle unstructured environments, learn complex functions, or predict actions in states that have not been visited in the past. Deep RL is gaining wide interest recently due to its success in robotics, Atari games, and superhuman capabilities (mnih2013playing; doya2000reinforcement; kober2013reinforcement; peters2003reinforcement). Deep RL was the key technique behind defeating the human European champion in the game of Go, which has long been viewed as the most challenging of classic games for artificial intelligence (silver2016mastering).

Many system optimization problems have a nature of delayed, sparse, aggregated or sequential rewards where improving the long term sum of rewards is more important than a single local reward. For example, an RL environment can be a computer cluster. The state can be defined as a combination of the current resource utilization, available resources, time of the day, duration of jobs waiting to run, etc. The action can determine on which resources to schedule each job. The reward can be the total revenue, jobs served in a time window, wait time, energy efficiency, etc., depending on the objective. In this example, if the objective is to minimize the waiting time of all jobs, then a good solution must interact with the computer cluster and monitor the overall wait time of the jobs to determine good schedules. This nature is inherent in RL. RL also has the advantage of self-learning, adaptation, and exploration, which does not require prior knowledge. RL can also learn sophisticated system characteristics that a straight forward solution like first come first served allocation scheme cannot. For instance, it can be better to put recent jobs that take a long time to run on hold; if it is expected that an upcoming job will take a shorter time and fewer resources is arriving soon.

In this paper, we review different attempts to overcome system optimization challenges with the use of deep RL. Unlike previous reviews (hameed2016survey; mahdavinejad2018machine; krauter2002taxonomy; wang2018machine; ashouri2018survey; luong2019applications) that focus on machine learning methods, without discussing deep RL models or apply it to one system problem, we focus on deep RL in system optimization in general. From reviewing prior works, it is evident that standardized metrics for assessing the deep RL solutions in system optimization problems are lacking. We thus propose quintessential metrics to guide future works in evaluating the use of deep RL in system optimization. We also discuss and address multiple challenges that are faced while integrating deep RL in the system.

We observe that the system problems that have been tackled so far with deep RL include packet classification (liang2019neural), congestion control, (jay2019deep; ruffy2018iroko), resource allocation and scheduling in the cloud (mao2016resource; he2017software; he2017integrated; tesauro2006online; xu2017deep; liu2017hierarchical; arabnejad2017comparison; xu2012url; rao2009vconf), query optimization (krishnan2018learning; marcus2018deep; zhong2017seq2sql; ortiz2018learning; guu2017language; liang2016neural; marcus2019neo), compiler optimization (haj2019autophase; kulkarni2012mitigating), and scheduling (addanki2019placeto; paliwal2019regal; coons2008feature), with most of the works focusing on resource allocation and scheduling in the cloud. We anticipate deep RL in system optimization will grow. We believe much more room for improvement is available in both the RL algorithms, system simulators, and problems tackled.

\@xsect
Figure 1: Reinforcement learning. By observing the state of the environment, the agent takes actions for which he receives rewards. The agent’s goal is to take actions that maximize cumulative reward.

One of the promising machine learning approaches is reinforcement learning (RL), in which an agent learns by continually interacting with an environment (kaelbling1996reinforcement). In RL, the agent observes the state of the environment, and based on this state/observation takes an action as illustrated in figure 1. The RL agent’s ultimate goal is to compute a policy–a mapping between the environment states and actions–that maximizes expected future reward. RL can be viewed as a stochastic optimization solution for solving Markov Decision Processes (MDPs) (bellman1957), when the MDP is not known. An MDP is defined by a tuple with four elements: S,A,P(s,a),r(s,a) where S is the set of states of the environment, A describes the set of actions or transitions between states, sP(s,a) describes the probability distribution of next states given the current state and action and r(s,a):S×AR is the reward of taking action a in state s. Given an MDP, the goal of the agent is to gain the largest possible cumulative reward. The objective of an RL algorithm associated with an MDP is to find a decision policy π*(a|s):sA that achieves this goal for that MDP:

π*=argmaxπ𝔼τπ(τ)[tr(st,at)], (1)

where τ is a sequence of states and actions that define a single episode, and T is the length of that episode. Deep RL leverages a neural network to learn the policy (and sometimes the reward function). Over the past couple of years, a plethora of new deep RL techniques have been proposed (mnih2016asynchronous; ross2011; sutton2000; schulman2017proximal; lillicrap2015continuous).

Policy Gradient (PG) (sutton2000), for example, uses a neural network to represent the policy. This policy is updated directly by differentiating the term in Equation 1 as follows:

θJ=θ𝔼τπ(τ)[tr(st,at)]=𝔼τπ(τ)[(tθlogπθ(at|st))(tr(st,at))]1Ni=1N[(tθlogπθ(ai,t|si,t))(tr(si,t,ai,t))] (2)

and updating the network parameters (weights) in the direction of the gradient:

θθ+αθJ, (3)

Proximal Policy Optimization (PPO) (schulman2017proximal) improves on top of PG for more deterministic, stable, and robust behavior by limiting the updates and ensuring the deviation from the previous policy is not large.

By contrast, Q-Learning (watkins1992q), state-action-reward-state-action (SARSA) (rummery1994line) and deep deterministic policy gradient (DDPG) (lillicrap2015continuous) are temporal difference methods, i.e., they update the policy on every timestep (action) rather than on every episode. Furthermore, these algorithms bootstrap and instead of using a neural network for the policy itself, they learn a Q-function, which estimates the long term reward from taking an action. The policy is then defined using this Q-function. In Q-learning the Q-function is updated as follows:

Q(st,at)Q(st,at)+r(st,at)+γmaxat[Q(st,at)], (4)

in other words the Q-function updates are performed based on the action that maximizes the value of that Q-function. In SARSA on the other hand, the Q-function is updated as follows:

Q(st,at)Q(st,at)+r(st,at)+γQ(st+1,at+1), (5)

i.e., the Q-function updates are performed based on the action that the policy would pick given state st. DDPG fits multiple neural networks to the policy, the Q-function and their target time-delayed copies that slowly track the learned networks and greatly improve stability in learning.

Algorithms such as upper-confidence-bound and greedy could later be used to determine the way the policy is defined based on the Q-function (auer2002using; sutton2018reinforcement). The surveyed works in this paper focus on the epsilon greedy method where the policy is defined as follow:

π*={argmaxaQ(a),w.p. 1-ϵrandomaction,w.p. ϵ (6)

A method is considered to be on-policy if the new policy is computed directly from the decisions made by the current policy. PG, PPO and SARSA are thus on policy while DDPG and Q-Learning are off policy. All the mentioned methods are model free; they do not require a model of the environment to learn, but instead learn directly from the environment by trial and error. In some cases a model of the environment could be available. It might also be possible to learn a model of the environment. This model could be used for planning and enable more robust training as less interaction with the environment may be required.

Table 1: The RL agents used for each system problem.
Field Problem
Variations of
Q-Learning
SARSA PG PPO DDPG
Evolutionary
Methods
Networking Congestion Control (ruffy2018iroko)
(jay2019deep)
(ruffy2018iroko)
(ruffy2018iroko)
Classification (liang2019neural)
Cloud Performance
(xu2012url)
(tesauro2006online)
(arabnejad2017comparison)
(rao2009vconf)
(mao2016resource)
Caching
(he2017software)
(he2017integrated)
Energy Efficiency
(xu2017deep)
(liu2017hierarchical)
SQL
Sequence to
SQL/Program
(zhong2017seq2sql)
(guu2017language)
(liang2016neural)
Query Optimization
(krishnan2018learning)
(marcus2019neo)
(ortiz2018learning)
(marcus2018deep)
Compiler Phase Ordering (haj2019autophase) (haj2019autophase) (kulkarni2012mitigating)
Scheduling Device Placement
(paliwal2019regal)
(addanki2019placeto)
(paliwal2019regal)
(coons2008feature)

Most RL methods considered in this work are structured around value function estimation (e.g., Q-values) and using gradients to update the policy. However, it not always the case. For example, genetic algorithms, simulated annealing, genetic programming, and other gradient-free optimization methods - often called evolutionary methods (sutton2018reinforcement) - can also solve RL problems. Also, they are analogous to the way biological evolution produces organisms with skilled behavior. Evolutionary methods can be effective if the space of policies is sufficiently small, the policies are common and easy to find, and the state of the environment is not fully observable. This works considers only the deep versions of these methods, i.e., using a neural network in conjunction with evolutionary methods that are usually used to evolve and update the neural network parameters or vice versa.

Multi-armed bandits (berry1985bandit; auer2002finite) simplify RL by removing the learning dependency on state and thus providing evaluative feedback that depends entirely on the action taken (1-step RL problems). The actions usually are decided in a greedy manner, by updating the benefit estimates of performing each action, independently from other actions. To consider the state in a bandit solution contextual bandits can be used (chu2011contextual). In many cases a bandit solution might perform as well as the complex full RL solution or even better/faster.

\@xsect

Multiple works proposed in the past to use different than deep neural network approximation methods for RL in system optimization. These works include reliability and monitoring (das2014reinforcement; zhu2007reinforcement; zeppenfeld2008learning), and memory management (ipek2008self; andreasson2002collect; peled2015semantic; diegues2014self) in multicore systems. Congestion control (li2016learning; silva2016smart), packet routing (choi1996predictive; littman2013distributed; boyan1994packet), algorithm selection (lagoudakis2000algorithm), cloud caching (sadeghi2017optimal), energy efficiency (farahnakian2014energy) and performance (peng2015random; jamshidi2015self; barrett2013applying; arabnejad2017comparison; mostafavi2018reinforcement). Instead of using a neural network to approximate the policy, these works used tables, linear approximations, and other approximation methods to train and represent the policy. Tables were generally used to store the Q-values, i.e., one value for each action, state pair, which are used in training and this table becomes the ultimate policy. Neural networks in general can outperform both approaches (lin1993reinforcement). Unlike neural networks, tables require very large memory to store every visited state and action, and cannot properly approximate actions in new states that might be similar to other states visited in the past. Neural networks can also learn linear and non linear functions.

\@xsect

In this section we discuss the different system challenges tackled using RL and divide them into two categories: Episodic Tasks in which the agent-environment interaction naturally breaks down into a sequence of separate terminating episodes and Continuing Tasks in which it does not. For example, when optimizing resources in the cloud, the jobs arrive continuously and there is not a clear termination state. But when optimizing the order of SQL joins, the query has a finite number of joins, and thus after enough steps the agent arrives at a terminating state.

\@xsect

An important feature of RL is learning without prior knowledge of the target scenario, ability to learn and update environmental knowledge by actual observations and adapt in unpredictable environment. Jobs in the cloud arrive in an unpredictable and continuous manner. This might explain why most system optimization challenges tackled with RL were in the cloud (mao2016resource; he2017software; he2017integrated; tesauro2006online; xu2017deep; liu2017hierarchical; xu2012url; rao2009vconf). A good job scheduler in the cloud should make decisions that are good in the long term. Such scheduler should sometimes forgo short term gains in an effort to realise greater long term benefits. For example, it might be better to delay a long running job if a short running job is expected to arrive soon. It should also be adaptive to variations in the underlying resource performance and scale in the presence of new or unseen workloads combined with large numbers of resources.

These schedulers have a variety of objectives, including minimizing average performance of jobs and optimizing the resource allocation of the virtual machines (mao2016resource; tesauro2006online; xu2012url; rao2009vconf), optimizing data caching on edge devices and base stations versus the cloud (he2017software; he2017integrated), and energy efficiency (xu2017deep; liu2017hierarchical). Table 1 lists the algorithms used for addressing each problem. The challenge in packet classification is similar to scheduling jobs in the cloud, in that the packets arrive continuously and in an unpredictable manner. (liang2019neural) has thus leveraged deep RL to overcome this challenge.

Interestingly, for cloud challenges most works were driven by Q-learning (or the very similar SARSA). In the absence of a complete environmental model, model free Q-learning can be used to generate optimal policies. It is able to make predictions incrementally by bootstrapping the current estimate onto previous estimates and provide good sample efficiency (jin2018q). It is also characterized by inherent continuous temporal difference behavior where the policy could be updated immediately after each step (not the end of trajectory); something that might be very useful for online adaptation.

\@xsect

Due to the sequential nature of decision making in RL, the order of the actions taken has a major impact on the rewards the RL agent collects. The agent can thus learn these patterns and pick more rewarding actions. Previous works took advantage of this behavior in RL to optimize congestion control (jay2019deep; ruffy2018iroko), decision trees for packet classification (liang2019neural), sequence to SQL/program translation (zhong2017seq2sql; guu2017language; liang2016neural), order of SQL joins (krishnan2018learning; ortiz2018learning; marcus2018deep; marcus2019neo), and compiler phase ordering (haj2019autophase; kulkarni2012mitigating) and device placement (addanki2019placeto; paliwal2019regal).

In these problems, after enough steps, the agent will always arrive at a clear terminating step. For example, in query join order optimization, the number of joins is finite and known from the query or in congestion control – where the routers need to adapt the sending rates to provide high throughput without comprising fairness – the updates are performed on a fixed number of senders/receivers known in advance. These updates combined define one episode. Due to this episodic behavior it might be possible to explain why there is a bigger trend towards using PG methods for these types of problems as they don’t require a continuous temporal difference behavior and can operate in batches of multiple queries for example. Nevertheless, in some cases Q-learning was still used, mainly for sample efficiency as the environment step might take a relatively long time.

To improve the performance of PG methods it is possible to take advantage of the way the gradient computation is performed. If the environment is not needed to generate the observation, it is possible to save many environment steps. This is achieved by rolling out the whole episode from interacting only with the policy and performing one environment step at the very end. The sum of rewards will be the same as the reward received from this environment step. For example, in query optimization, since the observations are encoded directly from the actions, and the environment is mainly used to generate the rewards, it will be possible to repeatedly perform an action, form the observation directly from this action, feed it to the policy network. After the end of the episode, the environment could be triggered to get the final reward, which would be the sum of the intermediate rewards. This could significantly reduce the training time.

\@xsect

Continuous policies can handle both continuous and episodic tasks while episodic policies cannot. So, for example, Q-Learning can handle all the tasks mentioned in this work, while PG based methods cannot directly handle it with out special handling. For example, in (mao2016resource), the authors limited the the scheduler window of jobs to M, allowing the agent in every time step to schedule up to M jobs out of all arrived jobs. The authors also discussed this issue of ”bounded time horizon” and hoped to overcome it by using a value network to replace the time-dependent baseline. It is interesting to notice that previous works on continuous system optimization tasks using non deep RL approaches (choi1996predictive; littman2013distributed; boyan1994packet; peng2015random; jamshidi2015self; barrett2013applying; arabnejad2017comparison; sadeghi2017optimal; farahnakian2014energy) used Q-Learning.

One solution for handling continuing problems with PG based methods without episode boundaries is to define performance in terms of the average rate of reward per time step (sutton2018reinforcement) (Chapter 13.6). Such approach could help better fit the continuous problems to these episodic RL algorithms.

Table 2: Problem formulation in the deep RL setting.
Description Reference State/Observation Action Reward Objective Model
congestion control
(jay2019deep)
(ruffy2018iroko)
histories of sending
rates and resulting
statistics (e.g., loss rate)
changes to sending rate
throughput
and negative of
latency or
loss rate
maximize throughput
while maintaining
fairness
FCNN
packet classification (liang2019neural)
encoding of the
node, e.g., the split
rules
cutting a classification
tree node or partitioning
a set of rules
classification time
/memory
footprint
build optimal decision
tree for packet
classification
FCNN
SQL join
order optimization
(krishnan2018learning)
(ortiz2018learning)
encoding of the query next join negative cost
minimize execution
time
FCNN
query optimization (marcus2019neo) query/plan encodings next join negative cost performance
tree conv.,
FCNN
SQL join
order optimization
(marcus2018deep)
matrix encoding
of the join
tree structure
next join 1/cost
minimize execution
time
FCNN
sequence to
SQL
(zhong2017seq2sql)
SQL vocabulary,
question, column
names
query corresponding
to the token
-2 invalid query,
-1 valid but wrong result,
+1 valid and right result
tokens in the
WHERE clause
LSTM
language to
program translation
(guu2017language)
natural language
utterances
a sequence of
program tokens
1 if correct result
0 otherwise
generate equivalent
program
LSTM,
FCNN
semantic parsing (liang2016neural)
embedding of
the words
a sequence of
program tokens
positive if correct
0 otherwise
generate equivalent
program
RNN,
GRU
resource allocation
in the cloud
(mao2016resource)
current allocation of
cluster resources &
resource profiles of
waiting jobs
next job
to schedule
Σi(-1Ti) for
all jobs in the
system (Ti is the
duration of job i)
minimize average
job slowdown
FCNN
resource allocation (he2017software; he2017integrated)
status of edge
devices, base stations,
content caches
which base station,
to offload/cache
or not
total revenue
maximize total
revenue
CNN
resource allocation
in the cloud
(tesauro2006online)
current allocation
& demand
next resource
to allocate
payments maximize revenue FCNN
resource allocation
in cloud radio
access networks
(xu2017deep)
active remote radio
heads & user demands
which remote
radio heads
to activate
negative power
consumption
power
efficiency
FCNN
cloud resource
allocation &
power management
(liu2017hierarchical)
current allocation
& demand
next resource
to allocate
linear combination of
total power consumption,
VM latency,
& reliability metrics
power efficiency
autoencoder,
weight sharing
& LSTM
automate virtual
machine (VM)
configuration process
(rao2009vconf)
(xu2012url)
current resource
allocations
increase/decrease
CPU/time/memory
throughput
-response time
maximize performance
FCNN,
model-based
compiler phase
ordering
(kulkarni2012mitigating)
(haj2019autophase)
program features
next optimization
pass
performance
improvement
minimize execution
time
FCNN
device placement
(paliwal2019regal)
(addanki2019placeto)
computation graph
placement/schedule
of graph node
speedup
maximize performance
& minimize peak
memory
GNN/FCNN
distributed instr-
uction placement
(coons2008feature) instruction features
instruction placement
location
speedup maximize performance FCNN
Table 3: Evaluation results.
Work Problem
Environment
Step Time
Number of Steps
Per Iteration
Number of training
Iterations
Total Number
Of Steps
Improves State
of the Art
Compares Against
Bandit/Random
Search
packet
classification
(liang2019neural) 20-600ms up to 60,000 up to 167 1,002,000 ✓(18%)
congestion
control
(jay2019deep) 50-500ms 8192 1200 9,830,400 ✓(similar)
congestion
control
(ruffy2018iroko) 0.5s N/A N/A 50,000-100,000
resource
allocation
(mao2016resource) 10-40ms 20,000 1000 20,000,000 ✓(10-63%)
resource
allocation
(he2017software)
(he2017integrated)
N/A N/A 20,000 N/A no comparison
resource
allocation
(tesauro2006online) N/A N/A 10,000-20,000 N/A no comparison
resource
allocation
(xu2017deep) N/A N/A N/A N/A no comparison
resource
allocation
(liu2017hierarchical) 1-120 minutes 100,000 20 2,000,000 no comparison
resource
allocation
(rao2009vconf)
(xu2012url)
N/A N/A N/A N/A no comparison
SQL
Joins
(krishnan2018learning)  10ms  640  100  64,000 ✓(70%)
SQL
joins
(ortiz2018learning) N/A N/A N/A N/A no comparison
SQL
joins
(marcus2019neo) 250ms 100-8,000 100 10,000-80,000 ✓(10-66%)
SQL
joins
(marcus2018deep) 1.08s N/A N/A 10,000 ✓(20%)
sequence to
SQL
(zhong2017seq2sql) N/A 80,654 300 24,196,200 ✓(similar)
language to
program trans.
(guu2017language) N/A N/A N/A 13,000 ✓(56%)
semantic
parsing
(liang2016neural) N/A 3,098 200 619,600 ✓(3.4%)
phase
ordering
(haj2019autophase)  1s N/A N/A 1,000-10,000 ✓(similar)
phase
ordering
(kulkarni2012mitigating)
13.2 days
for all steps
N/A N/A N/A
device
placement
(addanki2019placeto) N/A (seconds) N/A N/A 1,600-94,000 ✓(3%)
device
placement
(paliwal2019regal) N/A (seconds) N/A N/A 400,000 ✓(5%)
instruction
placement
(coons2008feature) N/A (minutes) N/A 200 N/A (days)
\@xsect

Table 2 lists all the works we reviewed and their problem formulation in the context of RL, i.e., the model, observations, actions and rewards definitions. Among the major challenges when formulating the problem in the RL environment is properly defining the system to RL environment translation layer, i.e., state, action spaces and rewards. The rewards are generally sparse and behave similarly for different actions, making the RL training ineffective due to bad gradients. The states are generally defined using hand engineered features that are believed to encode the state of the system. This results in a large state space with some features that are less helpful than others and rarely capturing the actual system state. Using model based RL can alleviate this bottleneck and provide more sample efficiency. (liu2017hierarchical) used auto-encoders to help reduce the state dimensionality. The action space is also large but generally represents actions that are directly related to the objective. Another challenge is the environment step. Some tasks require long time for the environment to perform one step, significantly slowing the learning process of the RL agent.

Interestingly, most works focus on using simple out of the box fully connected neural networks (FCNNs), while some works that targeted parsing and translation ((liang2016neural; guu2017language; zhong2017seq2sql)) used recurrent neural networks (RNNs) (graves2013speech) due to their ability to parse strings and natural language. While FCNNs are simple and easy to train to learn a linear and non-linear function policy mappings, sometimes having a more complicated network structure suited for the problem could further improve the results.

\@xsect

Table 3 lists training, and evaluation results of the reviewed works. We consider the time it takes to perform a step in the environment, the number of steps needed in each iteration of training, number of training iterations, total number of steps needed, and whether the prior work improves the state of the art and compares against random search/bandit solution.

The total numbers of steps and the the cost of each environment step is important to understand the sample efficiency and practicality of the solution, especially when considering RL’s inherent sample inefficiency (schaal1997learning; hester2018deep). For different workloads the number of samples needed varied from thousands to millions. The environment step time also varied from milliseconds to minutes. In multiple cases the interaction with the environment is very slow. Note that in most cases when the environment step time was a few milliseconds, it was because it was a simulated environment, not a real one. We observe that for faster environment steps more training samples were gathered to leverage that and further improve the performance. This excludes (liu2017hierarchical) where a cluster was used and thus more samples could be processed in parallel.

As listed in Table 3, many works did not provide sufficient data to reproduce the results. Reproducing the results is necessary to further improve the solution and enable future evaluation of the solution and comparison against it.

\@xsect

A few RL benchmark toolkits for developing and comparing reinforcement learning algorithms, and providing a faster simulated system environment have been recently proposed. OpenAI Gym (brockman2016openai) supports an environment for teaching agents everything from walking to playing games like Pong or Pinball. Iroko (ruffy2018iroko) provides a data center emulator to understand the requirements and limitations of applying RL in data center networks. It interfaces with the OpenAI Gym and offers a way to evaluate centralized and decentralized RL algorithms against conventional traffic control solutions.

Park (mao2019park) proposes an open platform for easier formulation of the RL environment for twelve real world system optimization problems with one common easy to use API. The platform provides a translation layer between the system and the RL environment making it easier for RL researchers to work on systems problems. With that being said, the framework lacks the ability to change the action, state and reward definitions, making it harder to improve the performance by easily modifying these definitions.

\@xsect

In this section, we present a set of essential metrics that make the case for applying deep RL to a system optimization problem. These metrics can help researchers evaluate the deep RL solution in system optimization problems.

\@ssect

Is the MDP Clearly Defined? By formal definition from dynamical systems theory, the problem of RL is the optimal control of incompletely known MDP. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but future states and rewards. This involves delayed rewards and trade off between delayed and immediate reward. In MDPs, the new state and new reward are dependent only on the preceding state and action. Given a perfect model of the environment an MDP can compute the optimal policy.

MDPs should be a straightforward formulation of the system problem as an agent learning from continually interacting with the system to achieve a particular goal, and the system responds to these interactions with a new state and reward. The agent’s goal is to maximize over time rewards.

\@ssect

Is It a Reinforcement Learning Problem? What distinguishes RL from other machine learning approaches is the presence of self exploration and exploitation, and the tradeoff between them. For example, RL is different from supervised learning. The later is learning from a training set with labels provided by external supervisor that is knowledgeable. For each example the label is the correct action the system should take. The objective of this kind of learning is to act correctly in new situations not present in the training set. However, supervised learning is not suitable for learning from interaction as often it is impractical to obtain examples representative of all the cases in which the agent has to act.

\@ssect

Are the Rewards Delayed? RL algorithms do not maximize the immediate reward of taking actions but the over time reward. For example, an RL agent can take actions that give low intermediate rewards but overall higher than taking a greedy action in every step. If the objective is to maximize the immediate reward or the actions are not dependent, then other simpler approaches such as bandit and greedy algorithms will perform better than deep RL as their objective is to maximize the immediate reward.

\@ssect

What is Being Learned? It is important to provide insights on what is being learned by the agent. For example, which actions are taken in which states and why? Can the knowledge learned be applied to new states/tasks? Is there a structure of the problem being learned? If a brute force solution is possible for simpler tasks, it will also be helpful to show how far the performance of the RL agent is from the brute force solution. In some cases, not all the hand engineered features are useful. This results in high variance and prolonged training. Feature analysis can help overcome this challenge. For example, in (coons2008feature) significant performance gaps were shown for different feature selection.

\@ssect

Does it Outperform Random Search and a Bandit Solution? In some cases the RL solution is just another form of a smart random search. In many cases, the good RL results were achieved merely by luck. Random search might perform as good as RL, or even better as it is less complicated. For example, in (haj2019autophase), the authors showed 10% improvement over the baseline by using random search.

In some cases the actions are independent and a greedy or a bandit solution can achieve the optimal or near optimal solution. Using a bandit method is equivalent to a 1-step RL solution and thus the objective is to maximize the immediate reward. Maximizing the immediate reward could also deliver the overall maximum reward and thus a comparison against a bandit solution can help unfold this and perhaps show that the problem is not an MDP.

\@ssect

Are the Expert Actions Observable? In some cases it might be possible to have access to expert actions, i.e., optimal actions. For example, if a brute force search is plausible and practical then it is possible to outperform deep RL by using it or using imitation learning (schaal1999imitation), which is a supervised learning approach that learns by imitating the expert actions.

\@ssect

Is It Possible to Reproduce/Generalize the Good Results? The learning process in deep RL is stochastic and thus the good results are sometimes achieved due to local maximum, simple task, and luck. In (haarnoja2018soft) different results were generated by just changing the random seeds. In many cases the good results cannot be reproduced by retraining, training on new tasks or generalizing to new tasks.

\@ssect

Does It Outperform State of the Art? The most important metric in the context of system optimization in general, is outperforming the state of the art. Improving state of the art includes different objectives such as, efficiency, performance, throughput, bandwidth, fault tolerance, security, utilization, reliability, robustness, complexity, and energy. If the proposed approach does not perform better than state of the art in some metric then it is hard to justify using it. Frequently, the state of the art solution is also more stable, practical, and reliable than deep RL. In many prior works listed in Table 3 a comparison against state of the art is not available or deep RL performs worse. In some cases deep RL can perform as good as state of the art or slightly worse, but still be a useful solution as it achieves an improvement in other metrics.

\@xsect

Multiple policies and network models can be used. RL frameworks like RLlib (liang2017ray), Intel’s Coach (caspi_itai_2017_1134899), TensorForce (tensorforce), Facebook Horizon (gauci2018horizon), and Google’s Dopamine (castro18dopamine) can help the users pick the right RL model as they provide implementations of many policies and models for which a convenient interface is available.

As a rule of thump, we rank RL algorithms based on sample efficiency as follows: model-based approaches (most efficient), temporal difference methods, PG methods, and evolutionary algorithms (least efficient). In general, many RL environments run in a simulator. For example (paliwal2019regal; mao2019park; mao2016resource), run in a simulator as the real environment’s step would take minutes or hours, which significantly slows down the training. If this simulator is fast enough or training time is not constrained then PG methods can perform well. If the simulator is not fast enough or training time is constrained then temporal difference methods can do better than PG methods as they are more sample efficient.

If the environment is genuine then temporal difference can do well if interaction with the environment is not slow while model based RL performs better if the environment is slow. Model based methods require a model of the environment (that can be learned) and rely on planning rather than learning. Since planing is not done in the actual environment but as much faster steps in the model of that environment, it requires the less samples from the real environment to learn. With that being said, model free methods are often used as they are simpler to deploy and have the potential to generalize better from exploration in the real environment.

If good policies are easy to find and the space of policies is small enough or if time is not a bottleneck for the search, then evolutionary methods can be effective. Evolutionary methods also have advantages when the learning agent cannot observe the complete state of the environment. Bandit solutions are good if the problem can be viewed as a one step RL problem.

PG methods in general are more stable than methods that do not use and derive a neural network to represent the agent’s policy itself such as Q-learning. The greedy nature of directly deriving the policy and moving the gradient in the direction of the objective make PG methods more stable, easier to reason about, and reliable.

\@xsect

In this section, we discuss the primary challenges that face the application of deep RL in system optimization.

\@xsect

Unlike the case with simulated environments that can run fast, when running on a real system, performing an action can trigger reward received after a long delay. For example, when scheduling jobs on a cluster of nodes, some jobs might require hours to run, and thus improving their performance by monitoring job execution time will be very slow. To speed up this process, some works used simulators as cost models instead of the actual system. These simulators often do not fully capture the actual behavior of the real system and thus the RL agent might not work well when used in practice after being trained in a simulated environment. More comprehensive environment models can alleviate the generalization from simulated environments. RL methods that can be more sample efficient, will speed up the training in the real system environment.

Instability and High Variance. This is a common problem which leads to bad policies when tackling system problems with deep RL. Such policies can generate a large performance gap when trained multiple times and behave in an unpredictable manner. This is mainly due to poor formulation of the problem as an RL problem, limited observation of the state, i.e., the used embedding and input features are not sufficient/meaningful, and sparse or similar rewards. Sparse rewards can be due to bad reward definition or the fact that some rewards cannot be computed directly and are known only by the end of the episode. For example, in (liang2019neural), where deep RL is used to optimize decision trees for packet classification, the reward (the performance of the tree) is known only after up to 15,000 steps when the whole tree is built. In some cases using more robust and stable policies can help. For example, Q-learning is known to have good sample efficiency but unstable behavior. SARSA, double Q-learning (van2016deep) and policy gradient methods on the other hand are more stable. Subtracting a bias in PG can also help reduce variance (greensmith2004variance).

\@xsect

This is often the case with most recent system optimization works that rely on deep RL in their solutions. It becomes difficult to reproduce the results due to restricted access to the resources, code and workloads used, lack of a detailed list of the used network hyperparameters and lack of stable, predictable, and scalable behavior of the different RL algorithms. This challenge prevents future deployment, incremental improvements, and proper evaluation.

\@xsect

This is a major challenge since without proper definitions of states, actions, and rewards, the RL solution is not useful. In many cases, the rewards are sparse or similar, the states are not fully observable to capture the whole system state and have limited features that capture only a small portion of the system state. This results in unstable and inadequate policies. Generally, the action and state spaces are large, requiring a lot of samples to learn and resulting in instability and large variance in the learned network. Therefore, retraining often fails to generate the same results.

\@xsect

The lack of generalization is an issue that deep RL solutions often suffer from. This might be beneficial when learning a particular structure. For example, in NeuroCuts (liang2019neural), the target is to build the best decision tree for fixed set of predefined rules and thus the objective of the RL agent is to find the optimal fit for these rules. However, lack of generalization sometimes results in a solution that works for a particular workload or setting but at the overall system performance it is not good. This problem manifests when generalization is important and the RL agent has to deal with new states that it did not visit in the past. For example, in (paliwal2019regal; addanki2019placeto) where the RL agent has to learn good resource placements for different computation graphs, the authors avoided the possibility of learning only good placements for particular computation graphs by training and testing on a wide range graphs.

\@xsect

The lack of standardized benchmarks, frameworks and evaluation metrics makes it very difficult to evaluate the effectiveness of the deep RL methods in the context of system optimization. In some cases, the promising results were achieved by mere luck. Thus, it is crucial to have proper standardized frameworks and evaluation metrics that define success. Moreover, we need benchmarks that enable proper training, evaluation of the results, measuring the generalization of the solution to new problems and performing valid comparisons against baseline approaches.

\@xsect

To put all the metrics and challenges together, we discuss DeepRM (mao2016resource) as an illustrative example. In DeepRM, the targeted system problem is focused on resource allocation in the cloud. The objective is the job slowdown, i.e.,, the goal is to minimize the wait time for all jobs. DeepRM uses PG in conjunction with a simulated environment rather than a real cloud environment. This significantly improves the step time but can result in restricted generalization when used in the real environment. Furthermore, since all the simulation parameters are known, a fully observable state of the simulated environment can be captured. The actions were defined as which job should be scheduled next. The state is defined as the current allocation of cluster resources, as well as the resource profiles of jobs waiting to be scheduled. The reward is defined as the sum of of job slowdowns: Σi(-1Ti) where Ti is the pure execution time of job i without considering the wait time. This reward basically gives a penalty of -1 for jobs that are waiting to be scheduled. The penalty is divided by Ti to give a higher priority to shorter jobs.

The state, actions and reward clearly define an MDP and a reinforcement learning problem. Specifically, the agent interacts with the system by making sequential allocations, observing the state of the current allocation of resources and receiving delayed long term rewards as overall slow downs of jobs. The rewards are delayed because the agent cannot know the effect of the current allocation action on the overall slow down, of all the jobs, at any particular time step. Thus, the agent would have to wait until all the other jobs get allocated. The agent then learns which jobs to allocate in the current time step to minimize the average job slowdown, given the current resource allocation in the cloud. Note that DeepRM also learns to withhold larger jobs to make room for smaller jobs to reduce the overall average job slowdown. DeepRM is shown to outperform random search.

The expert actions are not observable in this problem as there are no methods to find the optimal allocation decision at any particular time step. During training in DeepRM, multiple examples of job arrival sequences were considered as to see if the policy generalizes and makes reliable decisions11 1 Results provided were in the simulated system, not in the real system.. DeepRM is also shown to outperform the state-of-the-art by 1063%1.

Clearly, in the case of DeepRM, most of the challenges mentioned in Section id1 are manifested. The interaction with the real cloud environment is slow and thus the authors opted for a simulated environment. This has the advantage of speeding up the training but often could result in a policy that does not generalize to the real environment. Unfortunately, generalization tests in the real environment were not provided. The instability and high variance were addressed by subtracting a bias in the PG equation. The bias was defined as the average of job slowdowns taken at a single time step across all episodes. The implementation of DeepRM was open sourced allowing others to reproduce the results. The rewards, actions, and states defined allowed the agent to learn a policy that performed well in the simulated environment. Note that defining the state of the system was easier because the environment was simulated. The solution also considered multiple reward definitions. For example, -|J|, where J is the number of unfinished jobs in the system. This reward definition optimizes the average job completion time. The jobs evaluated in DeepRM were considered to arrive online according to a Bernoulli process. In addition, the jobs were chosen randomly and it is unclear whether they represent real workload scenarios or not. This emphasizes the necessity for standardized benchmarks and frameworks to evaluate the effectiveness of the deep RL methods in scheduling jobs in the cloud.

\@xsect

We believe multiple future directions for the deployment of deep RL in system optimization tasks. The general assumption is that deep RL might be useful in every system problem where the problem could be formulated as a sequential decision process where meaningful action, state, and reward definitions could be provided. The objective of deep RL in such systems could span a wide range of options, such as, energy efficiency, power, reliability, monitoring, revenue, performance, and utilization. At the processor level, deep RL could be used in branch prediction, memory prefetching, caching, data alignment, garbage collection, thread/task scheduling, power management, reliability, monitoring. Compilers could also benefit in optimizing the order of passes (optimizations), knobs/pragmas, unrolling factors, memory expansion, function inlining, vectorizing multiple instructions, tiling and instruction selection. With advancement of in and near memory processing, deep RL can be used to determine which portions of the workload should be performed in/near memory and which outside the memory.

At higher system level, deep RL could be used in SQL/pandas query optimization, cloud computing, scheduling, caching, monitoring (e.g., temperature/failure) and fault tolerance, packet routing and classification, congestion control, FPGA allocation, algorithm selection. While some of these works have already been done we believe there is a big potential for improvement. It is necessary to explore more benchmarks, stable and generalizable learners, transfer learning approaches, RL algorithms, model based RL and more importantly to provide better encoding of the states, actions and rewards to better represent the system and thus improve the learning. For example, with SQL/pandas join order optimization, the contents of the database are critical for determining the best order and thus somehow incorporating an encoding of these contents can further improve the performance.

There is room for improvement in the RL algorithms. Some action and state spaces can dynamically change with time. For example, when adding one more node to a cluster, the RL agent will always skip the added node and it will not be captured in the environment state. Generally, the state transition function of the environment is unknown the agent. Therefore, it is not guaranteed that if the agent takes a certain action, a certain state will follow in the environment. This issue was presented in (kulkarni2012mitigating), where compiler optimization passes were selected using deep RL. The authors mentioned a situation where the agent is stuck in an infinite loop of repeatedly picking the same optimization (action) back to back. This issue arose when a particular optimization did not change the features that describe the state of the environment, causing the neural network to apply the same optimization. To break this infinite loop, the authors limited the number of repetitions to five, and then instead, applied the second best optimization. This is done by taking the actions that corresponds to the second highest probability from the neural network’s probability distribution output.

\@xsect

In this work we reviewed, discussed multiple challenges in recent trends on applying deep reinforcement learning in system optimization, and proposed a set of metrics that can help evaluate the harness of deep reinforcement learning in a system optimization problem.

The recent trends of deep RL in system optimization were mainly in packet classification, congestion control, compiler optimization, scheduling, query optimization and cloud computing. Looking forward, we anticipate deep RL in system optimization will grow. We also believe there is much more room for improvement in both the deep reinforcement learning algorithms and the system problems that could be tackled with it.

References