Abstract
Recent research on SoftwareDefined Networking (SDN) strongly promotes theadoption of distributed controller architectures. To achieve high networkperformance, designing a scheduling function (SF) to properly dispatch requestsfrom each switch to suitable controllers becomes critical. However, existingliterature tends to design the SF targeted at specific network settings. Inthis paper, a reinforcementlearningbased (RL) approach is proposed with theaim to automatically learn a general, effective, and efficient SF. Inparticular, a new dispatching system is introduced in which the SF isrepresented as a neural network that determines the priority of eachcontroller. Based on the priorities, a controller is selected using ourproposed probability selection scheme to balance the tradeoff betweenexploration and exploitation during learning. In order to train a general SF,we first formulate the scheduling function design problem as an RL problem.Then a new training approach is developed based on a stateoftheart deep RLalgorithm. Our simulation results show that our RL approach can rapidly design(or learn) SFs with optimal performance. Apart from that, the trained SF cangeneralize well and outperforms commonly used scheduling heuristics undervarious network settings.
Quick Read (beta)
Effective Scheduling Function Design in SDN through Deep Reinforcement Learning
Abstract
Recent research on SoftwareDefined Networking (SDN) strongly promotes the adoption of distributed controller architectures. To achieve high network performance, designing a scheduling function (SF) to properly dispatch requests from each switch to suitable controllers becomes critical. However, existing literature tends to design the SF targeted at specific network settings. In this paper, a reinforcementlearningbased (RL) approach is proposed with the aim to automatically learn a general, effective, and efficient SF. In particular, a new dispatching system is introduced in which the SF is represented as a neural network that determines the priority of each controller. Based on the priorities, a controller is selected using our proposed probability selection scheme to balance the tradeoff between exploration and exploitation during learning. In order to train a general SF, we first formulate the scheduling function design problem as an RL problem. Then a new training approach is developed based on a stateoftheart deep RL algorithm. Our simulation results show that our RL approach can rapidly design (or learn) SFs with optimal performance. Apart from that, the trained SF can generalize well and outperforms commonly used scheduling heuristics under various network settings.
I Introduction
SoftwareDefined Networking (SDN), a newlyemerged network paradigm, is notable for decoupling the control logic from the data forwarding function and forming a logically centralized control plane. With the separation of control and data planes, SDN greatly simplifies network management and enables efficient network configuration. To improve the scalability of SDN, distributed controller architectures [4, 11] have become a notable invention where multiple controllers are jointly deployed for scalable request processing.
Apparently, high network performance depends on effective utilization of controller resources. This can be achieved by properly dispatching requests originated from every switch to suitable controllers chosen by a scheduling function (SF). Obviously, SF plays a vital role in the overall network performance. Motivated by this understanding, we aim to address the scheduling function design (SFD) problem in this paper.
Particularly, the designed SF must satisfy both the time efficiency requirement (${R}_{\mathrm{1}}$) and the generalization requirement (${R}_{\mathrm{2}}$). In view of the fact that request dispatching must be performed in real time with minimum delay, the designed SF needs to be sufficiently efficient in practice. Moreover, SDN networks can vary significantly in the number and capacities of controllers. Thus, the designed SF should perform consistently well over different network settings.
Existing studies have considered either manual or automated design of similar functions for scheduling and resource allocation [10, 14, 15]. Specifically, manually designed SFs such as weighted roundrobin and firstcome firstserve have been widely used in operating systems and cloud computing [16]. Obviously, the process of designing useful SFs is timeconsuming and requires a high level of domain expertise. To address this difficulty, evolutionary computation (EC) techniques have been proposed to automatically design SFs for standard job shop scheduling problems [14, 15]. However, the evaluation process in EC is timeconsuming and costly since numerous randomly generated SFs must be extensively evaluated in either simulated or realworld environments.
Due to the above limitations, a new learning approach is highly desirable for our SFD problem. Recently, reinforcement learning (RL) has been successfully applied to various resource management problems [12, 21] and is considered to be a powerful paradigm for designing SFs with several key advantages. First, no domain knowledge of the environment is required. RL can automatically learn the optimal solution while interacting with the unknown dynamic environment through a trialanderror process. Second, RL can design new SFs mainly based on experiences/data obtained from an old SF through a technique known as experience replay [13]. Thus, in comparison to an EC approach, the cost of training any new SFs can be greatly reduced. Third, the scheduling problem under a specific network setting can be naturally formulated as a Markov Decision Process (MDP) (detailed discussion can be found in Section IIIB), aiming to find an optimal policy using existing RL algorithms. In particular, a policy is a mapping from network states to a dispatching decision.
Despite the clear advantages offered by RL, several major issues must be addressed. (1) The representation of a general policy remains a challenge. Typically, a policy can be directly represented as a neural network (NN) with fixed numbers of output nodes and each node represents one particular controller in the network. Such a representation apparently violates ${R}_{\mathrm{2}}$ since the same policy is expected to function effectively in networks with different numbers of controllers. Moreover, as the number of controllers increases, the NN inevitably increases its complexity, which leads to long computational time to make a scheduling decision, potentially violating ${R}_{\mathrm{1}}$ too. (2) It is difficult to maintain a good balance between exploration and exploitation for effective RL. Particularly, when multiple controllers are deployed, we expect to select each controller with a certain probability instead of deterministically choosing one controller. In such a way we can explore and learn the benefit of using each controller. However, this could easily cause performance degradation without carefully controlling the level of exploration. As far as we know, none of the existing works have considered and solved the above issues.
In this paper, a new RLbased SFD method is proposed with the aim to automatically learn an effective and efficient SF for general use. The following contributions have been achieved. (1) Instead of representing the SF as an RL policy, a new dispatching system is proposed as a practical implementation of an RL policy, in which the SF is represented as an NN taking the states of each individual controller as input and outputting its \textquotepriority. (2) Given the priorities of all controllers, a probability selection scheme is proposed as part of the dispatching system to cope with the explorationexploitation dilemma. With the proposed selection scheme, only controllers with high priorities have the possibilities to be selected. As a consequence, the longterm network performance can be improved without sacrificing exploration whenever a deterministic controller selection scheme is adopted. (3) Along with the new dispatching system, a new training system is developed based on a stateoftheart actorcritic RL algorithm [19]. In particular, a new gradient calculation technique is derived for learning the SF. Apart from that, a new training scheme is proposed so as to constantly and adaptively improve the performance of the SF under a variety of network settings.
II Related Work
In recent years, distributed controller architectures [4, 11] have been widely adopted in SDN to enhance the network performance. Although multiple controllers can be deployed in the control plane, the network performance still heavily relies on effective utilization of the controller resources. Thus, designing an effective and efficient SF for request dispatching is of great importance.
In the literature, there exist SFs in the form of heuristics designed by human experts. For example, a weighted roundrobin heuristic is designed to proportionally forward requests to controllers based on their processing capacities. BLAC [10] randomly sampled a small number of controllers and sent requests to the least loaded one. Similar approaches can also be found in literature [8, 22]. Although such manually designed SFs are intuitive and simple in nature, the design process is timeconsuming and requires substantial domain knowledge. Moreover, the performance of manually designed SFs could also vary significantly, depending on specific network settings.
To address these limitations, EC techniques have been widely applied for automatically designing SFs. For instance, Su et al. [14] proposed a multiobjective Genetic Programming (GP) approach for handling dynamic job shop scheduling problems. Although promising results have been obtained in the literature, these SFs are generally designed offline and cannot easily and quickly adapt to the neverending changes in the network environment [5]. Besides, each newly evolved SF must be extensively tested in either simulated or realworld environments, which is timeconsuming and costly.
Recently, a completely different design approach based on RL has been studied in the literature [12, 21, 6]. Tesauro et al. [21] proposed an RLbased approach to automatically allocate the server resources in data centers. DeepRM [12] tackled the multiresource cluster scheduling problem using policy search to optimize various objectives, e.g., average job completion time and resource utilization. Chinchali et al. [6] leveraged the delaytolerant feature of IoT traffic and developed an RLbased scheduler to handle traffic variation so that the network utilization can be constantly optimized.
All these RLbased methods assume that the policy is represented by an NN. The dimension of the outputs is fixed and essentially equal to the number of controllers in an SDN network. However, when the network environment changes, e.g., more controllers are added due to the traffic growth, the trained policy is no longer applicable. Therefore, existing RL methods cannot be directly utilized to solve our SFD problem.
In view of above reasons, a new RLbased approach is proposed. According to our discussion in Section I, our approach satisfies both ${R}_{\mathrm{1}}$ and ${R}_{\mathrm{2}}$ and has the key strength of leveraging RL to effectively schedule requests within the SDN network so as to optimize the network performance.
III Understanding the SFD Problem
In this section, we will introduce the SFD problem by discussing the key concepts and modeling the problem as an MDP that lays the foundation of an RLbased solution.
IIIA The SFD Problem in SDN
Before introducing the SFD problem, we first introduce the network environment where the SF is applied. To ease discussion, let us consider an SDN network composed of ${N}_{c}$ controllers and ${N}_{s}$ switches. Specifically, the processing capacities of the ${N}_{c}$ controllers can be captured through $\bm{\alpha}=[{\alpha}_{1},\mathrm{\dots},{\alpha}_{{N}_{c}}]$ where ${\alpha}_{{n}_{c}}({n}_{c}=1,\mathrm{\dots},{N}_{c})$ represents the maximum number of requests that controller ${C}_{{n}_{c}}$ can process within a second. Packets arrive at switches constantly in the data plane. Note that when a new packet arrives at a switch, the switch will generate a request and pass it to a controller for processing. The packet arrival rate, therefore, is identical to the rate at which requests are generated by switches. We assume that packets arrive randomly at switches with respective arrival rates $\bm{\lambda}=[{\lambda}_{1},\mathrm{\dots},{\lambda}_{{N}_{s}}]$. Similar to existing works [10, 23], we assume that the time for processing each request by the same controller is roughly identical. However, it should also be noted that since controllers have different capacities, the processing time would change from one controller to another.
Once a request is generated at a switch, it will be immediately forwarded to a controller for processing with the help of our dispatching system. Since requests are generated (and dispatched) at different time, the dispatching of each request is considered as a separate time step of our dispatching system. This assumption can be flexibly supported with different implementation of the dispatching system. For example, we can install a separate and identical dispatching system on each switch to handle its request. The communication delay between the switches and controllers can be described by a matrix $\bm{D}$, where each element ${D}_{{n}_{s}}^{{n}_{c}}$ represents the delay between switch ${S}_{{n}_{s}}$ and controller ${C}_{{n}_{c}}$.
After processing any request, the controller will send a response back to the corresponding switch. The time interval measured by the switch between sending a request and receiving the response is defined as the request response time $\tau $. Apart from request processing, controllers will periodically report their status $\bm{u}=[{u}_{1},\mathrm{\dots},{u}_{{N}_{c}}]$ in terms of current resource utilization to all switches. Without loss of generality, we also assume that each controller maintains a request queue and processes requests in an FIFO manner [10].
With a properly designed SF, we expect to reduce the average request response time. To achieve this, ${R}_{\mathrm{1}}$ is crucial to avoid potential network performance degradation. We consider that the NN can meet ${R}_{\mathrm{1}}$ because small feedforward NNs can be quickly processed with the support of efficient and performanceoptimized software (e.g., TensorFlow [3]) and dedicated processing chips [1]. In addition to ${R}_{\mathrm{1}}$, a carefully designed SF must generalize well (i.e., ${R}_{\mathrm{2}}$). Since the cost of evaluation or training an SF in a production network can be high, the SF needs to be evaluated or trained in advanced. Note that different SDN networks can vary significantly in terms of number and capacities of controllers. Even within the same network, the number of controllers may change dynamically to accommodate the traffic fluctuation. Thus, a generally applicable SF should be able to immediately cope with these variations. To achieve this goal, we need to first model an SFD problem as an MDP.
IIIB Modeling the SFD Problem as an MDP
An MDP is usually described by a 4tuple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R})$. At each time step $t$, an agent observes its current state ${\bm{s}}_{t}\in \mathcal{S}$ while interacting with an unknown environment and takes an action ${a}_{t}\in \mathcal{A}$ chosen from a policy ${\pi}_{\bm{\theta}}$. A policy ${\pi}_{\bm{\theta}}$ is often considered as a parametric function of $\bm{\theta}$, which maps $\mathcal{S}$ to a probability distribution over $\mathcal{A}$. After performing ${a}_{t}$, the agent receives a reward given by the reward function $\mathcal{R}({\bm{s}}_{t},{a}_{t})$ and enters the next state ${\bm{s}}_{t+1}$ decided by the state transition probabilities $\mathcal{P}({\bm{s}}_{t+1}{\bm{s}}_{t},{a}_{t})$. The goal for RL in a finite horizon $T$ is to learn the policy ${\pi}_{\bm{\theta}}$ so as to maximize the expected longterm cumulative reward defined below:
$${V}_{{\pi}_{\bm{\theta}}}(\bm{s})=E\{\sum _{t=0}^{T}\mathcal{R}({\bm{s}}_{t},{a}_{t}){\bm{s}}_{0}=\bm{s},{\pi}_{\bm{\theta}}\}$$  (1) 
The SFD problem can be naturally formulated as an MDP. Specifically, the network reaches a new time step $t$ is defined whenever one new request is generated by a switch in the data plane. At each time step $t$, ${\bm{s}}_{t}$ contains all current and historical network information, such as $\bm{\lambda}$ and $\bm{u}$; ${a}_{t}$ is the controller selected by the policy ${\pi}_{\bm{\theta}}$ and ${\pi}_{\bm{\theta}}$ is the dispatching system (Fig.1) to be introduced in Section IV. After dispatching the requests to the chosen controller, the network keeps operating until the next request is generated. At that moment the network enters the next state ${\bm{s}}_{t+1}$. In order to train ${\pi}_{\bm{\theta}}$ towards optimizing our objective (i.e., minimizing the average request response time), we define the reward as
$$\mathcal{R}({\bm{s}}_{t},{a}_{t})=\sum _{j\in {\mathcal{J}}_{t}}\frac{1}{{\tau}_{j}}$$  (2) 
where ${\tau}_{j}$ is the response time of request ${r}_{j}$ and ${\mathcal{J}}_{t}$ stands for the set of requests, for which the corresponding response from controllers have been received by the respective switches in between two consecutive time steps $t$ and $t+1$.
Clearly, each request ${r}_{j}$ contributes $\frac{1}{{\tau}_{j}}$ to the total reward. Guided by this reward, an RL algorithm is strongly motivated to receive more responses from controllers and to reduce the average response time simultaneously. By maximizing the cumulative reward in (3), we can therefore fulfill our goal of reducing the request response time and improving the network performance.
$$\underset{\bm{\theta}}{\text{max}}{V}_{{\pi}_{\bm{\theta}}}(\bm{s})=\underset{\bm{\theta}}{\text{max}}E\{\sum _{t=0}^{T}\sum _{j\in {J}_{t}}\frac{1}{{\tau}_{j}}{\bm{s}}_{0}=\bm{s},{\pi}_{\bm{\theta}}\}$$  (3) 
IV Proposed RLbased Approach for SFD
Our proposed RLbased approach for SFD consists of two major systems: the dispatching system in Fig. 1 and the training system in Fig. 2. In particular, the dispatching system, which is also known as the RL policy, chooses a suitable controller for each incoming packet so as to minimize the response time. On the other hand, the training system is in charge of optimizing the SF under varied network settings for general use.
IVA Dispatching System Design
By modeling the SFD problem as an MDP, RL algorithms can be utilized to design a general, efficient, and effective SF. However, as we mentioned in Section I, existing policy representation fails to meet both ${R}_{\mathrm{1}}$ and ${R}_{\mathrm{2}}$. Thus, a new policy representation must be designed. Inspired by the successful use of dispatching rules for supporting diverse job shop scheduling tasks [14, 15], we decide to employ an SF to determine the priority for each controller to process every incoming packet. Consequently, the same SF is capable of dispatching requests in a network with arbitrary number of controllers. Moreover, we adopt NN in this paper to improve the expressiveness and trainability of the SF. It is important to note that in job shop scheduling problems, the job with the highest priority will always be processed first according to the dispatching rule. In comparison, we choose to give more controllers nonnegligible opportunity of processing a request, thereby encouraging the exploration during the training process in hope of achieving higher performance.
In line with this idea, a new dispatching system has been designed in Fig. 1. The system consists of three modules  the state extractor $\phi $, the SF ${f}_{\bm{\theta}}$, and the probability projector $\gamma $. In particular, when a new request is generated at a switch at time $t$, the dispatching system will select a controller using the following steps: First of all, $\phi $ extracts the specific state information ${\bm{s}}_{{n}_{c},t}$ for each controller from the raw entire network information ${\bm{s}}_{t}$. Then the state information ${\bm{s}}_{{n}_{c},t}$ of each controller will be processed individually and sequentially by the SF ${f}_{\bm{\theta}}$ which outputs a corresponding priority value ${o}_{{n}_{c},t}$. Given all controllers’ priorities ${\bm{o}}_{t}=[{o}_{1,t},\mathrm{\dots},{o}_{{N}_{c},t}]$, $\gamma $ is activated to map ${\bm{o}}_{t}$ into dispatching probabilities ${\bm{p}}_{t}$. Based on ${\bm{p}}_{t}$, a controller is selected to process the new request.
Obviously, the output priority ${o}_{{n}_{c},t}$ of controller ${C}_{{n}_{c}}$ depends heavily on its state input ${\bm{s}}_{{n}_{c},t}$ as shown in Fig. 1. Thus, selecting suitable state information for each controller is critical. Intuitively, the preference of choosing controller ${C}_{{n}_{c}}$ relies on its history. If its response time ${\tau}_{{n}_{c}}$ or utilization ${u}_{{n}_{c}}$ dramatically increased recently, the controller is very likely to be heavily loaded in the near future. Sending requests to that controller is likely to result in long response time. Similarly, the request arrival rate ${\lambda}_{{n}_{s},t}$ at switch ${S}_{{n}_{s}}$ at time step $t$ should also be included because the controller’s future utilization is directly affected by the number of requests originated from switch ${S}_{{n}_{s}}$. Apart from that, the preference of choosing ${C}_{{n}_{c}}$ also depends on its processing capacity ${\alpha}_{{n}_{c}}$ and communication delay ${D}_{{n}_{s}}^{{n}_{c}}$. Therefore, the information mentioned above should all be included in ${C}_{{n}_{c}}$’s state information ${\bm{s}}_{{n}_{c},t}$.
Given the controller state ${\bm{s}}_{{n}_{c},t}$, ${f}_{\bm{\theta}}$ computes the controller’s priority ${o}_{{n}_{c},t}$ through an NN parameterized by $\bm{\theta}$ which will be optimized using an adapted RL algorithm elaborated in Section IVB and IVC. The calculated priorities of all controllers ${\bm{o}}_{t}=[{o}_{1,t},\mathrm{\dots},{o}_{{N}_{c},t}]$ are then forwarded to the next system module.
After receiving the priorities ${\bm{o}}_{t}$ from all controllers, the probability projector $\gamma $ maps the priorities into dispatching probabilities ${\bm{p}}_{t}=[{p}_{1,t},\mathrm{\dots},{p}_{{N}_{c},t}]$. One of the most widely used mappings is the softmax function [20]. However, we do not consider it as an appropriate option because a nonzero probability will always be assigned to a controller even though the controller is clearly not a suitable candidate for processing the pending request (e.g., the controller is too far away from a switch or has very low capacity). This will inevitably result in network performance degradation. On the other hand, a deterministic approach which always selects the controller with the highest priority is also inappropriate. Because purely exploiting the currently best controller prevents an RL algorithm from exploring other suitable controller candidates that can bring more benefits in reducing the average response time.
To balance the tradeoff between exploration and exploitation, an Euclidean projection method [24] is adopted here. In particular, we define ${\stackrel{\mathbf{~}}{\bm{o}}}_{t}=[{\stackrel{~}{o}}_{1,t},\mathrm{\dots},{\stackrel{~}{o}}_{{N}_{c},t}]$ as the normalized priorities of ${\bm{o}}_{t}$ and is sorted in descending order, where ${\stackrel{~}{o}}_{i,t}$ represents the normalized priority of the controller with the ${i}^{th}$ highest priority. Note that ${\bm{o}}_{t}$ is the output of the NN which is guaranteed to be nonnegative by using a softplus activation function in the output layer. We expect to compute the Euclidean projection of ${\stackrel{\mathbf{~}}{\bm{o}}}_{t}$ to ${\bm{p}}_{t}$ so that the Euclidean distance between ${\stackrel{\mathbf{~}}{\bm{o}}}_{t}$ and ${\bm{p}}_{t}$ can be minimized.
Existing studies [24] showed that the Euclidean projection problem can be solved by assigning nonzero probabilities to the $m$ controllers with the highest priorities while setting $0$ probabilities to the remaining controllers. The exact solution to the Euclidean projection problem is further given as follows:
$${p}_{i,t}=\{\begin{array}{cc}{\stackrel{~}{o}}_{i,t}+\frac{1}{m}(1{\sum}_{j=1}^{m}{\stackrel{~}{o}}_{j,t}),\text{if}\mathrm{\hspace{0.33em}1}\le i\le m\hfill & \\ 0,\text{otherwise}\hfill & \end{array}$$  (4) 
where $m$ is a hyperparameter for our dispatching system.
With (4), the requests will be always dispatched to \textquoteappropriate controllers (i.e., controllers with high priorities) so as to improve the longterm network performance. Furthermore, by manipulating the value of $m$, we can explicitly control the level of exploration.
IVB Adapting PPO to Train the SF
Among existing RL algorithms, proximal policy optimization (PPO) is selected for training the SF because of several reasons. First of all, PPO can perform multiple epochs of minibatch policy update using previously sampled data, greatly improving sample efficiency. Secondly, PPO employs only firstorder optimization which is more computationally efficient compared to other RL algorithms [17]. Finally, PPO has been widely and successfully used in many problem domains. Studies [19] have shown that PPO can outperform many stateoftheart algorithms such as TRPO [17] and A2C [13] on many difficult RL problems. It is particularly effective at training functions modeled as deep NNs. While we only use PPO in this paper, our research does not rule out the possibilities of using other RL algorithms.
As a prominent and highly efficient actorcritic algorithm, PPO uses an NN to approximate the value function which is then used to train the SF. In order to apply PPO, an NN denoted as ${f}_{\bm{\omega}}$ parameterized by $\bm{\omega}$ representing the value function is required. In line with ${R}_{\mathrm{2}}$, the number of inputs to ${f}_{\bm{\omega}}$ should not change even when the network setting alters. In our simulation studies, we found it useful to feed ${f}_{\bm{\omega}}$ with highlevel statistics ${\bm{s}}_{t}^{\prime}$ that accurately capture the performance and operation of our SDN network in the recent past. Thus, in our experimental study, the inputs ${\bm{s}}_{t}^{\prime}$ for ${f}_{\omega}$ contain the total control plane capacity, the weighted average communication delay, the overall request arrival rate, and the recent history of both average response time and the control plane utilization.
Given the value function ${f}_{\bm{\omega}}$, PPO obtains the optimal policy ${\pi}_{\bm{\theta}}$ by maximizing the following clipping function:
$$C=\underset{\bm{\theta}}{\text{max}}E\left\{\text{min}({r}_{t}(\bm{\theta}){A}_{t},\text{clip}({r}_{t}(\bm{\theta}),1\u03f5,1+\u03f5){A}_{t})\right\}$$  (5) 
where $\u03f5$ is a hyperparameter (e.g., $0.2$), ${A}_{t}$ is the advantage function that can be estimated through ${f}_{\bm{\omega}}$ [18], and ${r}_{t}(\bm{\theta})=\frac{{\pi}_{\bm{\theta}}({a}_{t}{\bm{s}}_{t})}{{\pi}_{{\bm{\theta}}_{old}}({a}_{t}{\bm{s}}_{t})}$. It is straightforward to see that the policy ${\pi}_{\bm{\theta}}$ can be improved by repeatedly updating the policy parameters $\bm{\theta}$ along the direction of $\frac{\partial C}{\partial \bm{\theta}}$.
In particular, for any state ${\bm{s}}_{t}$, the gradient $\frac{\partial C}{\partial \bm{\theta}}$ can be calculated as below:
$$  (6) 
Note that ${\pi}_{\bm{\theta}}$ in Fig. 1 combines the mapping function $\gamma $ described in (4) and the NN ${f}_{\bm{\theta}}$. Therefore, calculating $\frac{\partial C}{\partial \bm{\theta}}$ requires extra effort. A new technique is developed in this paper to calculate $\frac{\partial C}{\partial \bm{\theta}}$ which will be illustrated by the following example.
Gradient Calculation Example: In this example, we consider a network with $3$ controllers and set the hyperparameter $m$ in (4) to be $2$. Assume that the $3$ controllers’ priorities at time step $t$ is ${\bm{o}}_{t}=[{o}_{1,t},{o}_{2,t},{o}_{3,t}]$ and ${o}_{1,t}\ge {o}_{2,t}\ge {o}_{3,t}$. Then the sorted and normalized priorities ${\stackrel{\mathbf{~}}{\bm{o}}}_{t}$ can be represented as
$${\stackrel{\mathbf{~}}{\bm{o}}}_{t}=[\frac{{o}_{1,t}}{{\sum}_{j=1}^{3}{o}_{j,t}},\frac{{o}_{2,t}}{{\sum}_{j=1}^{3}{o}_{j,t}},\frac{{o}_{3,t}}{{\sum}_{j=1}^{3}{o}_{j,t}}]$$  (7) 
The dispatching probabilities in (4) can be determined as:
$${p}_{i,t}=\{\begin{array}{cc}{\stackrel{~}{o}}_{i,t}+0.5\times {\stackrel{~}{o}}_{3,t}=\frac{{o}_{i,t}+0.5\times {o}_{3,t}}{{\sum}_{j=1}^{3}{o}_{j,t}},i=1,2\hfill & \\ 0,i=3\hfill & \end{array}$$  (8) 
Following the chain rule, the gradient of ${\pi}_{\bm{\theta}}$ can be calculated as below if the controller with the ${i}^{th}$ highest priority is selected:
$\begin{array}{cc}& {\displaystyle \frac{\partial {\pi}_{\bm{\theta}}({\bm{s}}_{t},{a}_{t})}{\partial \bm{\theta}}}={\displaystyle \frac{\partial {p}_{i,t}}{\partial {o}_{1,t}}}{\displaystyle \frac{\partial {o}_{1,t}}{\partial \bm{\theta}}}+{\displaystyle \frac{\partial {p}_{i,t}}{\partial {o}_{2,t}}}{\displaystyle \frac{\partial {o}_{2,t}}{\partial \bm{\theta}}}+{\displaystyle \frac{\partial {p}_{i,t}}{\partial {o}_{3,t}}}{\displaystyle \frac{\partial {o}_{3,t}}{\partial \bm{\theta}}}\hfill \\ & ={\displaystyle \frac{\frac{\partial {o}_{i,t}}{\partial \bm{\theta}}+0.5\frac{\partial {o}_{3,t}}{\partial \bm{\theta}}}{{\sum}_{j=1}^{3}{o}_{j,t}}}{\displaystyle \frac{\left({o}_{i,t}+0.5{o}_{3,t}\right)\left(\frac{\partial {o}_{1,t}}{\partial \bm{\theta}}+\frac{\partial {o}_{2,t}}{\partial \bm{\theta}}+\frac{\partial {o}_{3,t}}{\partial \bm{\theta}}\right)}{{\left({\sum}_{j=1}^{3}{o}_{j,t}\right)}^{2}}}\hfill \end{array}$  (9) 
where $\frac{\partial {o}_{{n}_{c},t}}{\partial \bm{\theta}}=\frac{\partial {f}_{\bm{\theta}}}{\partial \bm{\theta}}$ is the gradient of the SF ${f}_{\bm{\theta}}$ given the controller state ${\bm{s}}_{{n}_{c},t}$.
Note that this new technique for calculating the derivative can be easily extended to the case with arbitrary number of controllers. With the help of TensorFlow [3], the derivative calculation can also be fully automated in our training system, regardless of how many controllers are involved.
IVC Training System Design
In this section, we will discuss how to simultaneously train both the SF ${f}_{\bm{\theta}}$ and the value function ${f}_{\bm{\omega}}$.
To understand the training process, we first define a network setting as a 3tuple $\text{\mathbf{E}\mathbf{n}\mathbf{v}}=(\bm{\alpha},\bm{\lambda},\bm{D})$ including the controller capacities, packet arrival rates, and communication latencies. An episode is further defined as one network simulation based on a specific setting Env, which starts from an initial state where no packets have started to flow through the network and ends when the simulation time reaches a predefined value ${t}_{max}$. Simulation details will be provided in Section V.
As shown in Algorithm 1, the training is performed in a sequential manner. In particular, ${N}_{set}$ network settings $[{\text{\mathbf{E}\mathbf{n}\mathbf{v}}}_{1},\mathrm{\dots},{\text{\mathbf{E}\mathbf{n}\mathbf{v}}}_{{N}_{set}}]$ will be simulated in sequence. For each Env, PPO is used to train the SF for ${N}_{ep}$ episodes. Within each episode, multiple learning iterations are performed.
In particular, a learning iteration consists of $n$timestep simulation. As shown in Fig. 2, within each learning iteration, the dispatching system equipped with ${f}_{{\bm{\theta}}_{old}}$ is used for dispatching requests in the network simulator for $n$ time steps. After collecting the corresponding information including states, actions, and rewards from the simulated network, PPO is further activated to train the SF.
V Simulation
This section reports the performance evaluation of the proposed RLbased approach for SFD.
Simulation Setting: In our simulation, we adopt the same NN architecture given in PPO [19] for both the SF ${f}_{\bm{\theta}}$ and value function ${f}_{\bm{\omega}}$, which is a fully connected multilayer perceptron with two hidden layers of 64 units and tanh activation. Meanwhile, we set the hyperparameters following the PPO Mujoco setting [19] and $m$ in (4) to be $2$. Each network setting is trained for $2$ episodes and each episode is initialized with $0\%$ utilization for all controllers and $0$ packets in the network. The requests are generated following the Poisson distributions with predefined arrival rates. Each episode runs for $240$ seconds which is assumed to be sufficiently long for the network to enter and stay in a stationary condition. For each episode, we set $1024$ simulation steps as one learning iteration.
During our simulation, the SF is trained with one switch since the learned SF will be individually deployed on each switch as we mentioned in IIIA and training the SF with one switch can also reduce the training costs. Although the SF is trained with one switch, it can effectively work in the network with multiple switches deployed, which will be demonstrated in the testing section.
Training Performance: We start with demonstrating the training performance when only one network setting (Env $1$ in Table I) is utilized for training. The learning curves showing the total reward and average response time averaged across five runs of the algorithm are demonstrated in Fig. 3. It can be observed from Fig. 3(a) that, the cumulative reward grows quickly from around $200,000$ to above $300,000$ within $60$s of the corresponding simulated network operation time. Similarly, a sharp decrease in average response time can also be observed from Fig. 3(b) which indicates that the trained SF can rapidly converge to the optimal performance.
Training  Testing  
Env 1  Env 2  Env 3  Env 4  Env 5  Env 6  Env 7  

5  4  6  15  20  15  25  
Num. controller ${N}_{c}$  2  2  2  3  4  

[9, 9]  [15, 6]  [9, 12]  [6, 9, 12]  [6, 9, 12, 15]  




\  \ 
In order to train a generally applicable SF, we then perform the training over a series of different network settings as described in Section IVC. In particular, $3$ network settings (Env $1$  Env $3$ shown in Table I) are used during the training process. The corresponding training performance is similar to Fig. 3, which is excluded here due to the space limit.
Testing Performance: To demonstrate the effectiveness of our proposed approach, we compare the trained SFs obtained at different training stages with two competing scheduling heuristics: random (Rand) and weighted roundrobin (Weighted RR) [10]. The topologies we use are South American (Env $4$ and Env $5$ in Table I) and Asian Sprint networks (Env $6$ and $7$ in Table I) [2]. Due to various network sizes, the number of controllers deployed in each network is different as shown in Table I. The locations of controllers are decided by kcenter [9]. Given the topology, the communication delay between any two nodes in the network is calculated using Dijkstra’s algorithm [7].
The simulation begins with Env 4 where $3$ controllers are deployed. The total number of requests generated by the entire data plane is average to be $15000$pkt/s. It can be seen from Fig. 4(a) that the response time of our SF is initially similar to Rand. However, as the SF is trained with more network settings, its response time significantly drops from $105$ms to $60$ms, which is $40\%$ lower than both Rand and Weighted RR. Similar conclusions can also be drawn from Fig. 4(b)(d) where different network settings are applied. This is expected because our SF takes both the controller capacity and communication delay into account during request scheduling. On the other hand, Rand evenly schedules requests regardless of the communication delay or the controller capacity. Compared with Rand, Weighted RR distributes requests based on the controller capacity and achieves slightly better performance in general. However, in the network that spans large geographic areas, the communication delay contributes a significant proportion to the average response time. Solely considering the controller capacity obviously cannot achieve good performance.
To verify this, we compare the request distributions of different approaches used in a switch in Env $6$. We can see from Fig. 5 that both Rand and Weighted RR schedule requests as we expected. On the other hand, instead of dispatching requests to all controllers as both Rand and Weighted RR do, our SF only schedules requests to controllers with low communication delay (i.e., Ctl1 and Ctl2) without overloading them, achieving the lowest response time ($30$ms).
Another interesting phenomenon we can notice from Fig. 4(c) is that Rand outperforms Weighted RR on response time in Env $6$. The most likely reason is that the controller with smaller capacity is placed nearer to the switches compared to the largercapacity one in the network. Thus, Rand sends more requests to nearby controllers, potentially reducing the response time. Apart from that, all three approaches achieve similar throughput in Fig. 4(a) and Fig. 4(c), which is expected since the overall arrival rate is still below the whole network capacity and none of the controllers is overloaded.
We also compare the performance of our trained SF with a different request arrival rate. As shown in Fig. 4(b), we notice that the throughput of Rand is $700$pkt/s smaller than Weighted RR due to controller overloading. In particular, the total arrival rate in Fig. 4(b) is $20000$pkt/s and each controller will evenly receive around $6666$pkt/s in Rand. Thus, the controller with $6000$pkt/s capacity will be inevitably overloaded, leading to the increase of response time and decrease of throughput. In comparison to Rand, the throughput of our trained SF increases as more network settings are trained, which effectively sends requests to near controllers without overloading them. Similar phenomenon can be seen in Fig. 4(d).
Furthermore, it should be noted that, although our SF is trained in networks with two controllers, our testing results clearly show that it can generalize well to networks with more controllers. Therefore, our RL approach can design general and efficient SFs.
VI Conclusions
To effectively utilize the multicontroller resources in SDN, it is of great importance to design an SF that dispatches requests from switches to appropriate controllers. Motived by this, we propose an RLbased approach to solve the SFD problem by automatically learning an effective and generally applicable SF. Specifically, we formulate the SFD problem as an RL problem and a dispatching system is developed where an SF is in the form of an NN to calculate the priority of every controller. After that, a specially designed selection scheme is applied to make the final dispatching decision using the obtained priorities. Along with the new dispatching system, a new training approach is developed which constantly improves the performance of the SF under different network settings via an adapted RL algorithm. Our simulation study showed that by using the newly proposed training approach, the SF can quickly converge to the optimal performance. Apart from that, the trained SF can generalize well and achieve significantly better performance compared with other heuristics.
References
 [1] “Google Tensor Processing Unit,” https://goo.gl/L4AJe3.
 [2] “Sprint,” https://www.sprint.net/performance/.
 [3] “TensorFlow,” https://www.tensorflow.org/.
 [4] P. Berde et al., “Onos: towards an open, distributed sdn os,” in Proceedings of the Third Workshop on Hot Topics in Software Defined Networking. ACM, 2014, pp. 1–6.
 [5] J. Branke, S. Nguyen, C. W. Pickardt, and M. Zhang, “Automated design of production scheduling heuristics: A review,” IEEE Transactions on Evolutionary Computation, vol. 20, no. 1, pp. 110–124, 2016.
 [6] S. Chinchali, P. Hu, T. Chu, M. Sharma, M. Bansal, R. Misra, M. Pavone, and K. Sachin, “Cellular network traffic scheduling with deep reinforcement learning,” in AAAI, 2018.
 [7] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische mathematik, vol. 1, no. 1, pp. 269–271, 1959.
 [8] A. A. Dixit, F. Hao, S. Mukherjee, T. Lakshman, and R. Kompella, “Elasticon: an elastic distributed sdn controller,” in Proceedings of the Tenth ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ACM, 2014, pp. 17–28.
 [9] B. Heller, R. Sherwood, and N. McKeown, “The controller placement problem,” in Proceedings of the first workshop on Hot topics in software defined networks. ACM, 2012, pp. 7–12.
 [10] V. Huang, Q. Fu, G. Chen, E. Wen, and J. Hart, “Blac: A bindingless architecture for distributed sdn controllers,” in 2017 IEEE 42nd Conference on Local Computer Networks (LCN). IEEE, 2017, pp. 146–154.
 [11] T. Koponen et al., “Onix: A distributed control platform for largescale production networks.” in OSDI, vol. 10, 2010, pp. 1–6.
 [12] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 2016, pp. 50–56.
 [13] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in ICML, 2016, pp. 1928–1937.
 [14] S. Nguyen, M. Zhang, M. Johnston, and K. C. Tan, “Automatic design of scheduling policies for dynamic multiobjective job shop scheduling via cooperative coevolution genetic programming,” IEEE Transactions on Evolutionary Computation, vol. 18, no. 2, pp. 193–208, 2014.
 [15] J. Park, Y. Mei, S. Nguyen, G. Chen, and M. Zhang, “Investigating a machine breakdown genetic programming approach for dynamic job shop scheduling,” in EuroGP. Springer, 2018, pp. 253–270.
 [16] P. Salot, “A survey of various scheduling algorithm in cloud computing environment,” International Journal of Research in Engineering and Technology, vol. 2, no. 2, pp. 131–135, 2013.
 [17] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in ICML, 2015, pp. 1889–1897.
 [18] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “Highdimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
 [19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [20] R. S. Sutton and A. G. Barto, Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135.
 [21] G. Tesauro, N. K. Jong, R. Das, and M. N. Bennani, “A hybrid reinforcement learning approach to autonomic resource allocation,” in Autonomic Computing, 2006. ICAC’06. IEEE International Conference on. IEEE, 2006, pp. 65–73.
 [22] G. Wang, Y. Zhao, J. Huang, and Y. Wu, “An effective approach to controller placement in software defined wide area networks,” IEEE Transactions on Network and Service Management, vol. 15, no. 1, pp. 344–355, 2018.
 [23] T. Wang, F. Liu, J. Guo, and H. Xu, “Dynamic sdn controller assignment in data center networks: Stable matching with transfers,” in Proc. of INFOCOM, 2016.
 [24] W. Wang and M. A. CarreiraPerpinán, “Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application,” arXiv preprint arXiv:1309.1541, 2013.