Abstract
Load imbalance pervasively exists in distributed deep learning trainingsystems, either caused by the inherent imbalance in learned tasks or by thesystem itself. Traditional synchronous Stochastic Gradient Descent (SGD)achieves good accuracy for a wide variety of tasks, but relies on globalsynchronization to accumulate the gradients at every training step. In thispaper, we propose eagerSGD, which relaxes the global synchronization fordecentralized accumulation. To implement eagerSGD, we propose to use twopartial collectives: solo and majority. With solo allreduce, the fasterprocesses contribute their gradients eagerly without waiting for the slowerprocesses, whereas with majority allreduce, at least half of the participantsmust contribute gradients before continuing, all without using a centralparameter server. We theoretically prove the convergence of the algorithms anddescribe the partial collectives in detail. Experimental results onloadimbalanced environments (CIFAR10, ImageNet, and UCF101 datasets) showthat eagerSGD achieves 1.27x speedup over the stateoftheart synchronousSGD, without losing accuracy.
Quick Read (beta)
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Abstract.
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eagerSGD, which relaxes the global synchronization for decentralized accumulation. To implement eagerSGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on loadimbalanced environments (CIFAR10, ImageNet, and UCF101 datasets) show that eagerSGD achieves 1.27$\times $ speedup over the stateoftheart synchronous SGD, without losing accuracy.
1
1. Motivation
Deep learning models are on a steep trajectory to becoming the most important workload on parallel and distributed computer systems. Early convolutional networks demonstrated groundbreaking successes in computer vision, ranging from image classification to object detection (Huang et al., 2017; Simonyan and Zisserman, 2014). More recent developments in recurrent and transformer networks enable impressive results in video classification, natural language processing for machine translation, question answering, text comprehension, and synthetic text generation. The latter models contain more than 1.5 billion parameters and take weeks to train (Devlin et al., 2018; Radford et al., 2018). Other demanding neural networks are trained on the largest supercomputers to achieve scientific breakthroughs (Mathuriya et al., 2018; Kurth et al., 2018). Furthermore, the models are growing exponentially in size, OpenAI is predicting a 10x growth each year (Amodei and Hernandez, 2018) potentially leading to artificial general intelligence. In order to support this development, optimizing the training procedure is most important.
The training procedure of deep learning is highly parallel but dominated by communication (BenNun and Hoefler, 2018). Most parallel training schemes use data parallelism where full models are trained with parts of the dataset and parameters are synchronized at the end of each iteration. The total size of allreduce grows with the model size, which ranges from a few megabytes (Huang et al., 2017) to several gigabytes (Radford et al., 2018) and grows quickly. The allreduce operation is not atomic and it can be split into layerwise reductions, which can easily be overlapped with the layer computation using nonblocking collectives (Hoefler et al., 2007; Awan et al., 2017). Yet, the optimal scaling of an allreduce of size $S$ is at best $\mathcal{O}\left(\mathrm{log}P+S\right)$ in $P$ processes (Patarasuk and Yuan, 2009; Hoefler and Moor, 2014; Renggli et al., 2018). Thus, growing process counts will reduce the parallel efficiency and eventually make the reduction a scaling bottleneck.
The communication aspects of deep learning have been investigated in many different contexts (Sergeev and Del Balso, 2018; Renggli et al., 2018), see the survey for an overview (BenNun and Hoefler, 2018). In this work, we identify load imbalance as an additional barrier to scalability. When some processes finish the computation later than others, all processes will wait for the last one at the blocking allreduce function. Load imbalance can be caused by the system itself, for example, when training on multitenant cloud systems (Schad et al., 2010; Iosup et al., 2011; Jackson et al., 2010) or by system or network noise (Hoefler et al., 2009, 2010) in highperformance machines. A second, and more prominent cause of imbalance is inherent imbalance in the computation that causes varying load across different processes. While noise from the system is generally low on wellmaintained HPC machines (Hoefler et al., 2010), the inherent load imbalance of the training workloads cannot easily be avoided. Natural language processing tasks have sentences of highly varying length while video processing tasks have videos with different number of frames. For example, the training dataset of UCF101 (Soomro et al., 2012) contains videos that range from 29 to 1,776 frames.
Several researchers have shown that the training process itself is quite robust with respect to bounded errors. In fact, data augmentations such as Cutout (Devries and Taylor, 2017) and Dropout (Ba and Frey, 2013) introduce random errors and omissions into the training process to improve generalization properties. Several packages take advantage of this robustness and employ three techniques in tandem: (1) communicated weights are quantized to more compact number representations (Seide et al., 2014; Strom, 2015), (2) only the most significant weights are sent during each allreduce (Renggli et al., 2018; Alistarh et al., 2018), and (3) updates are only sent to limited (random) neighborhoods using gossip algorithms (Lian et al., 2018). We propose to exploit this robustness in a new way: we perform the allreduce eagerly in that we ignore the input gradients of processes that come late in order to not delay all processes. The communication partners are selected based on their workload (which can be randomized) and the allreduce itself is performed with highperformance reduction topologies (Hoefler and Moor, 2014) in logarithmic depth. We call our method eager Stochastic Gradient Decent (eagerSGD), as a counterpart to synchronous SGD (synchSGD) (BenNun et al., 2019; Sergeev and Del Balso, 2018; Awan et al., 2017). Fig. 1 shows the difference between synchSGD and eagerSGD.
Specifically, we propose to relax the allreduce operation to partial collectives in eagerSGD. A partial collective is an asynchronous operation where a subset of the processes can trigger and complete the collective operation. Absentee processes follow a predefined protocol, such as contributing potentially outdated data. We define two partial collectives — solo allreduce, a waitfree operation that one process triggers; and majority allreduce, in which the majority must participate.
Our theoretical analysis shows that solo allreduce does not guarantee bounded error, as necessary in SGD, yet empirically converges in cases of moderate load imbalance. Majority allreduce is proven to bound the error, but is not completely waitfree. The statistical guarantee, however, is sufficient to both train deep neural networks and avoid the delays. We show that solo and majority collectives are suitable for different cases, depending on load imbalance severity. Our main contributions are:

•
A detailed analysis of workload imbalance in deep learning training.

•
Definition and implementation of partial collectives, specifically solo and majority allreduce.

•
EagerSGD for asynchronous decentralized distributed training of neural networks with proof of convergence.

•
An experimental study of convergence and training speed for multiple networks, achieving 1.27$\times $ speedup over SGD on a video classification task without losing accuracy.
2. LoadImbalance in Deep Learning
Load imbalance widely exists in the training of deep learning models, which can be caused by either the applications or the system itself (Schad et al., 2010; Iosup et al., 2011; Jackson et al., 2010; Hoefler et al., 2009, 2010).
2.1. Video Processing
Long shortterm memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a type of unit cell in Recurrent Neural Networks (RNN). In video classification tasks, LSTMs are used (YueHei Ng et al., 2015a; Donahue et al., 2014; Ballas et al., 2015) to process a sequence of frames for a video as input (optionally following convolutional neural networks that preprocess the images to features), and output a probability distribution over a set of classes. Due to the recurrent structure of the network, the computational overhead is proportional to the number of frames in the input video.
Fig. 1(a) shows the video length distribution (number of frames) over all 9,537 videos in the training dataset of UCF101 (Soomro et al., 2012). The video length is distributed between 29 and 1,776 frames, with a median frame count of 167 and standard deviation of 97. Fig. 1(b) shows the runtime distribution over the 1,192 sampled batches in two epochs to train a 2,048wide singlelayer LSTM model on video frame features. As is standard in variablelength training, videos with similar lengths are grouped into buckets for performance. The runtime is distributed from 201 ms to 3,410 ms, with a mean runtime of 1,235 ms and standard deviation of 706 ms. These statistics above show that training an LSTM model for video classification exhibits inherent load imbalance.
2.2. Language Processing
Transformers (Vaswani et al., 2017) are sequencetosequence models that translate a sequence of words from one language to another. Different from RNN, a Transformer network replaces the recurrent structure with an attention mechanism. To train the Transformer model, the computation overhead increases with the length of the input (and output) sentences. Typically, the sentences in the training dataset for a language model have various lengths, and thus the workload is unbalanced across different batches. Fig. 3 shows the runtime distribution over the 20,653 randomly sampled batches in 1/3 epoch to train a Transformer on the WMT16 dataset. The runtime is distributed from 179 ms to 3,482 ms with a mean of 475 ms and standard deviation of 144 ms, which shows the inherent load imbalance in language model training.
2.3. Training in the Cloud
Performance variability is common in cloud computing (Schad et al., 2010; Iosup et al., 2011; Jackson et al., 2010). Fig. 4 shows the runtime distribution over the sampled batches for 5 epochs of training for the classic ResNet50 model (He et al., 2016) on ImageNet (Deng et al., 2009), on a standard Google Cloud instance (n1standard16 with 2x Nvidia V100 GPUs). The runtime is distributed from 399 ms to 1,892 ms with a mean of 454 ms and standard deviation of 116 ms. Since ResNet50 on ImageNet has the same input size for different batches, the load imbalance is caused mainly by the system. Compared with imbalanced applications (e.g., Transformer, LSTM), the load imbalance on cloud servers is relatively light.
3. Distributed Deep Learning
Deep neural networks are continuously differentiable functions that are composed of multiple operators, representable by a directed acyclic graph (LeCun et al., 2015). The gradient of those functions can be computed using the backpropagation algorithm (LeCun et al., 1998), processing the nodes in the DAG in a reverse topological order. Deep learning frameworks, such as TensorFlow (Abadi et al., 2015), typically execute parallel operations in the DAG in arbitrary order.
Supervised deep neural network training typically involves firstorder optimization in the form of Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951). SGD optimizes the expected loss value over the “true” distribution of input samples by descending in the direction of a random subset of the training samples (minibatch). In a distributed dataparallel setting, the SGD algorithm (Algorithm 3) consists of multiple learner processes, each of which updates a global view of the parameters $w$ according to a different random minibatch at the same time. Given an update rule $U$ and local minibatch of size $B$, the learners modify the global view of the parameters by using an average of the gradients ${G}_{t}$ obtained by the agents.
[h] \[email protected]@algorithmic[1] \For$t=0$ to $T$ \State$\overrightarrow{x},\overrightarrow{y}\leftarrow $ Sample $B$ elements from dataset \State${w}_{t}\leftarrow $ Obtain parameters from global view \State$\overrightarrow{z}\leftarrow \mathrm{\ell}({w}_{t},\overrightarrow{x},\overrightarrow{y})$ \State${G}_{t}\leftarrow \frac{1}{B}{\mathrm{\Sigma}}_{i=0}^{B}\nabla \mathrm{\ell}({w}_{t},{\overrightarrow{z}}_{i})$ \State$\mathrm{\Delta}w\leftarrow U({G}_{t},{w}_{(0,\mathrm{\dots},t)},t)$ \StateUpdate global view of parameters to ${w}_{t}+\mathrm{\Delta}w$ \EndFor
A straightforward manner to maintain a global view is using a Parameter Server (PS) architecture (Dean et al., 2012), where one or several nodes assume the role of a PS, broadcasting uptodate weights (line 3) to learners prior to each step and aggregating gradients from them (line 7). This enables the PS to asynchronously update the global view (Recht et al., 2011), or require a fraction of learners to send gradients before progressing to the next step (Ho et al., 2013).
As the PS model is generally not scalable, another mode of operation implements SGD using collective operations. In such implementations, accumulating the gradients (line 7) is done via an allreduce operation, where each learner contains its own local view of the weights (BenNun and Hoefler, 2018). Horovod (Sergeev and Del Balso, 2018) is one such implementation over the TensorFlow framework, which also fuses several allreduce operations into one in order to reduce overhead. However, due to the arbitrary order of execution imposed by the frameworks, Horovod uses a master process for negotiation communication (achieving consensus on which parameters are sent).
A more scalable method, used in the Deep500 DSGD optimizer (BenNun et al., 2019), is to ensure an order of communication execution by adding control dependencies into the computation DAG, as shown in Fig. 5. In the backward pass, the allreduce operations are executed in a specific order after finishing the local gradient computation. Note that synchronizing gradient order can be avoided completely using nonblocking collectives (Message Passing Interface Forum, 2015). In this mode, each gradient communication message is assigned to an agreedupon numeric tag, and multiple allreduce operations may be inflight concurrently. Prior to updating the local view of the weights, a waitall command must be issued. All in all, these approaches reduce overhead in imbalanced loads by overlapping communication and computation, but do not mitigate it completely.
4. Partial Collective Operations
A collective communication involves a set of processes that cooperate to progress their internal state. Some of these operations, e.g., allreduce, implicitly synchronize the participants: the operation cannot terminate before the slowest process joins it. We define these collectives as synchronous and introduce a new class of partial collectives that relax the synchronization. We now discuss two variants of partial collectives: solo and majority.
4.1. Solo Collectives
A solo collective (Di Girolamo et al., 2015) is a waitfree operation, which forces the slow processes to execute the collective as soon as there is one process executing it. This process, called initiator, is in charge of informing the others to join the collective. While solo collectives remove the synchronization delays, they change the semantics of collective operations, which may now complete by using stale data from the slow processes.
4.1.1. Schedule Activation
We define a schedule as a set of operations that a process executes in order to globally progress the collective operation. In particular, a schedule is a directed acyclic graph (DAG) where the vertices are operations and the edges are happensbefore dependencies among them. We define the following operations:

•
Pointtopoint communications: sends and receives.

•
Computations: simple computations defined between two arrays of data items. The type of the data items is defined according to the MPI basic types (Message Passing Interface Forum, 2015).

•
Nonoperations (NOP): complete immediately and are only used to build dependencies.
Operations can be dependent on zero, one, or more other operations (with and or or logic) of the same schedule.
The main difference between synchronous and solo collectives is the time at which processes activate (i.e., starts executing) their schedule. For synchronous collectives, the schedule is executed only when a process reaches the collective function call (e.g., MPI_Allreduce). We define this activation as internal. For solo collectives, an external activation is also possible: the processes start executing the schedule because of an activation message received from the initiator, which starts broadcasting it immediately after the internal activation of its schedule. In particular, a solo collective is composed of two schedules: one for broadcasting the activation and the other one for executing the collective operation.
In a solo collective, any process can become the initiator, hence any process must be capable of broadcasting the activation message. The activation broadcast is implemented as a modified version of the recursive doubling communication scheme: this is equivalent to the union of $P$ binomial trees (optimal for small message broadcast, like the activation) rooted at the different nodes.
Activation example
Fig. 6 shows a solo allreduce example. On the left, we show the global communications view that is split in two phases: activation and allreduce. The highlighted communication shows the activation path if the initiator is, e.g., process P3. For the allreduce, we use a recursive doubling implementation. Note that any collective implementation that can be expressed as a schedule can be linked to the activation phase. On the right we show the internal schedule of process P3. An internal activation (i.e., P3 making the function call explicitly) translates in the execution of NOP 0 (N0): this leads to the send operations S0 and S1 being fired to start broadcasting the activation message and to the execution of N1, which signals the activation of the allreduce schedule. Alternatively, if P3 is not the initiator, it will receive a message in receive R0 or R1: if the activation is received by R0, then P3 has to forward the activation message to P1 with send S1 (i.e., P3 is an internal node of the activation binomial tree). Also in this case NOP N1 will be executed, leading to the execution of the allreduce schedule.
Multiple initiators
Multiple processes may join the collective at the same time: in this case we need to ensure that the collective is executed only once. To address this issue, we set the operations to be consumable, meaning that the same operation cannot be executed twice. For example, let us assume that nodes P2 and P3 reach their internal activation at the same time. When P3 receives the activation message from P2 (i.e., through R0) there are two possible cases: 1) S1 is still not consumed and then it is executed; 2) S1 has been fired due to the internal activation and will not be executed a second time. NOPs are also consumable, hence N1 (i.e., the activation) can be executed only once.
Persistent schedules
Processes can be asked to join a solo collective only once before they reach their internal activation: once the schedule is executed, it needs to be recreated by the application in order to be executed again. To enable multiple asynchronous executions of solo collectives, we introduce persistent schedules. Such schedules transparently replicate themselves once executed, able to serve a new solo collective without requiring application intervention. Multiple executions of the same solo collective overwrite the data in the receive buffer, which always contains the value of the latest execution.
4.2. Majority Collectives
An issue of solo collectives is that if one or few processes are always faster then the others, then the collective will always complete by taking the stale data of the slower processes. In cases like DNN training, this scenario may negatively impact the convergence because the training will advance only considering the updates of few processes. To overcome this issue, we introduce majority collectives, which requires at least half of the processes to join before completing. We implement majority collectives by not letting any process become the initiator, as in solo collectives. Instead, at each execution of a persistent schedule, the processes designate an initiator by randomly selecting a rank (consensus is achieved by using the same seed for all the processes). When a process joins the collective (i.e., internal activation), it checks whether it is the designated initiator: only in that case it keeps running the internal activation followed by the actual collective schedule.
We now discuss how the above described implementation can provide a statistical guarantee that at least half of the processes on average contribute to the collective. Suppose the same collective operation is called by many iterations, such as in model training. We sort all the $P$ processes by the time they reach a collective operation. Since the probability that any process is specified as the initiator is equal to 1/$P$, the expectation of the randomly specified initiator is the $P$/2th process among the sorted processes, namely on average half of the processes reach the collective operation earlier than the initiator. For a workload distribution with one mode and a tail, such as in Figs. 2, 3, and 4, the probability that part of the processes reach the collective at a similar time to the initiator is high; then, more than half of the processes on average actively participate in the operation.
4.3. Asynchronous Execution by Library Offloading
The schedule of a partial collective can be asynchronously executed with respect to the application. We develop a communication library that allows to express communication schedules and offload their execution to the library itself. The schedule execution can take place on the application thread (i.e., when the application enters the library), or on an auxiliary thread. Once the application creates and commits a schedule, the library starts executing all the operations that have no dependencies. The remaining ones are executed as their dependencies are satisfied.
4.4. Discussion
Offloading the schedule execution to the network interface card (NIC) can provide different advantages such as asynchronous execution, lower latency, and streaming processing. Di Girolamo et al. (Di Girolamo et al., 2015) show how solo collectives can be offloaded to Portals 4 (Barrett et al., 2018) NICs by using triggered operations. This approach is limited by the amount of NIC resources that bounds the number of times a persistent schedule can be executed without application intervention. In fact, persistent schedules cannot replicate themselves with Portals 4: if we want to asynchronously execute the same schedule $n$ times that we need to make and offload $n$ of its copies. After these executions, the host CPU needs to setup the schedule again. This limit can be removed by implementing the schedule execution with the sPIN programming model (Hoefler et al., 2017), which allows to execute userdefined code on the NIC. A sPIN implementation of partial collective operations would then be able to replicate the schedule onthefly upon completion.
5. EagerSGD
Algorithm 5 illustrates the main procedure of eagerSGD. Instead of calling a synchronous allreduce in the distributed optimizer (Fig. 5) to accumulate the gradients, eagerSGD uses the partial allreduce operations (Line 7). Either solo or majority allreduce can be used depending on the severity of load imbalance.
Fig. 7 presents an example of how eagerSGD works with partial collectives, in which w${}^{p}_{t}$ and G${}^{p}_{t}$ represent the weights and the gradients calculated on process $p$ at training step $t$, respectively, and $U(G,w)$ represents the update rule. In step $t$, suppose process P1 is faster than process P0. P1 finishes the computation of G${}^{1}_{t}$ and then triggers the partial allreduce operation. Since P0 does not finish the computation of G${}^{0}_{t}$ at this time, it only passively contributes null gradients G${}_{null}$ to the partial allreduce at step $t$. After P0 finishes the computation of G${}^{0}_{t}$, it finds out that the partial allreduce at step $t$ is already finished by checking the results in the receive buffer. P0 updates the weights of step $t+1$ using G${}^{1}_{t}$ stored in the receive buffer of the partial allreduce and G${}^{0}_{t}$ becomes the stale gradients. The stale gradient G${}^{0}_{t}$ is then stored in the send buffer. If P0 does not catch up with P1 at step $t+1$, P0 will passively participate in the partial allreduce again and contribute G${}^{0}_{t}$. If P0 catches up with P1 at step $t+1$ (as in the case shown in Fig. 7), P0 will add G${}^{0}_{t}$ and G${}^{0}_{t+1}$ (calculated in step $t+1$) together, and contribute the accumulated gradients G${}^{0\prime}_{t+1}$ to the partial allreduce; P0 resets the send buffer to G${}_{null}$ after finishing allreduce.
[t] \[email protected]@algorithmic[1] \State$b$ is local batchsize for $P$ processes \For$t=0$ to $T$ \State$\overrightarrow{x},\overrightarrow{y}\leftarrow $ Each process samples $b$ elements from dataset \State$\overrightarrow{z}\leftarrow \mathrm{\ell}({w}_{t},\overrightarrow{x},\overrightarrow{y})$ \State${G}_{t}^{local}\leftarrow \frac{1}{b}{\mathrm{\Sigma}}_{i=0}^{b}\nabla \mathrm{\ell}({w}_{t},{\overrightarrow{z}}_{i})$ \State${G}_{t}^{global}\leftarrow \frac{1}{P}partial\mathrm{\_}allreduce\left({G}_{t}^{local}\right)$ \State$\mathrm{\Delta}w\leftarrow U({G}_{t}^{global},{w}_{(0,\mathrm{\dots},t)},t)$ \State${w}_{t+1}\leftarrow {w}_{t}+\mathrm{\Delta}w$ \EndFor
In severe load imbalance situations, some slower processes may lag behind by more than one step. The data in the receive buffer of the partial allreduce will then be overwritten and only the latest data in the receive buffer can be seen, which results in different weights on different processes. This may result in slightly lower accuracy as shown in Section 6.2.2. Thus, we periodically synchronize the models across all processes to eliminate the side effect. Since we only synchronize the models every tens of epochs, the overhead can be ignored.
5.1. Correctness and Convergence Guarantees
5.1.1. System Model
In this section, we prove that, under a reasonable set of modeling assumptions, the eagerSGD algorithm will still converge. We assume a system with $P$ asynchronous processors indexed as $i\in \{0,1,\mathrm{\dots},P1\}$, which take steps at different speeds.
For the purposes of analysis, we can split the execution in serial fashion into rounds, where each round can be mapped to the partialallreduce of corresponding index. Without loss of generality, we assume that each processor participates in each round $t$, since it eventually submits an update to the corresponding partialallreduce, which we denote by $\text{\U0001d670\U0001d673\U0001d682}(t)$, for asynchronous distributed sum. However, its update may or may not be delivered to the other processors. Each partialallreduce has the following semantics:

•
(Invocation) Each process $i$ proposes a $d$dimensional vector ${R}_{t}^{i}$, corresponding to its current proposed update, to $\text{\U0001d670\U0001d673\U0001d682}(t)$.

•
(Response) Each process $i$ receives a tuple $\u27e8{U}_{t},{s}_{t}^{i}\u27e9$, where ${U}_{t}$ is the $d$dimensional update to the parameter set corresponding to round $t$, as decided by the shared object $\text{\U0001d670\U0001d673\U0001d682}(t)$, and ${s}_{t}^{i}$ is a boolean stating whether the update by process $i$ has been included in ${U}_{t}$.
We can therefore rephrase the algorithm as having each process invoke the $\text{\U0001d670\U0001d673\U0001d682}(t)$ object in each round, with its current update. If its update is not “accepted” (${s}_{t}^{i}=\text{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}$) then the processor simply adds it to its update in the next iteration. The ADS objects we implement provide the following guarantees.
Lemma 5.1 ().
Each ADS object ensures the following:

(1)
(Liveness) The $\text{\U0001d670\U0001d673\U0001d682}(t)$ object eventually returns an output at every invoking process.

(2)
(Safety) The output is consistent, in the sense that (1) it is a correct average of a subset of the proposed updates in the round; (2) the returned bits reflect its composition; and (3) the output is the same at every invoking process.

(3)
(Quorum Size) The subset of proposed updates included in the output is of size $Q\ge 1$, where $Q$ is a lower bound parameter ensured by the algorithm.

(4)
(Staleness Bound) There exists a bounded parameter $\tau $ such that any update by a process can be rejected by the ADS objects for at most $\tau $ consecutive rounds from the time it was generated before being accepted.
5.1.2. Convergence Proof
We now show that these properties are sufficient for eagerSGD to ensure convergence for a standard class of smooth nonconvex objectives. In the following, all norms are ${\mathrm{\ell}}_{2}$norms, unless otherwise stated.
Assumption 1 (Loss Function).
We assume that our objective loss function $f\mathrm{:}{\mathrm{R}}^{d}\mathrm{\to}\mathrm{R}$ satisfies the following:

•
(Lower Bound) The function $f$ is bounded from below, that is, there exists a finite value $m$ such that, $\forall \overrightarrow{x}\in {\mathbb{R}}^{d},f(\overrightarrow{x})\ge m$.

•
(Smoothness) The function $f$ is $L$smooth, i.e.
$$\forall \overrightarrow{x},\overrightarrow{y}\in {\mathbb{R}}^{d},\parallel \nabla f\left(\overrightarrow{x}\right)\nabla f\left(\overrightarrow{y}\right)\parallel \le L\parallel \overrightarrow{x}\overrightarrow{y}\parallel \text{\mathit{f}\mathit{o}\mathit{r}}L>0.$$
Further, we make the following standard assumptions about the gradients generated by the nodes:
Assumption 2 (Gradients).
For any round $t$ and process $i$, the gradients ${G}_{t}^{i}$ generated by the processes satisfy the following, where expectations are taken with respect to the random data sampling at round $t$.

•
(Unbiasedness) $\forall \overrightarrow{x}\in {\mathbb{R}}^{d},\mathbb{E}\left[{G}_{t}^{i}(\overrightarrow{x})\right]=\nabla f(\overrightarrow{x}),$

•
(Second Moment Bound) There exists $M$ s.t.:
$$\forall \overrightarrow{x}\in {\mathbb{R}}^{d},\mathbb{E}\left[{\parallel {G}_{t}^{i}\left(\overrightarrow{x}\right)\parallel}^{2}\right]\le {M}^{2}.$$
Convergence Bound
We can now prove the following:
Theorem 5.2 (EagerSGD Convergence).
Consider an arbitrary objective function $f$ and gradient sampling scheme satisfying Assumptions 1 and 2. Fix the success parameter $\u03f5\mathrm{>}\mathrm{0}$. Then, there exists a small constant learning rate value
$$\alpha \le \mathrm{min}(\frac{\sqrt{\u03f5P}}{\sqrt{12L\tau M(PQ)}},\frac{\u03f5P}{4{L}^{3}\tau M(PQ)},\frac{\u03f5}{12{M}^{2}L})$$ 
such that we execute the eagerSGD algorithm for $T\mathrm{=}\mathrm{\Theta}\mathit{}\mathrm{\left(}\frac{f\mathit{}\mathrm{(}{w}_{\mathrm{0}}\mathrm{)}\mathrm{}m}{\u03f5\mathit{}\alpha}\mathrm{\right)}$ iterations, we are guaranteed to reach an iterate ${w}_{{t}^{\mathrm{\star}}}$ with $\mathrm{1}\mathrm{\le}t\mathrm{\le}T$ such that
$${\parallel \nabla f({w}_{{t}^{\star}})\parallel}^{2}\le \u03f5.$$ 
Discussion
We make the following observations regarding the bound. First, we note that, since we analyze nonconvex objectives, we must settle for a weaker notion of convergence than in the convex case (where we can prove convergence to a global minimum): specifically, we prove that, for a given sequence of learning rates, the algorithm will converge to a point of negligible gradient. Second, we note the linear dependence in $\tau $ and $(PQ)$ for the number of iterations to convergence, i.e.:
$$T\ge \mathrm{\Theta}\left((f({w}_{0})m)\tau (PQ)/P{\u03f5}^{2}\right).$$ 
Thus, we would like the maximum delay and the number of “missed” gradients per round to be minimized. However, obviously, having no stragglers would imply higher synchronization cost. This suggests that, in practice, the algorithm should trade off the additional performance cost of synchronization with the slower convergence due to delayed gradient information.
Tasks  Models  Parameters  Train data size  Batch size  Epochs  Processes 

Hyperplane regression  Onelayer MLP  8,193  32,768 points  2,048  48  8 
Cifar10  ResNet32 (He et al., 2016)  467,194  50,000 images  512  190  8 
ImageNet (Deng et al., 2009)  ResNet50 (He et al., 2016)  25,559,081  1,281,167 images  8,192  90  64 
UCF101 (Soomro et al., 2012)  Inception+LSTM (YueHei Ng et al., 2015b)  34,663,525  9,537 videos  128  50  8 
6. Evaluation
Experiments are conducted on the CSCS Piz Daint supercomputer with Cray Aries interconnect. Each XC50 compute node contains a 12core Intel Xeon E52690 CPU with 64 GiB RAM, and one NVIDIA Tesla P100 GPU. The communication library is Cray MPICH 7.7.2. We use one MPI process per node and utilize the GPU for acceleration in all following experiments. First, we evaluate the performance of the partial collective operations using a microbenchmark. Then, we use the different neural networks summarized in Table 1 to compare our eagerSGD with the stateoftheart synchSGD implementations (Horovod (Sergeev and Del Balso, 2018) and Deep500 (BenNun et al., 2019)), under both simulated and real workload imbalance environments.
6.1. Partial Allreduce Operations
We design a microbenchmark, shown in Fig. 8, to evaluate the performance of partial allreduce operations and MPI_Allreduce (Cray MPICH) with unbalanced workload. All the processes are linearly skewed (line 4) before calling the collective operations and the average latency among all the processes is recorded (lines 710). The microbenchmark is a special case with severe load imbalance, which is useful to verify the statistical guarantee of majority allreduce. Experimental results on 32 processes are presented in Fig. 9. Compared with MPI_Allreduce, solo and majority allreduce operations reduce the latency by on average 53.32x and 2.46x, respectively. This is because all the processes (except the slowest one) for MPI_Allreduce are delayed; solo allreduce is not delayed since the fastest process will trigger the operation immediately; and majority allreduce has to wait for a randomly specified process to trigger the operation, and thus it is moderately delayed.
For the partial collective operations, we refer to the initiator together with the processes that arrive at the operation before the initiator as the active processes, which contribute the latest data (line 5 in Fig. 8). The other processes only contribute null values (line 13). For solo allreduce, since the fastest process is the initiator and all the processes are fully skewed, the Number of Active Processes (NAP) is around 1, as shown in Fig. 9. For majority allreduce, since the initiator is randomly specified, the expectation of NAP is half of the total processes. On average 16 out of 32 processes for majority allreduce are active processes, which means half of the processes contribute the latest data when the processes are fully skewed.
6.2. Throughput and Convergence with Simulated Dynamic Workload Imbalance
We use three networks shown in Table 1, including a multilayer perceptron (MLP), ResNet32, and ResNet50, to evaluate the performance of eagerSGD with simulated workload imbalance. From the application perspective, these three networks have balanced workload during the distributed training, since each batch has equivalent workload. We manually inject delays to simulate the dynamic load imbalance environment caused by the training system, as discussed in Section 2.3.
6.2.1. Hyperplane Regression, Light Load Imbalance
We generate both training and validation datasets for a 8,192dimensional hyperplane regression using the equation: $y={a}_{0}{x}_{0}+{a}_{1}{x}_{1}+\mathrm{\dots}+{a}_{8191}{x}_{8191}+noise$, where (${x}_{0},{x}_{1},\mathrm{\dots},{x}_{8191}$) is the input vector and $y$ is the label. An onelayer MLP is used to learn the coefficients (${a}_{0},{a}_{1},\mathrm{\dots},{a}_{8191}$) of the hyperplane. We use 8 processes with the total batch size of 2,048 to train the model for 48 epochs. To simulate the dynamic load imbalance environment, we randomly select one process out of the 8 processes at every training step to inject a certain amount of delay, according to the performance variability shown in Fig. 4. The throughput comparison between synchSGD (Deep500) and eagerSGD (using solo allreduce) is shown in Fig. 10 (top half). With 200, 300, and 400 ms load imbalance injection, eagerSGD achieves 1.50x, 1.75x, and 2.01x speedup over synchSGD, respectively. We observe that the more severe the load imbalance, the worse the performance of synchSGD because of the synchronization overhead. On the other hand, the performance of eagerSGD is stable. Given that the throughput on a single GPU node with batch size of 2,048 is 0.64 steps/s, eagerSGD with 400 ms load imbalance injection still achieves 3.8x speedup in strong scaling on 8 GPU nodes.
Fig. 10 (bottom half) presents the validation loss (mean squared error) as a function of the training time, which shows that eagerSGD using solo allreduce converges with equivalent loss value (around 4.7) to synchSGD but significantly reduces the training time. Since the processes are not severely skewed and the stale gradients are added to the next training iteration (as discussed in Section 5), using solo allreduce is enough for convergence. When using majority allreduce, the throughput of eagerSGD is lower than using solo allreduce (1.64 step/s vs 1.37 step/s with 200 ms load imbalance injection).
6.2.2. ResNet50 on ImageNet, Light Load Imbalance
Residual Network (ResNet) (He et al., 2016) is widely used in computer vision tasks. To evaluate the performance of eagerSGD, we use 64 processes with a total batch size of 8,192 to train ResNet50 on ImageNet for 90 epochs. To simulate the dynamic load imbalance environment, we randomly select 4 processes out of the 64 processes at every training step to inject a certain amount of delay, according to the performance variability on Cloud machines discussed in Section 2.3. Fig. 10(a) presents the throughput comparison between synchSGD (Horovod and Deep500) and eagerSGD using solo allreduce. With 300 and 460 ms load imbalance injection, eagerSGD achieves 1.25x and 1.23x speedup over Deep500, respectively; 1.14x and 1.22x speedup over Horovod, respectively. As the load imbalance injection increases to 460 ms, the throughput of eagerSGD decreases to 1.15 steps/s. This is because severe load imbalance leads to higher overhead in the activation phase of solo allreduce. Given that the throughput of a single GPU node with batch size of 128 is 1.56 steps/s, eagerSGD running on 64 processes with 460 ms load imbalance injection still achieves 46.9x speedup in weak scaling.
Fig. 10(b) and Fig. 10(c) present the Top1 train and test accuracy as a function of the training time, respectively. We train the model three times for each SGD, and obtain stable accuracy results. For top1 accuracy, Deep500 achieves 79.1% train accuracy and 75.7% test accuracy, Horovod achieves 79.0% train accuracy and 75.8% test accuracy, while eagerSGD using solo allreduce achieves 78.4% train accuracy and 75.2% test accuracy on average over different load imbalance injections. Note that without model synchronization at every 10 epochs, the top1 test accuracy of eagerSGD decreases to 74.1%. For top5 accuracy, synchSGD achieves 92.6% test accuracy, while eagerSGD using solo allreduce achieves 92.4% test accuracy on average. The experimental results on ResNet50 demonstrate that eagerSGD (solo) significantly improves the training speed without losing accuracy for deep neural networks in light load imbalance environment.
6.2.3. ResNet32 on Cifar10, Severe Load Imbalance
To test the robustness of eagerSGD, we train ResNet32 on Cifar10 with 8 processes for 190 epochs in a severe load imbalance environment. All 8 processes are skewed by injecting load imbalance from 50 ms to 400 ms at every training step. The injection amount over the processes is shifted after each step. Fig. 12 presents the test accuracy as a function of the training time. EagerSGD using solo allreduce has the highest training speed but with lower test accuracy. Solo allreduce only waits for the fastest process to inform the other processes to participate in allreduce, but most of them will contribute stale gradients. Majority allreduce can solve the lower accuracy problem caused by solo allreduce, which achieves approximately equivalent accuracy to synchSGD with 1.29x speedup. The results demonstrate that eagerSGD using majority allreduce is tolerant to severe load imbalance.
6.3. Case Study: Video Classification
As discussed in Section 2.1, LSTM on UCF101 for video classification has inherent workload imbalance because of different workload for different batches. We use Inception v3 (Szegedy et al., 2016), a CNN model, to extract a 2,048wide feature from each frame of the videos, and then pass the sequences of features to an LSTM model. The training time reported in the paper is only for the LSTM model, not including the preprocessing time using Inception v3.
To evaluate eagerSGD, we use 8 processes with a total batch size of 128 to train LSTM on UCF101 for 50 epochs (more information is in Table 1). Fig. 12(a) and Fig. 12(b) present the train and test accuracy as a function of the training time, respectively. For each SGD, we train the model four times and plot the curves using the average values. Colored areas around the curves are confidence intervals with the boundaries representing the standard deviation. Although eagerSGD using solo allreduce achieves 1.64x speedup over Horovod, it has lower accuracy. EagerSGD (solo) achieves on average 60.6% (up to 70.4%) top1 test accuracy while Horovod achieves on average 69.6%. This is because the workload of the video model is severely unbalanced, and solo allreduce introduces too many stale gradients. In contrast, eagerSGD using majority allreduce achieves 1.27x speedup over Horovod with equivalent accuracy. For example, Horovod achieves on average 69.6% top1 test accuracy (up to 72.2%) and 90.4% top5 test accuracy (up to 91.9%), while eagerSGD using majority allreduce achieves on average 69.7% top1 test accuracy (up to 72.8%) and 90.0% top5 test accuracy (up to 91.7%). Train accuracy results (in Fig. 12(a)) show a similar trend as the test accuracy. SynchSGD achieves on average 86.1% top1 train accuracy and on average 96.6% top5 train accuracy; eagerSGD using majority allreduce achieves on average 86.7% top1 train accuracy and on average 96.1% top5 train accuracy; eagerSGD using solo allreduce achieves on average 80.8% top1 train accuracy and on average 90.0% top5 train accuracy. All the accuracy results are consistent with that claimed in recent work (YueHei Ng et al., 2015b). The training speed and accuracy for Deep500 (not plotted in figures) are similar to Horovod. The results show that majority allreduce, with its statistical guarantee, is sufficient to both speed up training and achieve good accuracy.
The total training time for 50 epochs using a single GPU node with batch size of 16 and 128 is 34,301 seconds and 6,314 seconds, respectively. In week scaling, SynchSGD (Horovod) and eagerSGD using majority allreduce achieve 3.72x and 4.71x speedup on 8 GPU nodes, respectively. In strong scaling, synchSGD and eagerSGD using majority allreduce do not have speedup on 8 GPU nodes; in contrast, eagerSGD using solo allreduce achieves 1.12x speedup on 8 GPU nodes in strong scaling, but with lower accuracy. Note that increasing the batch size can further improve the speedup in strong scaling for eagerSGD. However, large batch sizes commonly need carefullytuned learning rate schedules to achieve good accuracy (You et al., 2018), which is out of scope.
7. Related Work
Deep learning
Parameter Server SGD implementations use synchronous (Li et al., 2014; Gupta et al., 2015), asynchronous (Dean et al., 2012; Chilimbi et al., 2014), stale (Ho et al., 2013; Zhang et al., 2015), and approximatesynchronous (Hsieh et al., 2017) SGD, where the latter two limit the age and insignificance of the gradients, respectively. In a decentralized setting, asynchrony is achieved by performing communication on an explicit subset of the learners, e.g., forming a ring (Lian et al., 2018) or a general graph (Xie et al., 2016); or on a random subset using Gossip Algorithms (Daily et al., 2018; Jin et al., 2016). These modes achieve some degree of asynchrony, but take $\mathcal{O}\left(P\right)$ or $\mathcal{O}\left(\mathrm{log}P\right)$ (for ring or gossipbased schemes, respectively) update steps to disseminate the information to all $P$ learners. To the best of our knowledge, this is the first work that implements asynchronous and stalesynchronous decentralized SGD where the messages propagate to all nodes in one step.
Collective communication
Several algorithms can be used to implement allreduce operations, and the optimal algorithm depends on network topology, number of processes, and message size (Thakur et al., 2005). For large message sizes and large number of processes, practical implementations employ the ringallreduce (Gibiansky, 2017) or the Rabenseifner’s Algorithm (Rabenseifner, 2004). Independently from the specific algorithm, the semantic of the allreduce implies processes synchronization. With eagerSGD we relax this synchronization by using solo and majority allreduce operations.
8. Conclusions
In this work, we show that load imbalance is prevalent in deep learning problems with variablelength inputs, and increasingly becoming an issue in cloud systems. To that end, we propose eagerSGD: an asynchronous decentralized version of SGD where fast processes contribute gradients without waiting for slow processes. Using the resilience of machine learning to bounded error, we implement two variants of partial collectives — solo and majority allreduce — enabling this behavior without a central parameter server.
The experimental results reaffirm our theoretical analysis, showing that eagerSGD using solo allreduce speeds up the training process in imbalanced environments (attaining up to 1.64$\times $ on UCF101) with a negligible loss of accuracy. Furthermore, as the load imbalance increases, the convergence rate degrades, in which case majority allreduce yields slightly lower performance (1.27$\times $ faster than SGD on UCF101) yet desirable generalization.
The research can extend in different directions. Firstly, the promising results make eagerSGD attractive for other applications as well, such as language models and object detection. Secondly, in order to provide different quorum sizes, it is possible to construct a spectrum between solo, majority, and full collectives. Lastly, partial collectives can be used for other algorithms beyond SGD.
Acknowledgements.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 programme (grant agreement DAPP, No. 678880). We also would like to thank the Swiss National Supercomputing Center (CSCS) for providing the computing resources and for their excellent technical support.References
 (1)
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: LargeScale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
 Alistarh et al. (2018) Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cedric Renggli. 2018. The Convergence of Sparsified Gradient Methods. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc.
 Amodei and Hernandez (2018) Dario Amodei and Danny Hernandez. 2018. AI and Compute. https://openai.com/blog/aiandcompute/.
 Awan et al. (2017) A. Awan, K. Hamidouche, J. Hashmi, and D. Panda. 2017. SCaffe: Codesigning MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters.
 Ba and Frey (2013) Jimmy Ba and Brendan Frey. 2013. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems. 3084–3092.
 Ballas et al. (2015) Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. 2015. Delving Deeper into Convolutional Networks for Learning Video Representations. arXiv eprints (2015). arXiv:1511.06432
 Barrett et al. (2018) Brian W Barrett, Ron Brightwell, , E Ryan Grant, Scott Hemmert, Kevin Pedretti, Kyle Wheeler, Keith D Underwood, R Reisen, Torsten Hoefler, Arthur B Maccabe, and Trammell Hudson. 2018. The Portals 4.2 network programming interface. Sandia National Laboratories, November 2018, Technical Report SAND20173825 (2018).
 BenNun et al. (2019) T. BenNun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler. 2019. A Modular Benchmarking Infrastructure for HighPerformance and Reproducible Deep Learning. IEEE. Accepted at the 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS’19).
 BenNun and Hoefler (2018) T. BenNun and T. Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An InDepth Concurrency Analysis. CoRR abs/1802.09941 (Feb. 2018).
 Chilimbi et al. (2014) Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Broomfield, CO, 571–582. https://www.usenix.org/conference/osdi14/technicalsessions/presentation/chilimbi
 Daily et al. (2018) Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, and Vinay Amatya. 2018. GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent. CoRR abs/1803.05880 (2018). arXiv:1803.05880 http://arxiv.org/abs/1803.05880
 Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Advances in neural information processing systems. 1223–1231.
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. 2009. Imagenet: A largescale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.
 Devlin et al. (2018) Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
 Devries and Taylor (2017) Terrance Devries and Graham W. Taylor. 2017. Improved Regularization of Convolutional Neural Networks with Cutout. CoRR abs/1708.04552 (2017). arXiv:1708.04552 http://arxiv.org/abs/1708.04552
 Di Girolamo et al. (2015) Salvatore Di Girolamo, Pierre Jolivet, Keith D Underwood, and Torsten Hoefler. 2015. Exploiting offload enabled network interfaces. In 2015 IEEE 23rd Annual Symposium on HighPerformance Interconnects. IEEE, 26–33.
 Donahue et al. (2014) Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014. Longterm Recurrent Convolutional Networks for Visual Recognition and Description. CoRR abs/1411.4389 (2014). arXiv:1411.4389 http://arxiv.org/abs/1411.4389
 Gibiansky (2017) Andrew Gibiansky. 2017. Bringing HPC techniques to deep learning.(2017). URL http://research. baidu. com/bringinghpctechniquesdeeplearning (2017).
 Gupta et al. (2015) Suyog Gupta, Wei Zhang, and Fei Wang. 2015. Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study. arXiv eprints (Sep 2015). arXiv:1509.04210
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
 Ho et al. (2013) Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, and Eric P. Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 1 (NIPS’13). Curran Associates Inc., USA, 1223–1231. http://dl.acm.org/citation.cfm?id=2999611.2999748
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long shortterm memory. Neural computation 9, 8 (1997), 1735–1780.
 Hoefler et al. (2017) Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E Grant, and Ron Brightwell. 2017. sPIN: Highperformance streaming Processing in the Network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 59.
 Hoefler et al. (2007) Torsten Hoefler, Andrew Lumsdaine, and Wolfgang Rehm. 2007. Implementation and performance analysis of nonblocking collective operations for MPI. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing. ACM, 52.
 Hoefler and Moor (2014) T. Hoefler and D. Moor. 2014. Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations. Journal of Supercomputing Frontiers and Innovations 1, 2 (Oct. 2014), 58–75.
 Hoefler et al. (2009) T. Hoefler, T. Schneider, and A. Lumsdaine. 2009. The Effect of Network Noise on LargeScale Collective Communications. Parallel Processing Letters (PPL) 19, 4 (Aug. 2009), 573–593.
 Hoefler et al. (2010) T. Hoefler, T. Schneider, and A. Lumsdaine. 2010. Characterizing the Influence of System Noise on LargeScale Applications by Simulation. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10).
 Hsieh et al. (2017) Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. 2017. Gaia: Geodistributed Machine Learning Approaching LAN Speeds. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI’17). USENIX Association, Berkeley, CA, USA, 629–647. http://dl.acm.org/citation.cfm?id=3154630.3154682
 Huang et al. (2017) G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger. 2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2261–2269. https://doi.org/10.1109/CVPR.2017.243
 Iosup et al. (2011) Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. 2011. On the performance variability of production cloud services. In 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 104–113.
 Jackson et al. (2010) Keith R Jackson, Lavanya Ramakrishnan, Krishna Muriki, Shane Canon, Shreyas Cholia, John Shalf, Harvey J Wasserman, and Nicholas J Wright. 2010. Performance analysis of high performance computing applications on the amazon web services cloud. In 2nd IEEE international conference on cloud computing technology and science. IEEE, 159–168.
 Jin et al. (2016) Peter H. Jin, Qiaochu Yuan, Forrest N. Iandola, and Kurt Keutzer. 2016. How to scale distributed deep learning? CoRR abs/1611.04581 (2016). arXiv:1611.04581 http://arxiv.org/abs/1611.04581
 Kurth et al. (2018) Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, Prabhat, and Michael Houston. 2018. Exascale Deep Learning for Climate Analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’18). IEEE Press, Piscataway, NJ, USA, Article 51, 12 pages. https://doi.org/10.1109/SC.2018.00054
 LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradientbased learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Li et al. (2014) Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and BorYiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, USA, 583–598. http://dl.acm.org/citation.cfm?id=2685048.2685095
 Lian et al. (2018) Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2018. Asynchronous Decentralized Parallel Stochastic Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 3043–3052. http://proceedings.mlr.press/v80/lian18a.html
 Mathuriya et al. (2018) Amrita Mathuriya, Deborah Bard, Peter Mendygral, Lawrence Meadows, James Arnemann, Lei Shao, Siyu He, Tuomas Kärnä, Diana Moise, Simon J. Pennycook, Kristyn Maschhoff, Jason Sewall, Nalini Kumar, Shirley Ho, Michael F. Ringenburg, Prabhat, and Victor Lee. 2018. CosmoFlow: Using Deep Learning to Learn the Universe at Scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’18). IEEE Press, Piscataway, NJ, USA, Article 65, 11 pages. https://doi.org/10.1109/SC.2018.00068
 Message Passing Interface Forum (2015) Message Passing Interface Forum. 2015. MPI: A MessagePassing Interface Standard Version 3.1.
 Patarasuk and Yuan (2009) Pitch Patarasuk and Xin Yuan. 2009. Bandwidth Optimal Allreduce Algorithms for Clusters of Workstations. J. Parallel Distrib. Comput. 69, 2 (Feb. 2009), 117–124. https://doi.org/10.1016/j.jpdc.2008.09.002
 Rabenseifner (2004) Rolf Rabenseifner. 2004. Optimization of collective reduction operations. In International Conference on Computational Science. Springer, 1–9.
 Radford et al. (2018) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language Models are Unsupervised Multitask Learners. (2018). https://d4mucfpksywv.cloudfront.net/betterlanguagemodels/languagemodels.pdf
 Recht et al. (2011) B. Recht, C. Re, S. Wright, and F. Niu. 2011. Hogwild: A LockFree Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 24. 693–701.
 Renggli et al. (2018) Cédric Renggli, Dan Alistarh, and Torsten Hoefler. 2018. SparCML: HighPerformance Sparse Communication for Machine Learning. CoRR abs/1802.08021 (2018). arXiv:1802.08021 http://arxiv.org/abs/1802.08021
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics (1951).
 Schad et al. (2010) Jörg Schad, Jens Dittrich, and JorgeArnulfo QuianéRuiz. 2010. Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proceedings of the VLDB Endowment 3, 12 (2010), 460–471.
 Seide et al. (2014) Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. 1Bit Stochastic Gradient Descent and its Application to DataParallel Distributed Training of Speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.
 Sergeev and Del Balso (2018) Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556 (2014).
 Soomro et al. (2012) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
 Strom (2015) Nikko Strom. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association.
 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2818–2826.
 Thakur et al. (2005) Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19, 1 (2005), 49–66.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
 Xie et al. (2016) Pengtao Xie, Jin Kyu Kim, Yi Zhou, Qirong Ho, Abhimanu Kumar, Yaoliang Yu, and Eric Xing. 2016. Lightercommunication Distributed Machine Learning via Sufficient Factor Broadcasting. In Proceedings of the ThirtySecond Conference on Uncertainty in Artificial Intelligence (UAI’16). AUAI Press, Arlington, Virginia, United States, 795–804. http://dl.acm.org/citation.cfm?id=3020948.3021030
 You et al. (2018) Yang You, Zhao Zhang, ChoJui Hsieh, James Demmel, and Kurt Keutzer. 2018. Imagenet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing. ACM, 1.
 YueHei Ng et al. (2015a) Joe YueHei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015a. Beyond Short Snippets: Deep Networks for Video Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 YueHei Ng et al. (2015b) Joe YueHei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015b. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4694–4702.
 Zhang et al. (2015) Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. 2015. Stalenessaware asyncsgd for distributed deep learning. arXiv preprint arXiv:1511.05950 (2015).