rTop-k: A Statistical Estimation Approach to Distributed SGD

Abstract

The large communication cost for exchanging gradients between different nodessignificantly limits the scalability of distributed training for large-scalelearning models. Motivated by this observation, there has been significantrecent interest in techniques that reduce the communication cost of distributedStochastic Gradient Descent (SGD), with gradient sparsification techniques suchas top-k and random-k shown to be particularly effective. The same observationhas also motivated a separate line of work in distributed statisticalestimation theory focusing on the impact of communication constraints on theestimation efficiency of different statistical models. The primary goal of thispaper is to connect these two research lines and demonstrate how statisticalestimation models and their analysis can lead to new insights in the design ofcommunication-efficient training techniques. We propose a simple statisticalestimation model for the stochastic gradients which captures the sparsity andskewness of their distribution. The statistically optimal communication schemearising from the analysis of this model leads to a new sparsification techniquefor SGD, which concatenates random-k and top-k, considered separately in theprior literature. We show through extensive experiments on both image andlanguage domains with CIFAR-10, ImageNet, and Penn Treebank datasets that theconcatenated application of these two sparsification methods consistently andsignificantly outperforms either method applied alone.

Quick Read (beta)

loading the full paper ...