L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

Abstract

Data-parallel distributed training of deep neural networks (DNNs) has gained very widespread adoption, but can still experience communication bottlenecks. To address this issue, entire families of compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, although layers are heterogeneous in terms of parameter count and their impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model's layers dynamically during training, improving the overall compression and yielding substantial speedups without sacrificing accuracy. Our framework, called L-GreCo, is based on an adaptive algorithm that automatically picks the optimal compression parameters for model layers, guaranteeing the best compression ratio while satisfying an error constraint. Extensive experiments over image classification and language modeling tasks show that L-GreCo is effective across all existing families of compression methods, and achieves up to 2.5$\times$ training speedup and up to 5$\times$ compression improvement over efficient implementations of existing approaches, while recovering full accuracy. Moreover, L-GreCo is complementary to existing adaptive algorithms, improving their compression ratio by 50% and practical throughput by 66%.
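
To make the layerwise selection problem concrete, the following is a minimal sketch of the kind of constrained choice the abstract describes: each layer gets one compression option, total compressed size is minimized, and the summed error must stay within a budget. It is an illustration only, not the L-GreCo algorithm itself; the abstract does not specify the search procedure, so this sketch substitutes a plain exhaustive search, and the function name `pick_layer_options`, the additive integer error model, and the toy numbers are all assumptions.

```python
from itertools import product


def pick_layer_options(options_per_layer, max_total_error):
    """Pick one (size, error) option per layer by exhaustive search.

    options_per_layer: list (one entry per layer) of lists of
        (compressed_size, error) tuples. Both values are hypothetical
        integer units, and errors are assumed to add up across layers.
    Returns (choice, total_size), where choice holds one option index per
    layer, or (None, None) if no combination meets the error budget.
    """
    best_choice, best_size = None, None
    index_ranges = (range(len(opts)) for opts in options_per_layer)
    for choice in product(*index_ranges):
        size = sum(options_per_layer[i][c][0] for i, c in enumerate(choice))
        error = sum(options_per_layer[i][c][1] for i, c in enumerate(choice))
        if error <= max_total_error and (best_size is None or size < best_size):
            best_choice, best_size = choice, size
    return best_choice, best_size


# Toy model (made-up numbers): a large layer and a small layer, each with
# three compression levels from mildest (index 0) to most aggressive (index 2).
layers = [
    [(100, 0), (50, 2), (25, 5)],  # large layer
    [(10, 0), (5, 3), (2, 6)],     # small layer
]
choice, size = pick_layer_options(layers, max_total_error=5)
```

In this toy setting, applying the same level to both layers fits the budget only at level 1 or below, giving a total size of 55 at best, whereas the layerwise choice compresses the large layer aggressively, leaves the small one uncompressed, and reaches a total size of 35 under the same error budget. Exhaustive search grows exponentially with the number of layers, so it is usable only for small examples like this one.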
