Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

  • 2025-08-21 04:25:21
  • Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler

Abstract

Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
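
To make the idea of subgroup weight exchange concrete, the following is a minimal toy sketch in Python, not the authors' implementation. It only simulates averaging model weights inside small groups of workers and leaves out the SGD steps and the wait-avoiding (asynchronous) behavior entirely. The function names and the butterfly-style grouping rule are assumptions made for illustration; the abstract does not state how groups are formed.

```python
def butterfly_groups(num_workers, group_size, step):
    """Partition worker ranks into groups for averaging round `step`.

    Assumed rule (illustrative only): ranks that differ solely in
    base-`group_size` digit number `step` land in the same group.
    """
    stride = group_size ** step
    groups = {}
    for rank in range(num_workers):
        digit = (rank // stride) % group_size
        key = rank - digit * stride  # the rank with that digit set to zero
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


def group_average(weights, groups):
    """Replace every worker's weight vector with the mean over its group."""
    new_weights = list(weights)
    for group in groups:
        dim = len(weights[group[0]])
        mean = [sum(weights[r][i] for r in group) / len(group) for i in range(dim)]
        for r in group:
            new_weights[r] = list(mean)
    return new_weights


def simulate(num_workers=8, group_size=2, rounds=3):
    """Start worker `rank` with the one-element weight [rank], then average."""
    weights = [[float(rank)] for rank in range(num_workers)]
    for step in range(rounds):
        groups = butterfly_groups(num_workers, group_size, step)
        weights = group_average(weights, groups)
    return weights


if __name__ == "__main__":
    print(simulate(rounds=1))  # workers agree only within pairs
    print(simulate(rounds=3))  # every worker holds the global mean 3.5
```

In this toy setting each round moves data only within groups of two workers, yet with 8 workers the three rounds together leave every worker holding the global mean. This is the intuition behind replacing a global allreduce in every iteration with cheaper group-level exchanges, at the cost of workers temporarily holding different models.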

 
