Local SGD Converges Fast and Communicates Little

  • 2018-05-24 16:38:51
  • Sebastian U. Stich
  • 4

Abstract

Mini-batch stochastic gradient descent (SGD) is the state of the art in largescale parallel machine learning, but its scalability is limited by acommunication bottleneck. Recent work proposed local SGD, i.e. running SGDindependently in parallel on different workers and averaging only once in awhile. This scheme shows promising results in practice, but eluded thoroughtheoretical analysis. We prove concise convergence rates for local SGD onconvex problems and show that it converges at the same rate as mini-batch SGDin terms of number of evaluated gradients, that is, the scheme achieves linearspeed-up in the number of workers and mini-batch size. Moreover, the number ofcommunication rounds can be reduced up to a factor of T^{1/2}---where T denotesthe number of total steps---compared to mini-batch SGD.

 

Quick Read (beta)

loading the full paper ...