Boosting Asynchronous Decentralized Learning with Model Fragmentation

Abstract

Decentralized learning (DL) is an emerging technique that allows nodes on the web to collaboratively train machine learning models without sharing raw data. Dealing with stragglers, i.e., nodes with slower compute or communication than others, is a key challenge in DL. We present DivShare, a novel asynchronous DL algorithm that achieves fast model convergence in the presence of communication stragglers. DivShare achieves this by having nodes fragment their models into parameter subsets and send, in parallel to computation, each subset to a random sample of other nodes instead of sequentially exchanging full models. The transfer of smaller fragments allows more efficient usage of the collective bandwidth and enables nodes with slow network links to quickly contribute with at least some of their model parameters. By theoretically proving the convergence of DivShare, we provide, to the best of our knowledge, the first formal proof of convergence for a DL algorithm that accounts for the effects of asynchronous communication with delays. We experimentally evaluate DivShare against two state-of-the-art DL baselines, AD-PSGD and Swift, and with two standard datasets, CIFAR-10 and MovieLens. We find that DivShare with communication stragglers lowers time-to-accuracy by up to 3.9x compared to AD-PSGD on the CIFAR-10 dataset. Compared to baselines, DivShare also achieves up to 19.4% better accuracy and 9.5% lower test loss on the CIFAR-10 and MovieLens datasets, respectively.
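
The fragmentation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the DivShare implementation: the abstract does not say how fragments are formed, how many recipients each fragment gets, or how received fragments are merged. The function names (`fragment_indices`, `plan_sends`, `merge_fragment`), the random disjoint partition, the `fanout` parameter, and the plain pairwise average are all assumptions made here for illustration.

```python
import random


def fragment_indices(num_params, num_fragments, rng):
    """Split parameter indices into disjoint fragments (assumed: random partition)."""
    idx = list(range(num_params))
    rng.shuffle(idx)
    return [sorted(idx[k::num_fragments]) for k in range(num_fragments)]


def plan_sends(node_id, num_nodes, fragments, fanout, rng):
    """For each fragment, draw `fanout` random recipients other than the sender."""
    others = [n for n in range(num_nodes) if n != node_id]
    return [(frag, rng.sample(others, fanout)) for frag in fragments]


def merge_fragment(local_params, frag, values):
    """Average received values into the local model at the fragment's indices.

    The averaging rule is an assumption; the abstract does not specify it.
    """
    for i, v in zip(frag, values):
        local_params[i] = 0.5 * (local_params[i] + v)


rng = random.Random(0)
params = [float(i) for i in range(10)]        # toy flat model of node 0
fragments = fragment_indices(len(params), 3, rng)
plan = plan_sends(0, 6, fragments, 2, rng)    # 6 nodes, 2 recipients per fragment

receiver = [0.0] * len(params)                # toy model of one recipient
first_frag = fragments[0]
merge_fragment(receiver, first_frag, [params[i] for i in first_frag])
```

In this toy setup each recipient receives only a subset of node 0's parameters, so a sender on a slow link has contributed something useful as soon as its first fragment arrives, rather than only after a full model transfer completes.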
