On the Convergence of FedAvg on Non-IID Data

Abstract

Federated learning makes a large amount of edge computing devices jointlylearn a model without data sharing. As a leading algorithm in this setting,Federated Averaging (\texttt{FedAvg}) runs Stochastic Gradient Descent (SGD) inparallel on a small subset of the total devices and averages the sequences onlyonce in a while. Despite its simplicity, it lacks theoretical guarantees underrealistic settings. In this paper, we analyze the convergence of\texttt{FedAvg} on non-iid data and establish a convergence rate of$\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$is the iteration number of SGDs. Importantly, our bound demonstrates atrade-off between communication-efficiency and convergence rate. As userdevices may be disconnected from the server, we relax the assumption of fulldevice participation to partial device participation and study differentaveraging schemes; low device participation rate can be achieved withoutseverely slowing down the learning. Our results indicate that heterogeneity ofdata slows down the convergence, which matches empirical observations.Furthermore, we provide a necessary condition for \texttt{FedAvg}'s convergenceon non-iid data: the learning rate $\eta$ must decay, even if full-gradient isused; otherwise, the solution will be $\Omega (\eta)$ away from the optimal.

Quick Read (beta)

loading the full paper ...