Abstract
Distributed optimization is the standard way of speeding up machine learningtraining, and most of the research in the area focuses on distributedfirst-order, gradient-based methods. Yet, there are settings where somecomputationally-bounded nodes may not be able to implement first-order,gradient-based optimization, while they could still contribute to jointoptimization tasks. In this paper, we initiate the study of hybriddecentralized optimization, studying settings where nodes with zeroth-order andfirst-order optimization capabilities co-exist in a distributed system, andattempt to jointly solve an optimization task over some data distribution. Weessentially show that, under reasonable parameter settings, such a system cannot only withstand noisier zeroth-order agents but can even benefit fromintegrating such agents into the optimization process, rather than ignoringtheir information. At the core of our approach is a new analysis of distributedoptimization with noisy and possibly-biased gradient estimators, which may beof independent interest. Our results hold for both convex and non-convexobjectives. Experimental results on standard optimization tasks confirm ouranalysis, showing that hybrid first-zeroth order optimization can be practical,even when training deep neural networks.