MindFlayer SGD: Efficient Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times

Abstract

We investigate the problem of minimizing the expectation of smooth nonconvex functions in a distributed setting with multiple parallel workers that are able to compute stochastic gradients. A significant challenge in this context is the presence of arbitrarily heterogeneous and stochastic compute times among workers, which can severely degrade the performance of existing parallel stochastic gradient descent (SGD) methods. While some parallel SGD algorithms achieve optimal performance under deterministic but heterogeneous delays, their effectiveness diminishes when compute times are random - a scenario not explicitly addressed in their design. To bridge this gap, we introduce MindFlayer SGD, a novel parallel SGD method specifically designed to handle stochastic and heterogeneous compute times. Through theoretical analysis and empirical evaluation, we demonstrate that MindFlayer SGD consistently outperforms existing baselines, particularly in environments with heavy-tailed noise. Our results highlight its robustness and scalability, making it a compelling choice for large-scale distributed learning tasks.
