Variance Reduction in Deep Learning: More Momentum is All You Need

Abstract

Variance reduction (VR) techniques have contributed significantly to accelerating learning with massive datasets in the smooth and strongly convex setting (Schmidt et al., 2017; Johnson & Zhang, 2013; Roux et al., 2012). However, such techniques have not yet met the same success in the realm of large-scale deep learning due to various factors such as the use of data augmentation or regularization methods like dropout (Defazio & Bottou, 2019). This challenge has recently motivated the design of novel variance reduction techniques tailored explicitly for deep learning (Arnold et al., 2019; Ma & Yarats, 2018). This work is an additional step in this direction. In particular, we exploit the ubiquitous clustering structure of rich datasets used in deep learning to design a family of scalable variance-reduced optimization procedures by combining existing optimizers (e.g., SGD+Momentum, Quasi-Hyperbolic Momentum, Implicit Gradient Transport) with a multi-momentum strategy (Yuan et al., 2019). Our proposal leads to faster convergence than vanilla methods on standard benchmark datasets (e.g., CIFAR and ImageNet). It is robust to label noise and amenable to distributed optimization. We provide a parallel implementation in JAX.
