MARS: Unleashing the Power of Variance Reduction for Training Large Models

Abstract

Training deep neural networks--and more recently, large models--demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
