NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizers

Abstract

Classical machine learning models such as deep neural networks are usually trained by using Stochastic Gradient Descent-based (SGD) algorithms. The classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively in the case of the minimization of a quadratic function. This analysis allows us to come up with an optimal step size (or learning rate) in terms of rate of convergence while ensuring the stability of NAG-GS. This is achieved by the careful analysis of the spectral radius of the iteration matrix and the covariance matrix at stationarity with respect to all hyperparameters of our method. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models such as the logistic regression model, the residual networks models on standard computer vision datasets, and Transformers in the frame of the GLUE benchmark.
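
To make the spectral-radius argument concrete, the sketch below works through the same kind of analysis for plain (deterministic) gradient descent on a diagonal quadratic. It is not the NAG-GS update, which the abstract does not spell out. The function names and the example eigenvalues are illustrative assumptions. For f(x) = 0.5 * x^T A x, gradient descent iterates x_{k+1} = (I - step * A) x_k, so the method is stable when the spectral radius of I - step * A is below 1, and the rate-optimal constant step is 2 / (mu + L), with mu and L the smallest and largest eigenvalues of A.

```python
# Illustrative sketch only: plain gradient descent on a quadratic,
# NOT the NAG-GS scheme. All names below are hypothetical.

def gd_spectral_radius(eigenvalues, step):
    """Spectral radius of I - step * A, given the eigenvalues of symmetric A."""
    return max(abs(1.0 - step * lam) for lam in eigenvalues)


def optimal_gd_step(eigenvalues):
    """Step size minimizing the spectral radius: 2 / (mu + L)."""
    mu, big_l = min(eigenvalues), max(eigenvalues)
    return 2.0 / (mu + big_l)


def run_gd(eigenvalues, step, x0, n_steps):
    """Gradient descent on f(x) = 0.5 * sum(lam_i * x_i**2), i.e. diagonal A."""
    x = list(x0)
    for _ in range(n_steps):
        # The gradient of f at x is (lam_i * x_i)_i.
        x = [xi - step * lam * xi for xi, lam in zip(x, eigenvalues)]
    return x


eigs = [1.0, 10.0]                        # assumed spectrum: mu = 1, L = 10
alpha = optimal_gd_step(eigs)             # 2 / 11
rho = gd_spectral_radius(eigs, alpha)     # (L - mu) / (L + mu) = 9 / 11
rho_unstable = gd_spectral_radius(eigs, 0.25)  # step > 2 / L, so radius > 1
x_final = run_gd(eigs, alpha, [1.0, 1.0], 50)
```

With the optimal step, every coordinate contracts by the factor rho at each iteration, while any step larger than 2 / L pushes the spectral radius above 1 and the iterates diverge. The paper carries out the analogous analysis for the NAG-GS iteration matrix and, because the method is stochastic, also for the covariance matrix at stationarity.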
