Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Abstract

Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning, where the problem is non-convex and the gradient noise might exhibit heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a \emph{continuous-time} variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy tails of such perturbations can result in a bias even when the step-size is small, in the sense that \emph{the optima of the stationary distribution} of the dynamics might not match \emph{the optima of the cost function to be optimized}. As a remedy, we develop a novel framework, which we call \emph{fractional} ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with \emph{natural gradient} methods and \emph{gradient clipping}, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.
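
For orientation, the dynamics named in the abstract can be sketched in a standard textbook form. The notation is an assumption, since the abstract fixes none of it: $f$ is the cost function, $\gamma > 0$ a friction coefficient, $\beta > 0$ an inverse temperature, $\eta > 0$ the step-size, and $B_t$ a standard Brownian motion. With Gaussian noise, ULD is commonly written as

\begin{align*}
\mathrm{d}v_t &= -\bigl(\gamma v_t + \nabla f(x_t)\bigr)\,\mathrm{d}t + \sqrt{2\gamma/\beta}\,\mathrm{d}B_t, \\
\mathrm{d}x_t &= v_t\,\mathrm{d}t,
\end{align*}

and the $x$-marginal of its stationary distribution is the Gibbs distribution $\pi(x) \propto \exp(-\beta f(x))$, whose modes are the minimizers of $f$. One common semi-implicit Euler-type step, with $\xi_k \sim \mathcal{N}(0, I)$, reads

\begin{align*}
v_{k+1} &= (1 - \gamma\eta)\,v_k - \eta\,\nabla f(x_k) + \sqrt{2\gamma\eta/\beta}\;\xi_k, \\
x_{k+1} &= x_k + \eta\,v_{k+1},
\end{align*}

which has the shape of an SGDm update in which $1 - \gamma\eta$ plays the role of the momentum coefficient. The heavy-tailed setting corresponds, roughly, to replacing $B_t$ with an $\alpha$-stable L\'evy motion; the exact form of FULD and of its drift correction is not stated in the abstract and is not reproduced here.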
