The Geometry of Sign Gradient Descent

Abstract

Sign-based optimization methods have become popular in machine learning dueto their favorable communication cost in distributed optimization and theirsurprisingly good performance in neural network training. Furthermore, they areclosely connected to so-called adaptive gradient methods like Adam. Recentworks on signSGD have used a non-standard "separable smoothness" assumption,whereas some older works study sign gradient descent as steepest descent withrespect to the $\ell_\infty$-norm. In this work, we unify these existingresults by showing a close connection between separable smoothness and$\ell_\infty$-smoothness and argue that the latter is the weaker and morenatural assumption. We then proceed to study the smoothness constant withrespect to the $\ell_\infty$-norm and thereby isolate geometric properties ofthe objective function which affect the performance of sign-based methods. Inshort, we find sign-based methods to be preferable over gradient descent if (i)the Hessian is to some degree concentrated on its diagonal, and (ii) itsmaximal eigenvalue is much larger than the average eigenvalue. Both propertiesare common in deep networks.

Quick Read (beta)

loading the full paper ...