Gradient descent with generalized Newton's method

Abstract

We propose the generalized Newton's method (GeN) -- a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g., GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers.
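The abstract does not spell out the update rule, so the following is only an illustrative sketch of how a learning rate could be chosen from extra forward passes alone. It assumes the loss along the update direction is well approximated by a parabola, evaluates the loss at three step sizes, and takes the minimizer of the fitted parabola. The function name `fitted_learning_rate`, the parameter `probe_lr`, the fallback rule, and the toy quadratic loss are all hypothetical choices made for this example, not the paper's implementation.

```python
import numpy as np


def fitted_learning_rate(loss_fn, w, direction, probe_lr=1e-2):
    """Pick a step size along `direction` using three forward passes.

    Hypothetical helper: models loss(w - eta * direction) as a parabola in eta.
    """
    loss_minus = loss_fn(w + probe_lr * direction)  # eta = -probe_lr
    loss_zero = loss_fn(w)                          # eta = 0
    loss_plus = loss_fn(w - probe_lr * direction)   # eta = +probe_lr

    # Fit loss(eta) ~ loss_zero - slope * eta + 0.5 * curvature * eta**2
    # with central finite differences; no backward pass or Hessian is formed.
    slope = (loss_minus - loss_plus) / (2.0 * probe_lr)
    curvature = (loss_plus - 2.0 * loss_zero + loss_minus) / probe_lr**2

    # Assumed fallback: keep the probe step if the fit has no positive minimizer.
    if curvature <= 0.0 or slope <= 0.0:
        return probe_lr
    return slope / curvature


# Toy quadratic loss 0.5 * w^T A w - b^T w (an assumption for illustration).
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])


def loss_fn(w):
    return 0.5 * w @ A @ w - b @ w


w = np.zeros(2)
grad = A @ w - b                      # gradient of the toy loss at w
lr = fitted_learning_rate(loss_fn, w, grad)
w_next = w - lr * grad                # one gradient-descent step with the fitted rate
```

On this toy quadratic the fitted rate coincides with the exact line-search step, (g^T g) / (g^T A g). On a real network the parabola is only a local approximation, and the extra forward passes would presumably be run every few iterations rather than every step, in line with the amortization the abstract mentions.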
