Neograd: gradient descent with an adaptive learning rate

Abstract

Since its inception by Cauchy in 1847, the gradient descent algorithm has lacked guidance on how to set the learning rate efficiently. This paper identifies a concept, defines metrics, and introduces algorithms to provide such guidance. The result is a family of algorithms (Neograd) based on a {\em constant $\rho$ ansatz}, where $\rho$ is a metric based on the error of the updates. This allows one to adjust the learning rate at each step, using a formulaic estimate based on $\rho$. It is no longer necessary to do trial runs beforehand to estimate a single learning rate for an entire optimization run. The additional cost of computing this metric is trivial. One member of this family of algorithms, NeogradM, can quickly reach much lower cost function values than other first-order algorithms. Comparisons are made mainly between NeogradM and Adam on an array of test functions and on a neural network model for identifying hand-written digits. The results show great performance improvements with NeogradM.
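
To make the idea of adjusting the learning rate from an update-error metric concrete, the following is a minimal sketch, not the paper's algorithm. The abstract does not define $\rho$, so the sketch assumes it is the gap between the actual new cost and its first-order prediction, divided by the actual change in cost, and it assumes the learning rate is rescaled by the ratio of a target value to the measured $\rho$. The names `adaptive_gd` and `rho_target`, the quadratic test cost, and the default values are all hypothetical choices made for illustration.

```python
def cost(x, a):
    """Quadratic test cost f(x) = 0.5 * sum(a_i * x_i^2) (illustrative only)."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x, a):
    """Gradient of the quadratic test cost."""
    return [ai * xi for ai, xi in zip(a, x)]


def adaptive_gd(x, a, alpha=0.01, rho_target=0.1, steps=200):
    """Gradient descent that rescales alpha to hold an error metric near rho_target.

    ASSUMPTION: rho = |f_new - f_est| / |f_old - f_new|, where f_est is the
    first-order (linear) prediction of the cost after the step. The abstract
    only says rho is based on the error of the updates; this definition and
    the update rule below are stand-ins for illustration.
    """
    history = [cost(x, a)]
    for _ in range(steps):
        g = grad(x, a)
        f_old = history[-1]

        # Plain gradient descent step with the current learning rate.
        x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
        f_new = cost(x_new, a)

        # Linear prediction of the new cost: f_old + g . dx, with dx = -alpha * g.
        f_est = f_old - alpha * sum(gi * gi for gi in g)

        change = abs(f_old - f_new)
        error = abs(f_new - f_est)

        # The step is always accepted here; a practical method would add safeguards.
        x = x_new
        history.append(f_new)

        # ASSUMPTION: rho grows roughly linearly with alpha for small steps,
        # so scaling alpha by rho_target / rho moves rho toward the target.
        if change > 0.0 and error > 0.0:
            rho = error / change
            alpha *= rho_target / rho
    return x, alpha, history


if __name__ == "__main__":
    x_final, alpha_final, history = adaptive_gd([1.0, 1.0], [1.0, 10.0])
    print("initial cost:", history[0])
    print("final cost:", history[-1])
    print("final learning rate:", alpha_final)
```

In this sketch no trial runs are needed to pick the learning rate: the starting value is only a rough guess, and it is corrected after every step from quantities that are already computed (the old cost, the new cost, and the gradient), which is consistent with the abstract's remark that the extra cost is trivial.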
