Learning-Rate-Free Learning by D-Adaptation

Abstract

The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance $D$ from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available (https://github.com/facebookresearch/dadaptation).
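
To make the dependence on $D$ concrete, here is a minimal sketch of plain subgradient descent on a toy problem. It is not the D-Adaptation method itself. It assumes the textbook constant step size $D / (G \sqrt{n})$ for a convex $G$-Lipschitz function, which the abstract does not state explicitly. The objective `f`, the helper `subgradient_descent`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def subgradient_descent(subgrad, x0, D, G, n):
    """Run n subgradient steps with the constant step D / (G * sqrt(n)).

    Returns the average of the iterates x_0, ..., x_{n-1}.
    Assumption: D (distance from x0 to the solution) and G (Lipschitz
    constant) are known in advance, which is exactly the knowledge a
    learning-rate-free method aims to avoid.
    """
    step = D / (G * math.sqrt(n))
    x, total = x0, 0.0
    for _ in range(n):
        total += x                # accumulate x_k before updating it
        x -= step * subgrad(x)    # subgradient step
    return total / n


def f(x):
    # Hypothetical toy objective: convex, 1-Lipschitz, minimized at x = 3.
    return abs(x - 3.0)


def subgrad(x):
    # A valid subgradient of f: the sign of (x - 3), taken as 0 at the kink.
    return (x > 3.0) - (x < 3.0)


n = 100
D, G = 3.0, 1.0                   # |x0 - x*| = 3 and |f'| <= 1
x_bar = subgradient_descent(subgrad, x0=0.0, D=D, G=G, n=n)
gap = f(x_bar)                    # suboptimality, since min f = 0
bound = D * G / math.sqrt(n)      # classical guarantee for this step size
print(gap <= bound)
```

Under these assumptions the average iterate satisfies $f(\bar{x}_n) - f_* \le DG/\sqrt{n}$, but the step size can only be set this way when $D$ is known in advance.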
