Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization

Abstract

We propose ActiveLR, an optimization meta-algorithm that localizes the learning rate, $\alpha$, to each parameter and adapts it at each epoch according to whether the gradient changes sign from one epoch to the next. This sign-conscious algorithm is aware of whether the update of each parameter from the previous step to the current one has been too large or too small, and it adjusts $\alpha$ accordingly. We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief. Our experiments on ImageNet, CIFAR-10, WikiText-103, WikiText-2, and PASCAL VOC, using different model architectures such as ResNet and Transformers, show an increase in generalizability and training set fit, and a decrease in training time, for the Active variants of the tested optimizers. The results also show that the Active variants are robust to different values of the initial learning rate. Furthermore, the detrimental effects of using large mini-batch sizes are mitigated. ActiveLR thus alleviates the need for hyper-parameter search over the initial learning rate and the mini-batch size, two of the most commonly tuned hyper-parameters, both of which are costly in time and computation to pick. We encourage AI researchers and practitioners to use the Active variant of their optimizer of choice for faster training, better generalizability, and a reduced carbon footprint when training deep neural networks.
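
The sign-based idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's update rule, which the abstract does not spell out. It assumes a simple multiplicative scheme in the spirit of Rprop-style heuristics: a parameter's learning rate shrinks when its gradient flips sign (the last update likely overshot) and grows when the sign is unchanged (the last update was likely too small). The function names, the factors `grow=1.2` and `shrink=0.5`, the bounds `lr_min` and `lr_max`, and the toy quadratic objective are all hypothetical choices made for illustration. For brevity the sketch adapts at every step rather than at every epoch.

```python
def sign_aware_step(params, grads, prev_grads, lrs,
                    grow=1.2, shrink=0.5, lr_min=1e-6, lr_max=1.0):
    """One plain gradient step with a per-parameter, sign-aware learning rate.

    All factors and bounds are illustrative assumptions, not values from the paper.
    """
    new_params, new_lrs = [], []
    for p, g, g_prev, lr in zip(params, grads, prev_grads, lrs):
        agreement = g * g_prev
        if agreement < 0:
            # Gradient changed sign: the previous update was too large.
            lr = max(lr * shrink, lr_min)
        elif agreement > 0:
            # Gradient kept its sign: the previous update was too small.
            lr = min(lr * grow, lr_max)
        # agreement == 0 (e.g. on the first step): leave lr unchanged.
        new_params.append(p - lr * g)
        new_lrs.append(lr)
    return new_params, new_lrs


def minimize_quadratic(curvatures, start, lr0, steps=200):
    """Minimize f(x) = 0.5 * sum(c_i * x_i**2) with the sign-aware step."""
    params = list(start)
    lrs = [lr0] * len(params)           # same initial learning rate everywhere
    prev_grads = [0.0] * len(params)    # no gradient history before step one
    for _ in range(steps):
        grads = [c * x for c, x in zip(curvatures, params)]
        params, lrs = sign_aware_step(params, grads, prev_grads, lrs)
        prev_grads = grads
    return params, lrs


if __name__ == "__main__":
    # One flat and one steep direction; lr0 = 0.1 is too large for the steep one.
    final_params, final_lrs = minimize_quadratic([1.0, 100.0], [1.0, 1.0], lr0=0.1)
    print("parameters:", final_params)
    print("learning rates:", final_lrs)
```

On this toy problem a single fixed learning rate of 0.1 would diverge along the steep coordinate, since each step multiplies that coordinate by $1 - 0.1 \cdot 100 = -9$. In the sketch, the repeated sign flips drive that coordinate's learning rate down until the iteration contracts, while the flat coordinate's learning rate is allowed to grow. This is the kind of insensitivity to the initial $\alpha$ that the abstract claims, shown here only under the stated assumptions.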
