Dual Averaging is Surprisingly Effective for Deep Learning Optimization

Abstract

First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of the optimizer has become an ad-hoc rule that can significantly affect the performance. For instance, SGD with momentum (SGD+M) is typically used in computer vision (CV) and Adam is used for training transformer models for Natural Language Processing (NLP). Using the wrong method can lead to significant performance degradation. Inspired by the dual averaging algorithm, we propose Modernized Dual Averaging (MDA), an optimizer that is able to perform as well as SGD+M in CV and as Adam in NLP. Our method is not adaptive and is significantly simpler than Adam. We show that MDA induces a decaying uncentered $L_2$-regularization compared to vanilla SGD+M and hypothesize that this may explain why it works on NLP problems where SGD+M fails.
