Averaging Weights Leads to Wider Optima and Better Generalization

  • 2018-03-14 17:09:27
  • Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson
  • 129

Abstract

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
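
The averaging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sgd_with_swa`, `noisy_quadratic_grad`), the toy quadratic loss, the constant learning rate, and the `swa_start` parameter are all assumptions made for illustration. The sketch runs plain SGD and keeps a running mean of the iterates visited after a chosen step, which is equivalent to simple averaging of those points.

```python
import random


def sgd_with_swa(grad_fn, w0, lr=0.1, steps=200, swa_start=100, seed=0):
    """Constant-learning-rate SGD that also tracks a running average of iterates.

    grad_fn(w, rng) returns a (possibly noisy) gradient as a list of floats.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    w = list(w0)
    w_swa = None      # averaged weights; stays None if averaging never starts
    n_averaged = 0    # number of iterates folded into the average so far

    for step in range(steps):
        g = grad_fn(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # ordinary SGD step

        if step >= swa_start:
            if w_swa is None:
                w_swa = list(w)  # first averaged point
            else:
                # Running mean: new_avg = (old_avg * n + w) / (n + 1)
                w_swa = [(a * n_averaged + wi) / (n_averaged + 1)
                         for a, wi in zip(w_swa, w)]
            n_averaged += 1

    return w, w_swa, n_averaged


def noisy_quadratic_grad(w, rng, noise=1.0):
    """Gradient of the toy loss 0.5 * ||w||^2 plus Gaussian noise (an assumption)."""
    return [wi + rng.gauss(0.0, noise) for wi in w]


if __name__ == "__main__":
    w_last, w_avg, n = sgd_with_swa(noisy_quadratic_grad, [5.0, -3.0])
    print("last SGD iterate:", w_last)
    print("averaged weights:", w_avg, "from", n, "points")
```

The only extra state is one additional copy of the weights and a counter, which is consistent with the abstract's remark about low computational overhead. For a real network, any batch-normalization statistics would presumably need to be recomputed for the averaged weights; that step is omitted from this toy example.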
