When, Where and Why to Average Weights?

Abstract

Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf \citep{dahl_benchmarking_2023}, a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while mildly improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing and show how to optimally combine the two to achieve the best performance.
