Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging

Abstract

Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the results and facilitate research on reusing historical weights for faster convergence.
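
To make the idea concrete, here is a minimal sketch of averaging the k latest end-of-epoch checkpoints. It is not the authors' released code: the class name `LatestWeightAverager`, its methods, and the representation of a checkpoint as a dictionary of flat float lists are all hypothetical, and a uniform (unweighted) mean over the window is assumed.

```python
from collections import deque
from typing import Deque, Dict, List

# Assumed representation: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


class LatestWeightAverager:
    """Keeps the k most recent checkpoints and returns their element-wise mean."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        # A bounded deque drops the oldest checkpoint once k are stored.
        self.window: Deque[Weights] = deque(maxlen=k)

    def add_checkpoint(self, weights: Weights) -> None:
        # Store a copy so later in-place updates to the model do not leak in.
        self.window.append({name: list(vals) for name, vals in weights.items()})

    def averaged(self) -> Weights:
        if not self.window:
            raise ValueError("no checkpoints collected yet")
        n = len(self.window)
        # Uniform mean over the stored checkpoints (an assumption of this sketch).
        return {
            name: [sum(vals) / n
                   for vals in zip(*(ckpt[name] for ckpt in self.window))]
            for name in self.window[0]
        }


# Usage: call add_checkpoint once per epoch, then evaluate the averaged weights.
averager = LatestWeightAverager(k=2)
for epoch_weights in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    averager.add_checkpoint({"layer.weight": epoch_weights})
print(averager.averaged())  # {'layer.weight': [4.0, 5.0]}
```

With k=2, only the last two checkpoints remain in the window, so the result is their mean. In a real training loop the same windowing would apply to the model's parameter tensors rather than Python lists.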
