Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging

  • 2022-09-29 18:46:13
  • Jean Kaddour

Abstract

Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the results and facilitate research on reusing historical weights for faster convergence.
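
The averaging step described in the abstract can be illustrated with a minimal sketch. It is not the paper's released code: the class name `LatestWeightAverager`, its methods, and the representation of a checkpoint as a dict of plain float lists are all assumptions made for illustration. It keeps only the k most recent end-of-epoch checkpoints and returns their element-wise mean.

```python
from collections import deque


class LatestWeightAverager:
    """Hypothetical helper: keeps the k latest checkpoints and averages them."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        # A bounded deque drops the oldest checkpoint once k are stored.
        self.checkpoints = deque(maxlen=k)

    def add(self, weights):
        # Copy the values so later in-place updates do not alter the history.
        self.checkpoints.append(
            {name: list(values) for name, values in weights.items()}
        )

    def average(self):
        if not self.checkpoints:
            raise ValueError("no checkpoints collected yet")
        n = len(self.checkpoints)
        first = self.checkpoints[0]
        # Element-wise mean over the stored checkpoints, per parameter name.
        return {
            name: [
                sum(ckpt[name][i] for ckpt in self.checkpoints) / n
                for i in range(len(first[name]))
            ]
            for name in first
        }


# Toy trajectory: three "epochs" of a single two-element parameter.
averager = LatestWeightAverager(k=2)
for epoch_weights in ({"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}, {"w": [4.0, 8.0]}):
    averager.add(epoch_weights)  # called once at the end of each epoch

averaged = averager.average()  # mean of the two latest checkpoints only
```

In a real training run the checkpoints would be model parameter tensors, and the averaged weights would be used for evaluation while training continues from the unaveraged weights; that usage is an assumption here, not something the abstract specifies.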
