Abstract
Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the results and facilitate research on reusing historical weights for faster convergence.
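As a rough illustration of the averaging step described above, the following minimal sketch keeps a sliding window of the k most recent end-of-epoch checkpoints and returns their element-wise mean. It is not the released code: the names `Weights`, `average_checkpoints`, and `LatestKAverager` are hypothetical, weights are represented as plain Python lists rather than framework tensors, and a uniform (unweighted) mean is assumed. How the averaged weights are then used during training is not specified in the abstract and is left out.

```python
from collections import deque
from typing import Deque, Dict, List

# Assumption: a checkpoint maps each parameter name to its flattened values.
Weights = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Weights]) -> Weights:
    """Return the element-wise mean of a non-empty list of checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


class LatestKAverager:
    """Hypothetical helper that averages the k latest checkpoints."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        # deque(maxlen=k) silently drops the oldest checkpoint when full.
        self.window: Deque[Weights] = deque(maxlen=k)

    def update(self, weights: Weights) -> Weights:
        """Record an end-of-epoch checkpoint and return the current average."""
        # Store a copy so later in-place changes by the caller do not leak in.
        self.window.append({name: list(vals) for name, vals in weights.items()})
        return average_checkpoints(list(self.window))


# Usage: three epochs with k = 2, so only the last two checkpoints are averaged.
averager = LatestKAverager(k=2)
averager.update({"w": [1.0, 2.0]})
averager.update({"w": [3.0, 4.0]})
averaged = averager.update({"w": [5.0, 6.0]})
print(averaged)  # {'w': [4.0, 5.0]}
```

With real models, the same idea would apply to each tensor in a framework state dictionary. Before the window fills, the sketch simply averages however many checkpoints are available.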