Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Abstract

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.
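To make the idea of "performance of the best-found model as a function of the number of fine-tuning trials" concrete, the sketch below shows one common way such a curve can be estimated from a pool of completed trials. It assumes that a budget of n trials can be modeled as n independent draws (with replacement) from the empirical distribution of observed validation scores. This is an illustrative assumption and not necessarily the exact estimator used in the paper; the function name `expected_best_of_n` and the example scores are hypothetical.

```python
def expected_best_of_n(scores, n):
    """Expected maximum of n scores drawn i.i.d. (with replacement)
    from the empirical distribution of `scores`.

    Assumption: each fine-tuning trial is an exchangeable draw from the
    same distribution, so only the sorted scores matter.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not scores:
        raise ValueError("scores must be non-empty")

    ordered = sorted(scores)
    total = len(ordered)
    expected = 0.0
    for rank, value in enumerate(ordered, start=1):
        # Probability that the maximum of n draws is the rank-th smallest
        # score: P(max <= rank-th) - P(max <= (rank-1)-th).
        prob = (rank / total) ** n - ((rank - 1) / total) ** n
        expected += value * prob
    return expected


# Hypothetical validation accuracies from five trials (made-up numbers,
# not results from the paper).
example_scores = [0.61, 0.84, 0.88, 0.90, 0.91]
curve = [expected_best_of_n(example_scores, n) for n in (1, 2, 5, 10)]
```

With a single trial the estimate equals the mean of the observed scores, and it rises toward the best observed score as the trial budget grows, which is the shape of curve the abstract refers to.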
