Abstract
Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.