Abstract
Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019), which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo's pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.