Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis

Abstract

Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. However, we still do not have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn. With this in mind, we compare four objectives (language modeling, translation, skip-thought, and autoencoding) on their ability to induce syntactic and part-of-speech information. We make a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggest that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. We also find that the representations from randomly initialized, frozen LSTMs perform strikingly well on our syntactic auxiliary tasks, but this effect disappears when the amount of training data for the auxiliary tasks is reduced.
