Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models

Abstract

Although masked language models are highly performant and widely adopted by NLP practitioners, they cannot be easily used for autoregressive language modelling (next word prediction and sequence probability estimation). We present an LSTM-based autoregressive language model which uses prefix embeddings (from a pretrained masked language model) via fusion (e.g. concatenation) to obtain a richer context representation for language modelling. We find that fusion reliably lowers the perplexity (16.74 $\rightarrow$ 15.80), an improvement that is preserved even after transfer to a dataset from a different domain than the training data. We also evaluate the best-performing fusion model by correlating its next word surprisal estimates with human reading times. Contrary to our expectation, and despite the overall improvement in perplexity, the correlation remains the same as for the baseline model. Lastly, while we focus on language models pre-trained on text as the sources for the fusion, our approach can possibly be extended to fuse any information represented as a fixed-size vector into an autoregressive language model. This includes, e.g., sentence-external information retrieved from a knowledge base or representations from multi-modal encoders.
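
As a rough illustration of the quantities named in the abstract, the following minimal sketch shows concatenation fusion of a context vector with a fixed-size prefix embedding, followed by next word surprisal and perplexity. It is not the authors' implementation: the point at which fusion is applied (here, the LSTM output just before the vocabulary projection), all function names, the toy dimensions, and the numeric values are assumptions made for illustration, and the LSTM and the masked language model themselves are replaced by hand-written vectors.

```python
import math

def concat_fusion(hidden, prefix_embedding):
    """Fuse a context vector with a fixed-size prefix embedding by concatenation."""
    return list(hidden) + list(prefix_embedding)

def next_word_distribution(fused, weights, bias):
    """Project the fused vector to vocabulary logits and apply a softmax."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def surprisal(probs, word_id):
    """Surprisal of the observed word in bits: -log2 p(word | prefix)."""
    return -math.log2(probs[word_id])

def perplexity(word_probs):
    """Perplexity: exp of the mean negative log-likelihood of the observed words."""
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

# Toy setup (all values hypothetical): hidden size 2, embedding size 2, vocabulary size 3.
hidden = [0.5, -0.1]            # stand-in for an LSTM state over the prefix
prefix_embedding = [0.2, 0.3]   # stand-in for a masked-LM embedding of the prefix
weights = [
    [0.1, 0.0, 0.4, -0.2],
    [-0.3, 0.2, 0.0, 0.1],
    [0.0, 0.5, -0.1, 0.3],
]
bias = [0.0, 0.1, -0.1]

fused = concat_fusion(hidden, prefix_embedding)          # length 2 + 2 = 4
probs = next_word_distribution(fused, weights, bias)     # distribution over 3 words
observed_surprisal = surprisal(probs, word_id=1)
```

In this sketch the only change relative to an unfused baseline is that the projection consumes a longer input vector, which is why the same pattern would apply to any other fixed-size vector, such as a retrieved knowledge-base representation.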
