Improved Language Modeling by Decoding the Past

Abstract

Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our Past Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax. We also show gains by using PDR in combination with a mixture-of-softmaxes, achieving a word level perplexity of 53.8 and 60.5 on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character level language modeling. These results constitute a new state-of-the-art in their respective settings.
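
The regularizer described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the expected-embedding step, the separate `decoder` matrix, and the `pdr_weight` coefficient are all assumptions made for illustration. The sketch takes the model's predicted next-token distribution, forms the expected embedding under it, decodes that vector back to a distribution over the vocabulary, and adds the cross-entropy against the last context token to the usual next-token loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target index."""
    return -math.log(probs[target])


def pdr_loss(next_token_probs, embeddings, decoder, last_token):
    """Hypothetical past-decode term: decode the last context token
    from the predicted next-token distribution.

    embeddings: one row per vocabulary item (assumed layout).
    decoder:    one row per vocabulary item (assumed; could be tied
                to the embeddings, which this sketch does not require).
    """
    dim = len(embeddings[0])
    # Expected embedding under the predicted next-token distribution.
    expected = [
        sum(p * row[d] for p, row in zip(next_token_probs, embeddings))
        for d in range(dim)
    ]
    # Project back to vocabulary logits and score the last token.
    logits = [sum(w * e for w, e in zip(row, expected)) for row in decoder]
    return cross_entropy(softmax(logits), last_token)


def total_loss(next_token_probs, next_token, last_token,
               embeddings, decoder, pdr_weight):
    """Next-token cross-entropy plus the weighted past-decode term.
    pdr_weight is an assumed hyperparameter name."""
    lm_loss = cross_entropy(next_token_probs, next_token)
    reg = pdr_loss(next_token_probs, embeddings, decoder, last_token)
    return lm_loss + pdr_weight * reg


# Toy example: vocabulary of 3 tokens, embedding dimension 2.
probs = [0.7, 0.2, 0.1]
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
loss = total_loss(probs, next_token=0, last_token=2,
                  embeddings=emb, decoder=dec, pdr_weight=0.1)
```

Setting `pdr_weight` to zero recovers the plain language-modeling loss, so in this sketch the regularizer is purely an additive training-time term.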
