Partially Shuffling the Training Data to Improve Language Models

  • 2019-03-11 08:20:13
  • Ofir Press

Abstract

Although SGD requires shuffling the training data between epochs, currently none of the word-level language modeling systems do this. Naively shuffling all sentences in the training data would not permit the model to learn inter-sentence dependencies. Here we present a method that partially shuffles the training data between epochs. This method makes each batch random, while keeping most sentence ordering intact. It achieves new state-of-the-art results on word-level language modeling on both the Penn Treebank and WikiText-2 datasets.
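The abstract does not spell out the mechanism, so the following is only a minimal sketch of one way to get the behavior it describes. It assumes the token stream is first split into contiguous batch rows and that each row is then rotated by a random offset at the start of every epoch. The names batchify and partial_shuffle, the toy token list, and the row count are all hypothetical and chosen for illustration.

```python
import random

def batchify(tokens, num_rows):
    """Split a token stream into num_rows contiguous rows of equal length.

    Trailing tokens that do not fill a whole row are dropped.
    """
    row_len = len(tokens) // num_rows
    return [tokens[r * row_len:(r + 1) * row_len] for r in range(num_rows)]

def partial_shuffle(rows, rng):
    """Rotate each row by its own random offset (assumed scheme).

    Tokens before the offset move to the end of the row, so the order
    inside a row is broken at one point only.
    """
    shuffled = []
    for row in rows:
        k = rng.randrange(len(row)) if row else 0
        shuffled.append(row[k:] + row[:k])
    return shuffled

# Toy corpus: integers stand in for word ids.
tokens = list(range(20))
rows = batchify(tokens, num_rows=4)

rng = random.Random(0)
for epoch in range(2):
    # Re-draw the offsets each epoch so batches differ between epochs.
    epoch_rows = partial_shuffle(rows, rng)
```

Under this assumption every row keeps its tokens in the original order except at a single cut point, which is how most sentence ordering can stay intact while the batches cut from the rows still change from epoch to epoch.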

 
