Partially Shuffling the Training Data to Improve Language Models

  • 2019-03-12 04:59:04
  • Ofir Press
Although SGD requires shuffling the training data between epochs, currently none of the word-level language modeling systems do this. Naively shuffling all sentences in the training data would not permit the model to learn inter-sentence dependencies. Here we present a method that partially shuffles the training data between epochs. This method makes each batch random, while keeping most sentence ordering intact. It achieves new state-of-the-art results on word-level language modeling on both the Penn Treebank and WikiText-2 datasets.
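
The abstract does not spell out the shuffling mechanism. As a rough illustration only, the sketch below shows one way to get the described behavior, under two assumptions: the corpus is laid out as contiguous rows (one per batch position), and each row is rotated at an independently chosen random split point before every epoch. The names `batchify` and `partial_shuffle` are hypothetical and chosen for this example.

```python
import random


def batchify(tokens, num_rows):
    """Split a token sequence into num_rows contiguous rows of equal length.

    Trailing tokens that do not fill a complete row are dropped.
    """
    row_len = len(tokens) // num_rows
    return [tokens[i * row_len:(i + 1) * row_len] for i in range(num_rows)]


def partial_shuffle(rows, rng):
    """Rotate every row at its own random split point (assumed mechanism).

    Tokens after the split move to the front of the row. Token order inside
    each row is preserved except at the single point where the row wraps.
    """
    shuffled = []
    for row in rows:
        split = rng.randrange(len(row))
        shuffled.append(row[split:] + row[:split])
    return shuffled


rng = random.Random(0)          # fixed seed so the example is repeatable
tokens = list(range(20))        # stand-in for a tokenized corpus
rows = batchify(tokens, num_rows=4)
new_rows = partial_shuffle(rows, rng)  # would be called once per epoch
```

Because each row gets a different offset, the tokens that line up in the same column (and so land in the same batch) change from epoch to epoch, while almost all neighboring tokens stay adjacent.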


