SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling

Abstract

With language modeling becoming a popular base task for unsupervised representation learning in Natural Language Processing, it is important to come up with new architectures and techniques for faster and better training of language models. However, due to a peculiarity of languages -- the larger the dataset, the higher the average number of times a word appears in that dataset -- datasets of different sizes have very different properties. Architectures that perform well on small datasets might not perform well on larger ones. For example, LSTM models perform well on WikiText-2 but poorly on WikiText-103, while Transformer models perform well on WikiText-103 but not on WikiText-2. For setups like architecture search, this is a challenge: running a search on the full dataset is prohibitively costly, but results on smaller datasets are not indicative of performance on the full one. In this paper, we introduce SimpleBooks, a small dataset with an average word frequency as high as that of much larger ones. Created from 1,573 Gutenberg books with the highest ratio of word-level book length to vocabulary size, SimpleBooks contains 92M word-level tokens, on par with WikiText-103 (103M tokens), but has a vocabulary of 98K words, a third of WikiText-103's. SimpleBooks can be downloaded from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip.
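The selection criterion in the abstract, the ratio of word-level book length to vocabulary size, is the same quantity as the average word frequency of a text. The sketch below illustrates that ratio and a top-k selection over a handful of texts. It is a minimal illustration and not the authors' pipeline: the lowercase whitespace tokenizer and the names `tokenize`, `length_to_vocab_ratio`, and `select_books` are assumptions introduced here, since the abstract does not describe the preprocessing.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for the real tokenizer,
    # which the abstract does not specify.
    return text.lower().split()


def length_to_vocab_ratio(text):
    """Word-level length divided by vocabulary size (the average word frequency)."""
    counts = Counter(tokenize(text))
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


def select_books(books, k):
    """Return the k titles with the highest length-to-vocabulary ratio.

    `books` maps a title to its full text (hypothetical input format).
    """
    ranked = sorted(
        books,
        key=lambda title: length_to_vocab_ratio(books[title]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    toy_books = {
        "repetitive": "the cat sat on the mat and the cat sat on the mat",
        "varied": "quick brown foxes jump over lazy dogs near quiet rivers",
    }
    for title, text in toy_books.items():
        print(title, round(length_to_vocab_ratio(text), 2))
    print(select_books(toy_books, 1))
    # Figures quoted in the abstract: 92M tokens over a 98K-word vocabulary.
    print(round(92_000_000 / 98_000))
```

With the figures quoted in the abstract, 92M tokens over a 98K-word vocabulary gives an average of roughly 939 occurrences per word.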
