Abstract
We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example, removing from C4 a single 61-word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.