DataComp-LM: In search of the next generation of training sets for language models

  • 2025-04-21 18:48:15
  • Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alex

Abstract

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
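
The compute comparison with Llama 3 8B can be sanity-checked with a rough back-of-the-envelope sketch. It uses the common rule of thumb that training compute is about 6 x parameters x tokens, which the abstract itself does not state. The 7B parameter count and 2.6T token count come from the abstract; the Llama 3 8B figures (8B parameters, roughly 15T training tokens) are assumptions not given here. Under those assumptions the ratio comes out near the quoted 6.6x.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * params * tokens


# From the abstract: a 7B parameter model trained on 2.6T tokens.
dclm = train_flops(7e9, 2.6e12)

# ASSUMPTION (not stated in the abstract): Llama 3 8B as 8e9 parameters
# trained on roughly 15e12 tokens.
llama3 = train_flops(8e9, 15e12)

# How many times more compute the assumed Llama 3 8B run used.
ratio = llama3 / dclm

print(f"DCLM-Baseline 7B:     {dclm:.3e} FLOPs")
print(f"Llama 3 8B (assumed): {llama3:.3e} FLOPs")
print(f"compute ratio:        {ratio:.1f}x")
```

This is only an estimate: the 6 x N x D approximation ignores architecture details, and the paper may account for compute differently.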

 
