DataComp-LM: In search of the next generation of training sets for language models

Abstract

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
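
To make the curation strategies named above concrete, the following is a minimal sketch of two of them, deduplication followed by score-based filtering. It is an illustration under stated assumptions, not the DCLM pipeline: the function names (`dedup_exact`, `filter_by_score`, `toy_quality_score`) are hypothetical, deduplication here is exact-match on normalized text only, and the scoring function is a toy heuristic standing in for a learned quality classifier.

```python
import hashlib
from typing import Callable, Iterable, List


def dedup_exact(docs: Iterable[str]) -> List[str]:
    """Drop exact duplicates after lowercasing and collapsing whitespace.

    Keeps the first occurrence of each document. This is a simplification;
    real pipelines typically also remove near-duplicates.
    """
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def filter_by_score(
    docs: List[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Keep the top `keep_fraction` of documents ranked by `score_fn`."""
    if not docs:
        return []
    ranked = sorted(docs, key=score_fn, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def toy_quality_score(doc: str) -> float:
    """Hypothetical stand-in for a learned quality model.

    Returns the fraction of distinct whitespace-separated tokens, so
    highly repetitive text scores low.
    """
    words = doc.split()
    return len(set(words)) / len(words) if words else 0.0


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "the cat  sat on the mat.",  # duplicate after normalization
        "buy buy buy buy now",
        "Language models learn from curated text.",
    ]
    unique_docs = dedup_exact(corpus)
    curated = filter_by_score(unique_docs, toy_quality_score, keep_fraction=0.7)
    for doc in curated:
        print(doc)
```

In a model-based filtering setup, `toy_quality_score` would be replaced by the output of a trained classifier, and `keep_fraction` becomes the main knob trading dataset size against quality.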
