Abstract
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
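
The iterative search over mixtures described above can be illustrated with a toy sketch. It is not the CLIMB implementation: it assumes the clusters already exist, replaces proxy-model training with a synthetic scoring function (`proxy_score`, hypothetical), and uses an ordinary least-squares predictor on simple features as a stand-in for whatever predictor the framework actually uses. All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def proxy_score(weights, target):
    # Hypothetical stand-in for "train a small proxy model on this mixture and
    # evaluate it": the score is higher the closer the weights are to `target`.
    return -float(np.sum((weights - target) ** 2))

def features(weights):
    # Linear and squared terms plus a bias, so least squares can fit a peak.
    return np.concatenate([weights, weights ** 2, [1.0]])

def search_mixture(num_clusters=5, iterations=3, candidates=64, top_k=8, seed=0):
    rng = np.random.default_rng(seed)
    # Toy "unknown best mixture"; in practice this is not observable.
    target = rng.dirichlet(np.ones(num_clusters))
    # Start from random mixtures over the clusters (weights sum to 1).
    to_eval = rng.dirichlet(np.ones(num_clusters), size=top_k)
    seen_w, seen_s = [], []
    for _ in range(iterations):
        # 1) Evaluate the selected mixtures with the (expensive) proxy.
        for w in to_eval:
            seen_w.append(w)
            seen_s.append(proxy_score(w, target))
        # 2) Fit a cheap predictor on every (mixture, score) pair seen so far.
        X = np.stack([features(w) for w in seen_w])
        coef, *_ = np.linalg.lstsq(X, np.array(seen_s), rcond=None)
        # 3) Sample new candidates around the best mixture found so far and
        #    keep the ones the predictor ranks highest for the next round.
        best = seen_w[int(np.argmax(seen_s))]
        pool = rng.dirichlet(best * 50 + 0.5, size=candidates)
        preds = np.stack([features(w) for w in pool]) @ coef
        to_eval = pool[np.argsort(preds)[-top_k:]]
    i = int(np.argmax(seen_s))
    return seen_w[i], seen_s[i], target

if __name__ == "__main__":
    weights, score, _ = search_mixture()
    print("best mixture:", np.round(weights, 3), "score:", round(score, 4))
```

The point of the loop is that only `top_k` mixtures per round go through the expensive evaluation, while the predictor screens the larger candidate pool cheaply.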