Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

Abstract

Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, the FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
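
As a rough illustration of classifier-based filtering in general, the sketch below keeps only documents whose quality score clears a threshold. It is not the pipeline from the paper: `filter_corpus`, `toy_score`, and the 0.8 threshold are hypothetical names and values chosen for this example, and the scorer is a trivial stand-in for a trained classifier such as a fastText model.

```python
from typing import Callable, Iterable, List


def filter_corpus(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep documents whose quality score is at or above the threshold."""
    return [doc for doc in docs if score_fn(doc) >= threshold]


def toy_score(doc: str) -> float:
    """Hypothetical stand-in scorer: fraction of purely alphabetic words.

    A real pipeline would replace this with the positive-class probability
    from a trained quality classifier.
    """
    words = doc.split()
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)


docs = [
    "Large language models benefit from clean text",
    "$$$ 123 ### click 999 !!!",
    "",
]

# The 0.8 threshold is an arbitrary value for illustration only.
kept = filter_corpus(docs, toy_score, threshold=0.8)
print(kept)
```

Passing the scorer as a function keeps the filtering loop independent of the classifier, so a trained model could be swapped in without changing `filter_corpus`.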
