ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws

Abstract

High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as reference, which can introduce potential bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of the reference dataset in the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. By training models with 1.3B parameters on the same data source processed by various quality filters, we find that ScalingFilter can improve the zero-shot performance of pre-trained models on downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations. Extensive experiments reveal that semantic diversity is a reliable indicator of dataset diversity, and that ScalingFilter achieves an optimal balance between downstream performance and semantic diversity.
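
The core idea of scoring a document by the perplexity gap between two models can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the code below assumes one plausible form: the difference of log-perplexities between a smaller and a larger model (equivalently, the log of their perplexity ratio). The function names, the per-token log-probability inputs, and the top-fraction selection rule are all hypothetical choices made for illustration, not the paper's implementation.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities of one document."""
    if not token_logprobs:
        raise ValueError("document has no tokens")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_gap(small_logprobs: Sequence[float],
                   large_logprobs: Sequence[float]) -> float:
    """Assumed score: log PPL(small model) - log PPL(large model).

    A larger value means the bigger model improves more over the smaller
    one on this document. This exact form is an assumption of the sketch.
    """
    return math.log(perplexity(small_logprobs)) - math.log(perplexity(large_logprobs))


def keep_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return the indices (in original order) of the highest-scoring documents."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

In practice the per-token log-probabilities would come from evaluating each document under the two language models. Note that no reference dataset appears anywhere in the computation, which is the property the abstract emphasizes.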
