Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Abstract

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
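
The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general shape of cluster-based sampling with per-cluster weights that are revised between rounds. The function names, the random embeddings, and the multiplicative weight update are all assumptions made for illustration, and the per-cluster scores are placeholders for whatever signal a training round would provide.

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels

def sample_by_cluster(labels, weights, budget, rng):
    """Draw up to roughly `budget` indices, splitting the budget across clusters by weight."""
    probs = weights / weights.sum()
    chosen = []
    for c, p in enumerate(probs):
        members = np.flatnonzero(labels == c)
        n = min(len(members), int(round(p * budget)))
        if n > 0:
            chosen.extend(rng.choice(members, size=n, replace=False).tolist())
    return chosen

def refine_weights(weights, cluster_scores):
    """Hypothetical update rule: scale each weight by its cluster score, then renormalize."""
    new = weights * cluster_scores
    return new / new.sum()

# Usage with random vectors standing in for instruction embeddings (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
labels = kmeans(X, k=4)
weights = np.full(4, 0.25)  # start with uniform cluster weights
subset = sample_by_cluster(labels, weights, budget=40, rng=rng)
# Placeholder scores: a score of 0 drops cluster 2 from the next round.
scores = np.array([1.0, 1.0, 0.0, 1.0])
weights = refine_weights(weights, scores)
next_subset = sample_by_cluster(labels, weights, budget=40, rng=rng)
```

In this sketch, a cluster whose score falls to zero receives no samples in the next round, which mirrors the abstract's point that low-quality clusters can be filtered out automatically as weights are reassessed.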
