Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Abstract

As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that the choice of augmentation method and other design choices matter substantially more in low- to mid-data regimes than in high-data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.
