Abstract
Large language models (LLMs) strengthen their instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and in a controlled human evaluation. Its 13B variant matches $>90\%$ of the performance of its teacher LLM (i.e., Text-Davinci-003, which generated the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments show the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at: https://lichang-chen.github.io/AlpaGasus/
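
To make the selection strategy concrete, the following is a minimal sketch of score-based filtering, not the exact pipeline used for AlpaGasus. It assumes each instruction/response pair receives a numeric quality score from a grader and is kept only if the score reaches a threshold. The names `filter_by_score` and `toy_score`, the 0-5 score scale, and the threshold of 4.0 are all hypothetical; in practice the scorer would wrap a call to a strong LLM such as ChatGPT, which is replaced here by a trivial stand-in so that the example runs offline.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def filter_by_score(
    examples: List[Example],
    score_fn: Callable[[Example], float],
    threshold: float,
) -> List[Example]:
    """Keep only the examples whose quality score is at least `threshold`."""
    return [ex for ex in examples if score_fn(ex) >= threshold]


def toy_score(example: Example) -> float:
    # Hypothetical stand-in for an LLM grader (assumption, not the paper's
    # scorer): an empty response scores 0.0, any non-empty response 5.0.
    return 0.0 if not example["response"].strip() else 5.0


data = [
    {"instruction": "Name a primary color.", "response": "Red."},
    {"instruction": "Translate 'hello' to French.", "response": ""},
]

# The threshold value is illustrative only.
kept = filter_by_score(data, toy_score, threshold=4.0)

# Training-time ratio implied by the reported 80 and 14 minutes.
speedup = 80 / 14
```

With the toy scorer, only the first pair survives the filter, and the ratio 80/14 rounds to the reported 5.7x speedup. Passing the scorer in as a function keeps the filtering logic independent of which LLM is used as the filter.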