Diversity-oriented Data Augmentation with Large Language Models

Abstract

Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP models. However, a significant challenge remains: insufficient attention to sample distribution diversity. Most existing methods focus on increasing the number of samples while neglecting the diversity of the sample distribution, which can lead to model overfitting. In response, we explore data augmentation's impact on dataset diversity and propose a Diversity-oriented data Augmentation framework (DoAug). Specifically, we use a diversity-oriented fine-tuning approach to train an LLM as a diverse paraphraser, capable of augmenting textual datasets by generating diversified paraphrases. Then, we apply the LLM paraphraser to a selected coreset of highly informative samples and integrate the paraphrases with the original data to create a more diverse augmented dataset. Finally, we conduct extensive experiments on 12 real-world textual datasets. The results show that our fine-tuned LLM augmenter improves diversity while preserving label consistency, thereby enhancing the robustness and performance of downstream tasks. Specifically, it achieves an average performance gain of 10.52%, surpassing the runner-up baseline by more than three percentage points.
