A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

Abstract

Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed, such as demonstrations, label-based summaries, and self-revision, their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods, particularly target-language demonstrations with LLM-based revisions, yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient generation strategies for synthetic data generation in low-resource scenarios with smaller models.
