Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Abstract

Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.
