Efficient Continual Pre-training of LLMs for Low-resource Languages

Abstract

Open-source large language models (Os-LLMs) propel the democratization of natural language research by giving the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, Os-LLMs offer poorer performance on low-resource languages (LRLs) than on high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus. We show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and varying extents of resource availability. For evaluation, we use IndicGenBench, a generation task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary sizes and offer insights across language families.
