Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

Abstract

Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of their capacity to low training data sizes. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains in intrinsic (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage.
