LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language

Abstract

Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance on downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.
