Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Abstract

Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, that synthetic instructions yield robust models, and, most importantly, that using an instruction-tuned model as backbone outperforms using a base non-instructed model, with results improving when scaling up. Using Llama 3.1 instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque data apart from the 1.2B word corpora. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
