LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

Abstract

The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
