Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more "traditional" language modelling approaches.