Conversations in Galician: a Large Language Model for an Underrepresented Language

Abstract

The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technology in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to adhere more accurately to provided instructions. Additionally, as a demonstration of the dataset's utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model, by following the Alpaca format. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and the fine-tuned model, Cabuxa-7B, are publicly accessible on our Huggingface Hub, and we have made the source code available to facilitate replication of this experiment and encourage further advancements for underrepresented languages.
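For readers unfamiliar with the Alpaca format mentioned above, the following is a minimal sketch of how one instruction record might be turned into a training prompt. It assumes the record fields (`instruction`, `input`, `output`) and the English prompt wording of the original Stanford Alpaca release; the Galician dataset and Cabuxa-7B may use translated or otherwise adapted templates, and the helper name `build_prompt` and the sample record are hypothetical, not taken from the paper's code or data.

```python
from typing import TypedDict


class AlpacaRecord(TypedDict):
    """One Alpaca-style example (field names as in the original Alpaca release)."""
    instruction: str
    input: str   # optional context; empty string when the task needs none
    output: str  # the demonstration the model is trained to produce


# Prompt wording from the original (English) Alpaca release.
# Assumption: a Galician adaptation could translate these strings.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(record: AlpacaRecord) -> str:
    """Hypothetical helper: pick the template based on whether context is present."""
    if record["input"].strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Illustrative record only; the real dataset holds Galician text.
example: AlpacaRecord = {
    "instruction": "Name the language this dataset targets.",
    "input": "",
    "output": "Galician.",
}

# During fine-tuning, the prompt and the expected output are concatenated.
training_text = build_prompt(example) + example["output"]
print(training_text)
```

At inference time only the prompt is supplied and the model generates the text that follows the final `### Response:` marker.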
