Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?

Abstract

Large language models (LLMs) demonstrate unprecedented capabilities and define the state of the art for almost all natural language processing (NLP) tasks and also for essentially all Language Technology (LT) applications. LLMs can only be trained for languages for which a sufficient amount of pre-training data is available, effectively excluding many languages that are typically characterised as under-resourced. However, there is both circumstantial and empirical evidence that multilingual LLMs, which have been trained using datasets that cover multiple languages (including under-resourced ones), do exhibit strong capabilities for some of these under-resourced languages. Eventually, this approach may have the potential to be a technological off-ramp for those under-resourced languages for which "native" LLMs, and LLM-based technologies, cannot be developed due to a lack of training data. This paper, which concentrates on European languages, examines this idea, analyses the current situation in terms of technology support and summarises related work. The article concludes by focusing on the key open questions that need to be answered for the approach to be put into practice in a systematic way.
