Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan

Abstract

Multilingual language models have been a crucial breakthrough as they considerably reduce the need for data in under-resourced languages. Nevertheless, the superiority of language-specific models has already been proven for languages with access to large amounts of data. In this work, we focus on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models. For this, we: (1) build a clean, high-quality textual Catalan corpus (CaText), the largest to date (but only a fraction of the size typically used in previous work on monolingual language models), (2) train a Transformer-based language model for Catalan (BERTa), and (3) devise a thorough evaluation in a diversity of settings, comprising a complete array of downstream tasks, namely, Part of Speech Tagging, Named Entity Recognition and Classification, Text Classification, Question Answering, and Semantic Textual Similarity, with most of the corresponding datasets being created ex novo. The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource, together with the clean textual corpus, the language model, and the cleaning pipeline. Using state-of-the-art multilingual models and a monolingual model trained only on Wikipedia as baselines, we consistently observe the superiority of our model across tasks and settings.
