Abstract
This paper explores the performance of encoder and decoder language models on multilingual Natural Language Understanding (NLU) tasks, with a broad focus on Germanic languages. Building upon the ScandEval benchmark, initially restricted to evaluating encoder models, we extend the evaluation framework to include decoder models. We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English. Through a series of experiments and analyses, we also address research questions regarding the comparative performance of encoder and decoder models, the impact of NLU task types, and the variation across language resources. Our findings reveal that encoder models can achieve significantly better NLU performance than decoder models despite having orders of magnitude fewer parameters. Additionally, we investigate the correlation between decoders and task performance via a UMAP analysis, shedding light on the unique capabilities of decoder and encoder models. This study contributes to a deeper understanding of language model paradigms in NLU tasks and provides valuable insights for model selection and evaluation in multilingual settings.