Towards Multilingual LLM Evaluation for European Languages

  • 2024-10-17 18:58:53
  • Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, Mehdi Ali

Abstract

The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks. We introduce a multilingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.

 
