Abstract
The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks. We introduce a multilingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.