Abstract
Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.