Abstract
Since the release of ChatGPT, the field of Natural Language Processing has experienced rapid advancements, particularly in Large Language Models (LLMs) and their multimodal counterparts, Large Multimodal Models (LMMs). Despite their impressive capabilities, LLMs often exhibit significant performance disparities across different languages and cultural contexts, as demonstrated by various text-only benchmarks. However, current research lacks such benchmarks for multimodal visio-linguistic settings. This work fills this gap by introducing M5, the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual and multicultural context. M5 includes eight datasets covering five tasks and $41$ languages, with a focus on underrepresented languages and culturally diverse images. Furthermore, we introduce two novel datasets, M5-VGR and M5-VLOD, including a new Visio-Linguistic Outlier Detection task, in which all evaluated open-source models fail to significantly surpass the random baseline. Through extensive evaluation and analyses, we highlight substantial task-agnostic performance disparities between high- and low-resource languages. Moreover, we show that larger models do not necessarily outperform smaller ones in a multilingual setting.