Abstract
Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.