Abstract
The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and language coverage, many rely on translations of English datasets and fail to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. It covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.