Abstract
Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages, over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by a comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.