Abstract
Multilingual vision-language models promise universal image-text retrieval, yet their social biases remain under-explored. We present the first systematic audit of three public multilingual CLIP checkpoints -- M-CLIP, NLLB-CLIP, and CAPIVARA-CLIP -- across ten languages that vary in resource availability and grammatical gender. Using balanced subsets of \textsc{FairFace} and the \textsc{PATA} stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the assumption that multilinguality mitigates bias, every model exhibits stronger gender bias than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared cross-lingual encoder of NLLB-CLIP transports English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this transfer. Highly gendered languages consistently magnify all measured bias types, but even gender-neutral languages remain vulnerable when cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics conceal language-specific ``hot spots,'' underscoring the need for fine-grained, language-aware bias evaluation in future multilingual vision-language research.