Semantic and Expressive Variation in Image Captions Across Languages

Abstract

Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. In this paper, we study how these differences manifest themselves in vision-language datasets and models, using language as a proxy for culture. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. When datasets are multilingual as opposed to monolingual, descriptions have higher semantic coverage on average, where coverage is measured using scene graphs, model embeddings, and linguistic taxonomies. For example, multilingual descriptions have on average 29.9% more objects, 24.5% more relations, and 46.0% more attributes than a set of monolingual captions. When prompted to describe images in different languages, popular models (e.g. LLaVA) inherit this bias and describe different parts of the image. Moreover, finetuning models on captions from one language performs best on corresponding test data from that language, while finetuning on multilingual data performs consistently well across all test data compositions. Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.
