Evaluating language-biased image classification based on semantic representations

Abstract

Humans show language-biased image recognition for a word-embedded image, known as picture-word interference. Such interference depends on hierarchical semantic categories and indicates that human language processing interacts strongly with visual processing. Similar to humans, recent artificial models jointly trained on texts and images, e.g., OpenAI CLIP, show language-biased image classification. Exploring whether the bias leads to interference similar to that observed in humans can contribute to understanding how much the model acquires hierarchical semantic representations from joint learning of language and vision. The present study introduces methodological tools from the cognitive science literature to assess the biases of artificial models. Specifically, we introduce a benchmark task to test whether words superimposed on images can distort the image classification across different category levels and, if so, whether the perturbation is due to a shared semantic representation between language and vision. Our dataset is a set of word-embedded images and consists of a mixture of natural image datasets and hierarchical word labels with superordinate/basic category levels. Using this benchmark test, we evaluate the CLIP model. We show that presenting words distorts the image classification by the model across different category levels, but the effect does not depend on the semantic relationship between images and embedded words. This suggests that the semantic word representation in the CLIP visual processing is not shared with the image representation, although the word representation strongly dominates for word-embedded images.
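
As an illustrative sketch only (not the authors' code), the question of whether the interference depends on the semantic relationship between image and word could be organized as below. The label hierarchy, the condition names, the function names, and the trial data are all hypothetical and made up for illustration; in practice the predicted labels would come from a classifier such as CLIP applied to the word-embedded images.

```python
from collections import defaultdict

# Hypothetical basic-to-superordinate mapping (illustrative, not the paper's label set).
SUPERORDINATE = {"dog": "animal", "cat": "animal", "car": "vehicle", "bus": "vehicle"}


def condition(image_label, word_label):
    """Semantic relationship between the image category and the superimposed word."""
    if image_label == word_label:
        return "congruent"
    if SUPERORDINATE[image_label] == SUPERORDINATE[word_label]:
        return "same_superordinate"
    return "different_superordinate"


def accuracy_by_condition(trials):
    """Return per-condition accuracy, where a trial counts as correct
    when the predicted label equals the image label.

    trials: iterable of (image_label, word_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for image_label, word_label, predicted in trials:
        c = condition(image_label, word_label)
        total[c] += 1
        correct[c] += int(predicted == image_label)
    return {c: correct[c] / total[c] for c in total}


# Made-up predictions, purely to show the shape of the analysis.
trials = [
    ("dog", "dog", "dog"),
    ("dog", "cat", "cat"),
    ("dog", "cat", "dog"),
    ("dog", "car", "car"),
    ("car", "bus", "bus"),
    ("car", "dog", "car"),
]
result = accuracy_by_condition(trials)
```

If interference were driven by a shared semantic representation, accuracy would be expected to differ between the same-superordinate and different-superordinate conditions; similar accuracy in both would be in line with the abstract's finding that the effect does not depend on the semantic relationship.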
