Abstract
Contrastive vision-language models (VLMs), like CLIP, have gained popularity for their versatile applicability to various downstream tasks. Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition. Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes. In this analysis paper, we investigate both phenomena thoroughly. We evaluate off-the-shelf VLMs and find that, while the gap's influence on performance is typically overshadowed by other factors, there are indications that closing the gap indeed leads to improvements. Moreover, we find that, contrary to intuition, only a few embedding dimensions drive the gap and that the embedding spaces are organized differently. To allow for a clean study of object bias, we introduce a definition and a corresponding measure of it. Equipped with this tool, we find that object bias does not, per se, lead to worse performance on other concepts, such as attributes. However, why do both phenomena, modality gap and object bias, emerge in the first place? To answer this fundamental question and uncover some of the inner workings of contrastive VLMs, we conducted experiments that allowed us to control the amount of shared information between the modalities. These experiments revealed that the driving factor behind both the modality gap and the object bias is an information imbalance between images and captions, and unveiled an intriguing connection between the modality gap and the entropy of the logits.