Abstract
Many vision-language models (VLMs) that prove very effective at a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that, despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.