Abstract
Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.