Abstract
Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the \textit{circuits} - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.