Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Abstract

Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
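
As a rough illustration of what "connecting parameters of different models" can look like, the sketch below linearly interpolates the parameters that a VLM and a reasoning LLM have in common and leaves VLM-only parameters (for example a vision encoder) untouched. The abstract does not state the merging rule, so the interpolation, the function name merge_parameters, the weight alpha, and the toy parameter names are all assumptions made for illustration, not the paper's method.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_parameters(vlm: Params, llm: Params, alpha: float = 0.5) -> Params:
    """Hypothetical training-free merge by linear interpolation.

    Parameters present in both models are blended as
    alpha * vlm + (1 - alpha) * llm. Parameters found only in the VLM
    are copied unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    merged: Params = {}
    for name, vlm_w in vlm.items():
        llm_w = llm.get(name)
        if llm_w is None:
            # No counterpart in the LLM: keep the VLM weights as they are.
            merged[name] = list(vlm_w)
            continue
        if len(llm_w) != len(vlm_w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * v + (1.0 - alpha) * r for v, r in zip(vlm_w, llm_w)]
    return merged


# Made-up parameter names and values, used only to exercise the function.
vlm_params = {"vision_encoder.w": [1.0, 2.0], "layer0.w": [0.0, 4.0]}
llm_params = {"layer0.w": [2.0, 0.0]}
merged = merge_parameters(vlm_params, llm_params, alpha=0.5)
```

With alpha = 1.0 the result equals the original VLM, so alpha acts as a dial for how much of the LLM's weights are blended in. No gradient updates are involved, which is the sense in which such a merge is training-free.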
