Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

Abstract

In Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively little explored. In this study, we propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintaining low computation cost. We first reveal the existence of visual anchors in the Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining to guide the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method reduces computational costs by nearly two-thirds compared with the baseline while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer.
