Introducing Visual Perception Token into Multimodal Large Language Model

Abstract

To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, an MLLM still lacks the autonomous capability to control its own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and the resulting model even outperforms a 7B parameter model by 13.4% (from 0.624). Please check out our repo: https://github.com/yu-rp/VisualPerceptionToken
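
To make the Region Selection Token idea concrete, the following is a minimal sketch of how a generated token could trigger an additional perception step. The abstract does not specify how a region is encoded, so the token syntax `<region x0,y0,x1,y1>` (fractions of image width and height) and the helper names `parse_region_token`, `crop`, and `extra_perception` are assumptions made purely for illustration and are not taken from the authors' repository.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical token syntax (an assumption, not the paper's format):
# "<region x0,y0,x1,y1>" with coordinates as fractions of width/height.
REGION_TOKEN = re.compile(r"<region ([\d.]+),([\d.]+),([\d.]+),([\d.]+)>")

Image = List[List[int]]  # toy image: a list of pixel rows
Box = Tuple[float, float, float, float]


def parse_region_token(text: str) -> Optional[Box]:
    """Return the first region box found in generated text, or None."""
    match = REGION_TOKEN.search(text)
    if match is None:
        return None
    x0, y0, x1, y1 = (float(g) for g in match.groups())
    return x0, y0, x1, y1


def crop(image: Image, box: Box) -> Image:
    """Crop the image to a fractional box, keeping at least one pixel."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    left, top = int(x0 * width), int(y0 * height)
    right = min(width, max(left + 1, int(x1 * width)))
    bottom = min(height, max(top + 1, int(y1 * height)))
    return [row[left:right] for row in image[top:bottom]]


def extra_perception(generated: str, image: Image) -> Optional[Image]:
    """If the output contains a region token, return the crop to re-encode."""
    box = parse_region_token(generated)
    return None if box is None else crop(image, box)
```

In a full system, the returned crop would be passed through the vision encoder again and its features appended to the model's context before generation continues; that step is omitted here because it depends on the specific model.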
