Abstract
The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and incurs prohibitively expensive computation. To gain insight into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs and summarize three main inference stages: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens stop contributing to reasoning once the text tokens have received enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a range of benchmarks. The experimental results not only show the effectiveness of DyVTE in improving MLLMs' efficiency, but also reveal general modeling patterns of MLLMs, facilitating an in-depth understanding of MLLMs. Our code is released at https://github.com/DoubtedSteam/DyVTE.
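The exit mechanism described above can be illustrated with a minimal sketch. It is not the released implementation. The names `exit_gate`, `run_with_visual_exit` and `toy_layer`, the mean-pooling gate, the fixed `threshold`, and the toy layer that stands in for a Transformer block are all assumptions made for illustration. The sketch only shows the control flow: a small gate inspects the text-token states after each layer, and once it fires, all visual tokens are removed for the remaining layers.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Layer = Callable[[List[Vector], List[Vector]], Tuple[List[Vector], List[Vector]]]


def exit_gate(text_states: Sequence[Vector], weights: Vector, bias: float) -> float:
    """Hypothetical stand-in for the lightweight hyper-network.

    Mean-pools the text-token states, applies one linear unit and a sigmoid,
    and returns an exit score in (0, 1).
    """
    dim = len(weights)
    pooled = [sum(tok[d] for tok in text_states) / len(text_states) for d in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def toy_layer(visual: List[Vector], text: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Toy layer (assumption): each text token moves halfway toward the mean
    visual token. Without visual tokens, the text states are left unchanged."""
    if not visual:
        return visual, text
    dim = len(visual[0])
    mean_v = [sum(v[d] for v in visual) / len(visual) for d in range(dim)]
    new_text = [[t[d] + 0.5 * (mean_v[d] - t[d]) for d in range(dim)] for t in text]
    return visual, new_text


def run_with_visual_exit(
    visual: List[Vector],
    text: List[Vector],
    layers: Sequence[Layer],
    weights: Vector,
    bias: float,
    threshold: float = 0.5,
) -> Tuple[List[Vector], Optional[int], List[int]]:
    """Run all layers, dropping every visual token once the gate fires.

    Returns the final text states, the index of the exit layer (or None),
    and the number of tokens fed into each layer.
    """
    exit_layer: Optional[int] = None
    tokens_per_layer: List[int] = []
    for idx, layer in enumerate(layers):
        tokens_per_layer.append(len(visual) + len(text))
        visual, text = layer(visual, text)
        if exit_layer is None and exit_gate(text, weights, bias) > threshold:
            visual = []  # remove all visual tokens for the remaining layers
            exit_layer = idx
    return text, exit_layer, tokens_per_layer


if __name__ == "__main__":
    out, exit_at, counts = run_with_visual_exit(
        visual=[[1.0, 1.0], [3.0, 3.0]],
        text=[[0.0, 0.0]],
        layers=[toy_layer] * 4,
        weights=[1.0, 1.0],
        bias=-3.2,
    )
    print(exit_at, counts, out)
```

In this toy run the gate fires after the third layer, so the last layer processes only the text token. In an actual MLLM, the saving would come from the shorter sequence seen by every layer after the exit point.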