Abstract
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.