Learning Free Token Reduction for Multi-Modal Large Language Models

Abstract

Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks; however, their practical deployment is often constrained by high computational costs and prolonged inference times. Since the vision modality typically carries more information than the text modality, compressing visual prompts offers a promising solution to alleviate these challenges. Existing approaches predominantly focus on refining model architectures or directly reducing the number of visual tokens. However, these methods often compromise inference performance due to a lack of consideration for the unique spatial and temporal characteristics of visual data. In this work, we propose a token compression paradigm that operates on both spatial and temporal dimensions. Our approach includes a learning-free, plug-and-play compression pipeline that can be seamlessly integrated into most Multimodal Large Language Model (MLLM) frameworks. By leveraging this method, we enhance the model inference capability while simultaneously reducing its computational cost. Experimental results on the Video-QA task demonstrate the effectiveness of the proposed approach, showcasing significant improvements in efficiency without sacrificing performance.
