Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

Abstract

To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We find that the critical components of existing methods are tightly intertwined, and that their interconnections and effects remain too unclear to support comparison, transfer, and expansion. Therefore, we propose a unified "filter-correlate-compress" paradigm that decomposes token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify popular existing works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with minimal impact on performance, while surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.
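
As a rough illustration of how a three-stage "filter-correlate-compress" pipeline could be organized, the sketch below separates the stages into three functions. The abstract does not specify how each stage is implemented, so everything inside the stages is an assumption made for illustration: the function names, the redundancy score (mean cosine similarity to the other tokens), the nearest-neighbor matching, and the averaging rule are all hypothetical and should not be read as the paper's actual methods.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_tokens(tokens, num_drop):
    """Stage 1 (filter): decide WHICH tokens to discard.
    Assumed criterion: drop the tokens most similar, on average, to the rest."""
    n = len(tokens)
    scores = []
    for i in range(n):
        sims = [cosine(tokens[i], tokens[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    # Always keep at least one token so later stages have a merge target.
    num_drop = max(0, min(num_drop, n - 1))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(order[:num_drop])

def correlate_tokens(tokens, dropped):
    """Stage 2 (correlate): decide WHERE each discarded token's information goes.
    Assumed rule: map each dropped token to its most similar kept token."""
    dropped_set = set(dropped)
    kept = [i for i in range(len(tokens)) if i not in dropped_set]
    return {d: max(kept, key=lambda k: cosine(tokens[d], tokens[k])) for d in dropped}

def compress_tokens(tokens, dropped, targets):
    """Stage 3 (compress): decide HOW to fuse the information.
    Assumed rule: average each kept token with the dropped tokens mapped to it."""
    dropped_set = set(dropped)
    groups = {}
    for d, t in targets.items():
        groups.setdefault(t, []).append(d)
    reduced = []
    for i, tok in enumerate(tokens):
        if i in dropped_set:
            continue
        members = [tok] + [tokens[d] for d in groups.get(i, [])]
        reduced.append([sum(col) / len(members) for col in zip(*members)])
    return reduced

def reduce_tokens(tokens, num_drop):
    """Run the three stages in sequence on a list of token vectors."""
    dropped = filter_tokens(tokens, num_drop)
    targets = correlate_tokens(tokens, dropped)
    return compress_tokens(tokens, dropped, targets)

# Toy usage: two near-duplicate tokens and one distinct token.
toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reduced = reduce_tokens(toy_tokens, num_drop=1)
```

Keeping the stages as separate functions means any one of them can be swapped for a different rule without touching the other two, which is the kind of decomposition the paradigm describes.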
