Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

Abstract

In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, which overlooks instruction relevance; both lead to suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with a determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes a new state of the art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even at high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
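
To make the idea of DPP-based selection concrete, the following is a minimal sketch and not the authors' implementation (see the repository above for that). It assumes a kernel of the form L[i][j] = r_i * cos(v_i, v_j) * r_j, where r_i is a per-token instruction-relevance score; this is a common way to build a quality-weighted DPP kernel, and the abstract does not state that CDPruner uses exactly this form. The function names `cosine`, `conditional_kernel`, and `greedy_dpp`, as well as the toy data, are hypothetical. Selection uses a standard greedy MAP approximation with incremental Cholesky updates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def conditional_kernel(tokens, relevance):
    """Assumed kernel: pairwise similarity scaled by instruction relevance."""
    n = len(tokens)
    return [
        [relevance[i] * cosine(tokens[i], tokens[j]) * relevance[j] for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp(kernel, k, eps=1e-10):
    """Greedily pick up to k indices that approximately maximize det(L_S)."""
    n = len(kernel)
    gains = [kernel[i][i] for i in range(n)]  # marginal determinant gain per item
    chol = [[] for _ in range(n)]             # incremental Cholesky rows
    selected = []
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in selected]
        j = max(candidates, key=lambda i: gains[i])
        if gains[j] < eps:  # remaining items add (numerically) no diversity
            break
        selected.append(j)
        root = math.sqrt(gains[j])
        for i in candidates:
            if i == j:
                continue
            # Component of item i along the direction newly spanned by item j.
            e = (kernel[j][i] - sum(a * b for a, b in zip(chol[j], chol[i]))) / root
            chol[i].append(e)
            gains[i] -= e * e
    return selected


# Toy data (hypothetical): token 1 nearly duplicates token 0.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
relevance = [1.0, 0.9, 0.8, 0.5]
kept = greedy_dpp(conditional_kernel(tokens, relevance), k=2)
print(kept)  # [0, 2]: the near-duplicate token 1 is skipped despite high relevance
```

In this toy example, picking purely by relevance would keep tokens 0 and 1, which carry almost the same information, whereas the determinant-based objective trades relevance against redundancy and keeps tokens 0 and 2.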
