SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Abstract

Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (fewer than approximately 5%) of attention heads in LLMs actively contribute to visual understanding; we term these visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Unlike prior KV-Cache acceleration methods, which ignore the particular characteristics of visual inputs, SparseMM emphasizes and retains visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on the efficiency test. Our project is open-sourced at https://github.com/CR400AF-A/SparseMM.
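
The idea of giving heads asymmetric KV-Cache budgets according to their visual scores can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `allocate_head_budgets`, the `min_per_head` floor, and the simple proportional rule are all assumptions made for illustration, and the abstract does not specify how SparseMM actually converts scores into budgets.

```python
from typing import List


def allocate_head_budgets(
    visual_scores: List[float],
    total_budget: int,
    min_per_head: int = 0,
) -> List[int]:
    """Split a total KV-cache token budget across attention heads.

    Hypothetical rule (an assumption, not taken from the paper): every head
    gets a uniform floor of `min_per_head` tokens, and the remaining budget
    is divided in proportion to each head's visual score.
    """
    n = len(visual_scores)
    if n == 0:
        return []
    if any(s < 0 for s in visual_scores):
        raise ValueError("visual scores must be non-negative")
    if min_per_head * n > total_budget:
        raise ValueError("total_budget is too small for the per-head floor")

    remaining = total_budget - min_per_head * n
    score_sum = sum(visual_scores)
    if score_sum == 0:
        # No head stands out: fall back to a uniform split.
        shares = [remaining / n] * n
    else:
        shares = [remaining * s / score_sum for s in visual_scores]

    # Round down, then hand out leftover tokens to the largest fractional parts
    # so the budgets sum exactly to total_budget.
    budgets = [min_per_head + int(share) for share in shares]
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # One strongly visual head, one weak head, two heads with no visual score.
    print(allocate_head_budgets([9, 1, 0, 0], total_budget=100, min_per_head=5))
```

Under this toy rule, the few heads with high visual scores keep most of the cache while the remaining heads retain only the floor, which mirrors the sparsity observation described in the abstract.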
