Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, inference computation and memory grow progressively as output tokens are generated during decoding, directly affecting the efficacy of MLLMs. Existing methods attempt to reduce vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose Dynamic-LLaVA, a dynamic vision-language context sparsification framework that dynamically reduces the redundancy of the vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for each inference mode, i.e., prefill, decoding with KV cache, and decoding without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by $\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\% when decoding without KV cache and saves $\sim$50\% of GPU memory overhead when decoding with KV cache, owing to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible degradation in understanding and generation ability, or even performance gains, compared to the full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava .
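
To illustrate why savings from vision-only reduction fade during decoding, the following back-of-the-envelope sketch counts the tokens held in the KV cache as generation proceeds. It is not the Dynamic-LLaVA implementation: the function names, the token counts (576 vision tokens, 64 prompt tokens), and the keep ratios (0.25 for vision, 0.5 for generated tokens) are all assumptions chosen for illustration.

```python
def kv_cache_tokens(vision_tokens, prompt_tokens, generated_tokens,
                    vision_keep=1.0, generated_keep=1.0):
    """Tokens held in the KV cache after `generated_tokens` decoding steps.

    `vision_keep` and `generated_keep` are hypothetical keep ratios for the
    vision context and the generated language context, respectively.
    """
    kept_vision = round(vision_tokens * vision_keep)
    kept_generated = round(generated_tokens * generated_keep)
    return kept_vision + prompt_tokens + kept_generated


def saving(full, reduced):
    """Fraction of cache entries saved relative to the full context."""
    return 1.0 - reduced / full


# Assumed sizes, for illustration only.
VISION, PROMPT = 576, 64

for generated in (0, 512, 2048):
    full = kv_cache_tokens(VISION, PROMPT, generated)
    # Sparsify only the vision context (prefill-stage reduction).
    vision_only = kv_cache_tokens(VISION, PROMPT, generated, vision_keep=0.25)
    # Sparsify both the vision context and the generated language context.
    both = kv_cache_tokens(VISION, PROMPT, generated,
                           vision_keep=0.25, generated_keep=0.5)
    print(f"generated={generated:5d}  "
          f"vision-only saving={saving(full, vision_only):.1%}  "
          f"vision+language saving={saving(full, both):.1%}")
```

Under these assumed numbers, the vision-only saving shrinks as the generated sequence grows, because the untouched generated tokens come to dominate the cache, while sparsifying the generated context as well keeps the saving from collapsing.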
