HiMix: Reducing Computational Complexity in Large Vision-Language Models

Abstract

Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, the excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. Then, we propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages within each language decoder layer. It is striking that our approach significantly reduces computational complexity with minimal performance loss. Specifically, HiMix achieves a 10x reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. This highlights the advantages of our method, and we hope our research brings new perspectives to the field of vision-language understanding. Project Page: https://xuange923.github.io/HiMix
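
To make the idea in the abstract concrete, the following is a minimal NumPy sketch of one way language tokens could attend to vision features without the vision sequence being propagated itself. It is an illustration under stated assumptions, not the authors' implementation: the function name `mixture_attention`, the single-head layout, the shared key/value projections for both modalities, and the causal mask over language tokens are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_attention(lang, vis, w_q, w_k, w_v):
    """Hypothetical single-head sketch: language queries, mixed keys/values.

    lang: (n_lang, d) language hidden states, the only sequence carried forward.
    vis:  (n_vis, d) vision features, used here only as extra keys/values.
    Returns updated language states of shape (n_lang, d).
    """
    n_lang, n_vis = lang.shape[0], vis.shape[0]
    q = lang @ w_q                                # queries come from language only
    kv_src = np.concatenate([vis, lang], axis=0)  # keys/values mix both modalities
    k, v = kv_src @ w_k, kv_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (n_lang, n_vis + n_lang)
    # Assumption: vision is fully visible, language is causally masked.
    causal = np.tril(np.ones((n_lang, n_lang), dtype=bool))
    mask = np.concatenate([np.ones((n_lang, n_vis), dtype=bool), causal], axis=1)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

# Toy usage with made-up sizes: 16 vision tokens, 4 language tokens.
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = mixture_attention(rng.normal(size=(4, d)), rng.normal(size=(16, d)), w_q, w_k, w_v)
print(out.shape)  # (4, 8): only language positions are produced
```

In this sketch the output has one row per language token, so any later per-token computation would run over 4 positions rather than 20. That is the intuition behind dropping the vision sequence from full forward propagation; the actual savings depend on the real architecture and token counts.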
