MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

Abstract

With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. In this work, we present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework. We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance. Code and models are released at http://github.com/hustvl/MaTVLM.
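
As a rough illustration of the two ideas named in the abstract (swapping a portion of the decoder layers for Mamba-2 and distilling from the pre-trained teacher), the sketch below picks which layer indices to replace and computes a logit-level distillation loss. The abstract does not say which layers are replaced, what ratio is used, or what form the loss takes, so the even spacing, the 25% ratio, the KL-divergence loss, and the names `pick_mamba_layers` and `kl_distill_loss` are all assumptions made for this example, not the released implementation.

```python
import math


def pick_mamba_layers(num_layers: int, ratio: float) -> list[int]:
    """Hypothetical helper: choose decoder layer indices to replace with Mamba-2.

    Assumption: replaced layers are spread evenly through the stack.
    """
    k = round(num_layers * ratio)
    if k == 0:
        return []
    step = num_layers / k
    # Take the middle index of each of the k equal-width bands of layers.
    return [int(i * step + step / 2) for i in range(k)]


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over a single vector of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distill_loss(
    teacher_logits: list[float],
    student_logits: list[float],
    temperature: float = 1.0,
) -> float:
    """Assumed loss form: KL(teacher || student) over one token's vocabulary."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0
    )


if __name__ == "__main__":
    # Illustrative values only: a 24-layer decoder with a 25% replacement ratio.
    print(pick_mamba_layers(24, 0.25))
    print(kl_distill_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8]))
```

In a real training loop the loss would be averaged over all token positions and the student's replaced layers would start from the corresponding attention weights, as the abstract describes; that initialization is not reproduced here.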
