LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Abstract

Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as \textit{degraded performance with more images} and \textit{high computational costs}. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge \textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first hybrid MLLM, and it achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
