LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Abstract

Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models, including system, model training, and dataset development. On the system side, we introduce the first long-context Multi-Modal Sequence Parallelism (MM-SP) system for long-context training and inference, supporting 2M context length training on 256 GPUs without any gradient checkpointing. MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron context parallelism + tensor parallelism in text-only settings. Moreover, it seamlessly integrates with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, short supervised fine-tuning, context extension, and long supervised fine-tuning. On datasets, we construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. LongVILA extends the number of frames of VILA from 8 to 1024, and improves the long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy in 1400-frame video (274k context length) needle-in-a-haystack. LongVILA-8B demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases.
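
To put the headline numbers in perspective, the following is a rough back-of-the-envelope sketch using only figures quoted in the abstract. It assumes that "2M" means 2,000,000 tokens, that "274k" means 274,000 tokens, and that the sequence is sharded evenly across all 256 GPUs; none of these assumptions is stated in the abstract, and the helper names are hypothetical.

```python
# Back-of-the-envelope arithmetic on figures quoted in the abstract.
# Assumptions (not stated in the source): "2M" = 2,000,000 tokens,
# "274k" = 274,000 tokens, and an even split of the sequence across GPUs.

def tokens_per_frame(context_tokens: int, frames: int) -> float:
    """Average context tokens per video frame (includes any text tokens)."""
    return context_tokens / frames


def tokens_per_gpu(context_tokens: int, gpus: int) -> float:
    """Tokens held by each GPU if the sequence is sharded evenly."""
    return context_tokens / gpus


def improvement_ratio(before: float, after: float) -> float:
    """Relative gain of a score, as a multiplicative factor."""
    return after / before


frame_budget = tokens_per_frame(274_000, 1400)   # needle-in-a-haystack setting
gpu_shard = tokens_per_gpu(2_000_000, 256)       # 2M-context training setting
caption_gain = improvement_ratio(2.00, 3.26)     # long video captioning score

print(f"approx. tokens per frame: {frame_budget:.1f}")
print(f"approx. tokens per GPU:   {gpu_shard:.1f}")
print(f"captioning score gain:    {caption_gain:.2f}x")
```

Under these assumptions, each frame accounts for a little under 200 tokens of context, and each GPU holds a shard of several thousand tokens, which suggests why a 2M-token sequence can be split across 256 devices.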
