LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Abstract

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 1024, improving the long video captioning score from 2.00 to 3.26 (out of 5), achieving 99.5% accuracy in 1400-frame (274k context length) video needle-in-a-haystack. LongVILA-8B demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases. Besides, MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron with context parallelism + tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
