LongVILA: Scaling Long-Context Visual Language Models for Long Videos

  • 2024-08-21 18:47:33
  • Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han

Abstract

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 1024, improving the long video captioning score from 2.00 to 3.26 (out of 5), achieving 99.5% accuracy in 1400-frame (274k context length) video needle-in-a-haystack. LongVILA-8B demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases. Besides, MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron with context parallelism + tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
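
To give a sense of scale for the figures quoted above, the following back-of-the-envelope sketch derives the implied token budget per video frame and per GPU. It uses only numbers from the abstract. Two things are assumptions rather than statements from the paper: that "2M" means exactly 2,000,000 tokens, and that the sequence is split evenly across all 256 GPUs (the real system may shard differently). The variable names are illustrative.

```python
# Figures quoted in the abstract.
frames = 1400
context_tokens = 274_000  # "274k context length", rounded in the source

# Implied visual-token budget per frame (approximate, since 274k is rounded).
tokens_per_frame = context_tokens / frames
print(f"tokens per frame: {tokens_per_frame:.1f}")

# Assumption: "2M" is read as 2,000,000 tokens, and the sequence is
# sharded evenly across every GPU. Neither is confirmed by the abstract.
total_tokens = 2_000_000
gpus = 256
tokens_per_gpu = total_tokens / gpus
print(f"tokens per GPU under even sharding: {tokens_per_gpu:.1f}")
```

Under these assumptions each frame contributes a little under 200 tokens, and each GPU holds only a few thousand tokens of the sequence, which is consistent with the claim that gradient checkpointing is not needed.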

 
