LongVILA: Scaling Long-Context Visual Language Models for Long Videos

  • 2024-08-20 18:56:24
  • Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han

Abstract

Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models, including system, model training, and dataset development. On the system side, we introduce the first long-context Multi-Modal Sequence Parallelism (MM-SP) system that enables long training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron context parallelism + tensor parallelism in text-only settings. Moreover, it seamlessly integrates with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, short supervised fine-tuning, context extension, and long supervised fine-tuning. On datasets, we construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. LongVILA extends the number of frames of VILA from 8 to 1024, and improves the long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy in 1400-frames video (274k context length) needle-in-a-haystack. LongVILA-8B demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of frames increases.
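
The headline numbers above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes "2M" means 2,000,000 tokens and "274k" means 274,000 tokens, and it assumes the sequence is split evenly across all 256 GPUs, which the abstract does not state. The variable names are hypothetical.

```python
# Back-of-the-envelope checks on figures quoted in the abstract.
# Assumptions (not from the paper): "2M" = 2_000_000 tokens,
# "274k" = 274_000 tokens, and an even split across all 256 GPUs.

CAPTION_SCORE_BEFORE = 2.00
CAPTION_SCORE_AFTER = 3.26
NEEDLE_CONTEXT_TOKENS = 274_000   # assumed reading of "274k"
NEEDLE_FRAMES = 1400
TRAIN_CONTEXT_TOKENS = 2_000_000  # assumed reading of "2M"
NUM_GPUS = 256

# Relative improvement in the long video captioning score.
caption_gain = CAPTION_SCORE_AFTER / CAPTION_SCORE_BEFORE

# Average context tokens per frame in the needle-in-a-haystack setting
# (includes any text tokens, so it is an upper bound on visual tokens).
tokens_per_frame = NEEDLE_CONTEXT_TOKENS / NEEDLE_FRAMES

# Tokens each GPU would hold if the sequence were sharded evenly.
tokens_per_gpu = TRAIN_CONTEXT_TOKENS / NUM_GPUS

print(f"caption gain: {caption_gain:.2f}x")
print(f"tokens per frame: {tokens_per_frame:.1f}")
print(f"tokens per GPU: {tokens_per_gpu:.1f}")
```

Under these assumptions the captioning gain is consistent with the reported 1.6x, and each frame accounts for a little under 200 context tokens on average.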

 
