SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

  • 2025-03-24 18:59:07
  • Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan

Abstract

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs. We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes. Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.
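
As a rough illustration of why a two-stream design can be token-efficient, the sketch below compares the visual token count of a slow stream (few frames, many tokens per frame) plus a fast stream (many frames, few tokens per frame) against keeping every token of every frame. All frame counts, grid sizes, and names (`Pathway`, `uniform_frame_indices`) are hypothetical values chosen for the arithmetic; they are not the configuration used by SF-LLaVA-1.5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """One stream: how many frames it keeps and the token grid kept per frame."""
    num_frames: int
    tokens_h: int
    tokens_w: int

    @property
    def num_tokens(self) -> int:
        return self.num_frames * self.tokens_h * self.tokens_w


def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick the centre frame of each of `num_samples` equal-length segments."""
    if not 0 < num_samples <= total_frames:
        raise ValueError("num_samples must be in 1..total_frames")
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical setup: 64 input frames, each encoded as a 24x24 token grid.
TOTAL_FRAMES = 64
dense = Pathway(num_frames=TOTAL_FRAMES, tokens_h=24, tokens_w=24)

# Assumed slow stream: 8 frames, lightly pooled to 12x12 tokens each.
slow = Pathway(num_frames=8, tokens_h=12, tokens_w=12)
# Assumed fast stream: all 64 frames, heavily pooled to 4x4 tokens each.
fast = Pathway(num_frames=TOTAL_FRAMES, tokens_h=4, tokens_w=4)

slow_frame_ids = uniform_frame_indices(TOTAL_FRAMES, slow.num_frames)
slowfast_total = slow.num_tokens + fast.num_tokens

if __name__ == "__main__":
    print("slow stream frames:", slow_frame_ids)
    print("dense tokens:", dense.num_tokens)
    print("slow + fast tokens:", slowfast_total)
```

Under these assumed numbers, the slow stream preserves spatial detail on a handful of frames while the fast stream covers the whole clip cheaply, so the combined budget stays far below the dense count.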

 
