SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Abstract

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. This model family employs the two-stream SlowFast mechanism, enabling efficient modeling of long-range temporal context to meet the demand for lightweight, mobile-friendly Video LLMs. We provide models ranging from 1B to 7B parameters, optimized through a streamlined training pipeline and a high-quality data mixture composed of publicly available datasets. Experimental results demonstrate that SF-LLaVA-1.5 achieves competitive performance on a wide range of video and image benchmarks, with robust results across all model sizes. Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales (1B and 3B) across various video benchmarks.
