Engagement Prediction of Short Videos with Large Multimodal Models

Abstract

The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, and background sound, but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling the two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.
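
The abstract does not say how the two model types are ensembled. As a rough illustration only, one common option is a weighted average of per-video scores, sketched below. The function name `ensemble_scores`, the `weight_a` parameter, and the choice of averaging are all assumptions made for this example and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def ensemble_scores(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    weight_a: float = 0.5,
) -> List[float]:
    """Blend two models' per-video engagement scores by weighted averaging.

    Hypothetical helper: scores_a could hold predictions from an
    audio-visual-language model and scores_b from a visual-language model.
    The averaging rule is an assumption, not the paper's stated method.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same videos")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    weight_b = 1.0 - weight_a
    # Element-wise convex combination of the two prediction lists.
    return [weight_a * a + weight_b * b for a, b in zip(scores_a, scores_b)]
```

In a setup like this, the blend weight would typically be chosen on a held-out validation split rather than fixed in advance.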
