V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Abstract

Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their small number of source videos, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into the text decoder of a single large language model (LLM) and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
