video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Abstract

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through direct preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while remaining highly competitive with the state of the art on widely used video question-answering benchmarks among models of similar size. The code is available at https://github.com/bytedance/video-SALMONN-2.
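
To make the training recipe more concrete, the following is a minimal sketch of two ingredients named in the abstract: the standard DPO loss for a single preference pair, and merging a LoRA update into the base weight before re-initialising it at the end of a round. It is an illustration under stated assumptions, not the authors' implementation. The names `dpo_loss`, `LoRALinear`, and `merge_and_reinit`, the value of `beta`, the LoRA scaling `alpha / rank`, and the initialisation scheme are all hypothetical, and the ground-truth caption guidance is not shown.

```python
import math
import numpy as np


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    captions under the policy and the (frozen) reference model.
    beta=0.1 is an assumed value, not taken from the paper.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


class LoRALinear:
    """Toy linear layer with a low-rank update: W + (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        self.weight = np.asarray(weight, dtype=np.float64)
        self.rank, self.alpha = rank, alpha
        self.rng = np.random.default_rng(seed)
        self.reset_lora()

    def reset_lora(self):
        # Common LoRA initialisation (assumed here): small random A, zero B,
        # so a freshly initialised adapter contributes nothing.
        out_dim, in_dim = self.weight.shape
        self.A = self.rng.normal(0.0, 0.01, size=(self.rank, in_dim))
        self.B = np.zeros((out_dim, self.rank))

    def effective_weight(self):
        return self.weight + (self.alpha / self.rank) * self.B @ self.A

    def merge_and_reinit(self):
        # End of a round: fold the adapter into the base weight, then start
        # a fresh adapter. The merged model can serve as the next reference.
        self.weight = self.effective_weight()
        self.reset_lora()


layer = LoRALinear(np.eye(3), rank=2)
layer.B += 0.5                      # stand-in for one round of training
before = layer.effective_weight()
layer.merge_and_reinit()
print(np.allclose(before, layer.effective_weight()))
```

In this sketch, merging leaves the layer's effective weight unchanged while resetting the trainable adapter, which is one plausible reading of using LoRA merging "as a proxy for parameter updates" between rounds.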
