CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video

Abstract

The prevalence of user-generated content (UGC) on platforms such as YouTube and TikTok has rendered no-reference (NR) perceptual video quality assessment (VQA) vital for optimizing video delivery. Nonetheless, the characteristics of non-professional acquisition and the subsequent transcoding of UGC video on sharing platforms present significant challenges for NR-VQA. Although NR-VQA models attempt to infer mean opinion scores (MOS), their modeling of subjective scores for compressed content remains limited due to the absence of fine-grained perceptual annotations of artifact types. To address these challenges, we propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large vision-language models. Our approach introduces a quality-aware prompting mechanism that integrates video metadata (e.g., resolution, frame rate, bitrate) with key fragments extracted from inter-frame variations to guide the BLIP-2 pretraining approach in generating fine-grained quality captions. A unified architecture has been designed to model perceptual quality across three dimensions: semantic alignment, temporal characteristics, and spatial characteristics. These multimodal features are extracted and fused, then regressed to video quality scores. Extensive experiments on a wide variety of UGC datasets demonstrate that our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations. Our method achieves the best performance in terms of average rank and linear correlation (SRCC: 0.928, PLCC: 0.938) compared to state-of-the-art methods. The source code and trained models, along with a user-friendly demo, are available at: https://github.com/xinyiW915/CAMP-VQA.
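The two reported metrics, SRCC (Spearman rank-order correlation) and PLCC (Pearson linear correlation), compare predicted quality scores against MOS. The following is a minimal, standard-library sketch of how such metrics can be computed; it is not taken from the CAMP-VQA repository, and the function names (`pearson`, `average_ranks`, `spearman`) and the sample scores are illustrative assumptions. VQA evaluations often apply a nonlinear logistic mapping to predictions before computing PLCC; that step is omitted here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation (PLCC) between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical MOS labels and model predictions, for illustration only.
    mos = [2.1, 3.4, 4.0, 1.5, 3.9]
    pred = [2.3, 3.1, 4.2, 1.2, 3.8]
    print("SRCC:", round(spearman(mos, pred), 3))
    print("PLCC:", round(pearson(mos, pred), 3))
```

SRCC measures how well the predicted ordering of videos matches the subjective ordering, while PLCC measures how closely the predicted values follow the MOS values linearly, which is why the two are usually reported together.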
