BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Abstract

Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.
