AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation

Abstract

Reference Audio-Visual Segmentation (Ref-AVS) tasks challenge models to precisely locate sounding objects by integrating visual, auditory, and textual cues. Existing methods often lack genuine semantic understanding, tending to memorize fixed reasoning patterns. Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. We employ a structured Chain-of-Thought (CoT) prompting mechanism to guide the model through a step-by-step reasoning process and introduce a novel segmentation feature distillation loss to effectively integrate these reasoning abilities without sacrificing segmentation performance. To further cultivate the model's genuine reasoning capabilities, we devise a two-stage training strategy: first, a "corrective reflective-style training" stage uses self-correction to improve the quality of reasoning paths; then, reinforcement learning via Group Reward Policy Optimization (GRPO) bolsters robustness in challenging scenarios. Experiments demonstrate that AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
