Abstract
The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, we find that previous AVS methods rely heavily on detrimental segmentation preferences related to audible objects rather than on precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference