Abstract
Audio-visual video segmentation~(AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and to ensure that the maps faithfully adhere to the given audio, for example by identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block that, when stacked, captures fine-grained audio-visual combinatorial dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring that the decoded mask adheres to the sounds. Experimental results confirm the effectiveness of our approach, with our framework achieving new state-of-the-art performance on all three datasets using two backbones. The code is available at \url{https://github.com/aspirinone/CATR.github.io}.