Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Abstract

Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and in deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses a multimodal large language model (MLLM) to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
