RefMask3D: Language-Guided Transformer for 3D Referring Segmentation

Abstract

3D referring segmentation is an emerging and challenging vision-language task that aims to segment the object described by a natural language expression in a point cloud scene. The key challenge behind this task is vision-language feature fusion and alignment. In this work, we propose RefMask3D to explore comprehensive multi-modal feature interaction and understanding. First, we propose a Geometry-Enhanced Group-Word Attention to integrate language with geometrically coherent sub-clouds through cross-modal group-word attention, which effectively addresses the challenges posed by the sparse and irregular nature of point clouds. Then, we introduce a Linguistic Primitives Construction to produce semantic primitives representing distinct semantic attributes, which greatly enhances vision-language understanding at the decoding stage. Furthermore, we introduce an Object Cluster Module that analyzes the interrelationships among linguistic primitives to consolidate their insights and pinpoint common characteristics, helping to capture holistic information and enhance the precision of target identification. The proposed RefMask3D achieves new state-of-the-art performance on 3D referring segmentation, 3D visual grounding, and 2D referring image segmentation. In particular, RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset. Code is available at https://github.com/heshuting555/RefMask3D.
