3D-GRES: Generalized 3D Referring Expression Segmentation

Abstract

3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension. The benchmark and code are available at https://github.com/sosppxo/MDIN.
