EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning

Abstract

3D visual grounding aims to find the objects within point clouds mentioned by free-form natural language descriptions with rich semantic components. However, existing methods either extract the sentence-level features coupling all words, or focus more on object names, which would lose the word-level information or neglect other attributes. To alleviate this issue, we present EDA that Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module to produce textual features for every semantic component. Then, we design two losses to supervise the dense matching between two modalities: the textual position alignment and object semantic alignment. On top of that, we further introduce two new visual grounding tasks, locating objects without object names and locating auxiliary objects referenced in the descriptions, both of which can thoroughly evaluate the model's dense alignment capacity. Through experiments, we achieve state-of-the-art performance on two widely-adopted visual grounding datasets, ScanRefer and SR3D/NR3D, and obtain absolute leadership on our two newly-proposed tasks. The code will be available at https://github.com/yanmin-wu/EDA.
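
To make the idea of dense alignment concrete, the following is a minimal illustrative sketch, not the EDA implementation. It assumes a sentence has already been decoupled into a few components, each with its own feature vector, and scores every component against every candidate object. The component names (`main_object`, `attribute`, `auxiliary_object`), the hand-made feature vectors, and the use of cosine similarity are all assumptions made for illustration; the abstract does not specify the components, and the method itself supervises the matching with the two alignment losses named above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_alignment(text_feats, object_feats):
    """Score every decoupled text component against every candidate object.

    text_feats: dict mapping a (hypothetical) component name to its vector.
    object_feats: list of candidate object vectors.
    Returns a dict mapping component name to a list of per-object scores.
    """
    return {
        name: [cosine(t, o) for o in object_feats]
        for name, t in text_feats.items()
    }


def best_match(scores):
    """Index of the highest-scoring object for each component."""
    return {
        name: max(range(len(s)), key=s.__getitem__)
        for name, s in scores.items()
    }


# Toy, hand-made features (assumed for illustration only).
text_feats = {
    "main_object": [1.0, 0.0, 0.0],
    "attribute": [0.9, 0.1, 0.0],
    "auxiliary_object": [0.0, 1.0, 0.0],
}
object_feats = [
    [0.0, 0.9, 0.1],  # candidate 0
    [1.0, 0.1, 0.0],  # candidate 1
    [0.0, 0.0, 1.0],  # candidate 2
]

scores = dense_alignment(text_feats, object_feats)
matches = best_match(scores)
print(matches)
```

In this toy setup the main-object and attribute components both select candidate 1, while the auxiliary-object component selects candidate 0. This mirrors, in a very simplified way, why per-component matching can locate an auxiliary object or ground a target without relying on a single sentence-level feature.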
