Abstract
3D visual grounding aims to locate the target object mentioned by free-form natural language descriptions in 3D point cloud scenes. Most previous work requires the encoder-decoder to simultaneously align the attribute information of the target object and its relational information with the surrounding environment across modalities. This causes the queries' attention to be dispersed, potentially leading to an excessive focus on points irrelevant to the input language descriptions. To alleviate these issues, we propose PD-TPE, a visual-language model with a double-branch decoder. The two branches perform proposal feature decoding and surrounding layout awareness in parallel. Since their attention maps are not influenced by each other, the queries focus on tokens relevant to each branch's specific objective. In particular, we design a novel Text-guided Position Encoding method, which differs between the two branches. In the main branch, the prior relies on the relative positions between tokens and predicted 3D boxes, which directs the model to pay more attention to tokens near the object; in the surrounding branch, it is guided by the similarity between visual and text features, so that the queries attend to tokens that can provide effective layout information. Extensive experiments demonstrate that we surpass the state-of-the-art on two widely adopted 3D visual grounding datasets, ScanRefer and NR3D, by 1.8% and 2.2%, respectively. Code will be made publicly available.
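
As a rough illustration of the two branch-specific priors described above, the following NumPy sketch treats each prior as an additive bias on the attention logits. This is an assumption made for illustration only: the abstract does not specify how the prior enters the attention computation, and all function names, the `scale` parameter, and the use of box centers and cosine similarity are hypothetical choices rather than the PD-TPE implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def main_branch_bias(token_xyz, box_centers, scale=1.0):
    """Hypothetical main-branch prior, shape (Q, N).

    Tokens closer to a query's predicted box center get a larger
    (less negative) bias. Using the box center and a linear distance
    penalty is an assumption, not the paper's formulation.
    """
    diff = box_centers[:, None, :] - token_xyz[None, :, :]
    return -scale * np.linalg.norm(diff, axis=-1)


def surrounding_branch_bias(token_feats, text_feat):
    """Hypothetical surrounding-branch prior, shape (N,).

    Cosine similarity between each visual token feature and a single
    pooled text feature (the pooling is assumed to happen elsewhere).
    """
    t = text_feat / np.linalg.norm(text_feat)
    v = token_feats / np.linalg.norm(token_feats, axis=-1, keepdims=True)
    return v @ t


def biased_attention(queries, keys, bias):
    """Scaled dot-product attention weights with an additive prior."""
    logits = queries @ keys.T / np.sqrt(queries.shape[-1]) + bias
    return softmax(logits, axis=-1)
```

Because each branch computes its own bias and its own attention map, the two sets of weights can be evaluated independently, which mirrors the parallel, non-interfering design the abstract describes.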