LanguageRefer: Spatial-Language Model for 3D Visual Grounding

Abstract

To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from the set of candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embeddings from bounding boxes with fine-tuned language embeddings from DistilBert and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on the visio-linguistic datasets proposed by ReferIt3D. We provide additional analysis of performance in spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances on accuracy, and view-point annotations for potential robotics applications.
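
The grounding task stated in the abstract (an utterance and a set of candidate boxes in, one candidate index out) can be sketched as follows. This is only an illustration of the input/output interface, not the paper's model: the names `Candidate`, `box_features`, `ground`, and `toy_score` are hypothetical, and `toy_score` is a hand-written stand-in for the learned transformer that would combine box and language embeddings. It also assumes a fixed viewpoint in which "left" means a smaller x coordinate.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    """One object proposal: a class label plus an axis-aligned 3D bounding box."""
    label: str
    center: Tuple[float, float, float]  # (x, y, z) of the box center
    size: Tuple[float, float, float]    # box extent along x, y, z


def box_features(c: Candidate) -> List[float]:
    """Flatten a box into a 6-number spatial feature: center followed by size."""
    return [*c.center, *c.size]


def ground(
    utterance: str,
    candidates: Sequence[Candidate],
    score: Callable[[str, Candidate], float],
) -> int:
    """Return the index of the candidate that scores highest for the utterance."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = [score(utterance, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_score(utterance: str, c: Candidate) -> float:
    """Hypothetical stand-in for a learned scorer (assumption, not the paper's model).

    Rewards a label mentioned in the utterance and, for "left", prefers
    candidates with a smaller x coordinate.
    """
    text = utterance.lower()
    s = 1.0 if c.label in text else 0.0
    if "left" in text:
        s -= 0.01 * c.center[0]
    return s


scene = [
    Candidate("chair", (2.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
    Candidate("table", (0.0, 0.0, 0.0), (1.5, 1.0, 0.8)),
    Candidate("chair", (-1.0, 0.0, 0.0), (0.5, 0.5, 1.0)),
]
target_index = ground("the chair on the left", scene, toy_score)
print(target_index, box_features(scene[target_index]))
```

In a learned system, `toy_score` would be replaced by a model that scores all candidates jointly, which is what allows relational phrases such as "next to" or "between" to be resolved; the view-dependence noted in the abstract corresponds to the fixed-viewpoint assumption made here.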
