To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from the set of candidates. Our spatial-language model uses a transformer-based architecture that combines a spatial embedding from the bounding boxes with a fine-tuned language embedding from DistilBERT, and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on the visio-linguistic datasets proposed by ReferIt3D. We also analyze performance on spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances on accuracy, and viewpoint annotations for potential robotics applications.
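
The input/output interface of the grounding task described above can be sketched in a few lines of Python. This is only an illustration of the problem setup, not the model from the paper: the names `Candidate`, `ground`, and `toy_score` are hypothetical, the boxes are assumed to be axis-aligned with a center and a size, and the scoring function is a hand-written stand-in for the learned transformer that would fuse bounding-box and DistilBERT embeddings.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """One object proposal (assumed axis-aligned 3D box, in scene units)."""
    center: Sequence[float]  # (x, y, z)
    size: Sequence[float]    # (dx, dy, dz)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_score(utterance: str, cand: Candidate, all_cands: List[Candidate]) -> float:
    """Hypothetical stand-in for the learned model.

    It only reacts to the words 'largest' and 'smallest', scoring each
    candidate by its box volume. A real model would instead score each
    candidate from joint spatial and language embeddings.
    """
    volume = cand.size[0] * cand.size[1] * cand.size[2]
    if "largest" in utterance:
        return volume
    if "smallest" in utterance:
        return -volume
    return 0.0


ScoreFn = Callable[[str, Candidate, List[Candidate]], float]


def ground(utterance: str, candidates: List[Candidate],
           score_fn: ScoreFn = toy_score) -> int:
    """Return the index of the candidate the utterance most likely refers to."""
    scores = [score_fn(utterance, c, candidates) for c in candidates]
    probs = softmax(scores)  # distribution over candidates
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    scene = [
        Candidate(center=(0.0, 0.0, 0.0), size=(1.0, 1.0, 1.0)),
        Candidate(center=(2.0, 0.0, 0.0), size=(2.0, 2.0, 2.0)),
        Candidate(center=(0.0, 3.0, 0.0), size=(0.5, 0.5, 0.5)),
    ]
    print(ground("the largest box", scene))
    print(ground("the smallest box", scene))
```

Because each candidate is described only by its box, this setup keeps the selection step independent of point-cloud perception, which is the sense in which spatial reasoning can be evaluated separately from perception noise.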