Abstract
Natural language interfaces to embodied AI are becoming increasingly common in our daily lives. This opens further opportunities for language-based interaction with embodied agents, such as a user instructing an agent to execute some task in a specific location, for example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval." It is closely related to "coarse-localization," but we instead search for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. We present Text2SceneGraphMatcher, a "scene-retrieval" pipeline that learns joint embeddings between text descriptions and scene graphs to determine whether they match. The code, trained models, and datasets will be made public.
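As a rough illustration of retrieval over a joint embedding space, the sketch below ranks a collection of disjoint scenes against one text query. It is not the Text2SceneGraphMatcher implementation: the abstract does not specify the encoders or the matching score, so cosine similarity is an assumption here, and the function names, scene identifiers, and embedding vectors are hypothetical placeholders standing in for the outputs of learned text and scene-graph encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_scenes(text_embedding, scene_embeddings, top_k=1):
    """Return the top_k (scene_id, score) pairs, best match first."""
    scored = [
        (scene_id, cosine_similarity(text_embedding, embedding))
        for scene_id, embedding in scene_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hand-written placeholder vectors, NOT the output of any trained model.
scene_embeddings = {
    "kitchen": [0.9, 0.1, 0.0],
    "street": [0.0, 0.2, 0.9],
    "office": [0.4, 0.8, 0.1],
}
query_embedding = [0.8, 0.2, 0.1]  # hypothetical embedding of a text description

ranked = retrieve_scenes(query_embedding, scene_embeddings, top_k=2)
for scene_id, score in ranked:
    print(f"{scene_id}: {score:.3f}")
```

With these toy vectors, the "kitchen" scene ranks first and "office" second. In a real pipeline the ranking quality would depend entirely on how well the learned encoders align descriptions with their scene graphs.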