Abstract
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so that it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world. VLMaps can be autonomously built from video feed on robots using standard exploration approaches and enable natural language indexing of the map without additional labeled data. Specifically, when combined with large language models (LLMs), VLMaps can be (i) used to translate natural language commands into a sequence of open-vocabulary navigation goals (which, beyond prior work, can be spatial by construction, e.g., "in between the sofa and TV" or "three meters to the right of the chair") directly localized in the map, and (ii) shared among multiple robots with different embodiments to generate new obstacle maps on-the-fly (by using a list of obstacle categories). Extensive experiments carried out in simulated and real-world environments show that VLMaps enable navigation according to more complex language instructions than existing methods. Videos are available at https://vlmaps.github.io.
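
To make the idea of natural language indexing concrete, the following is a minimal sketch, not the authors' implementation. It assumes the map is a 2D grid whose cells each store a feature vector, and that each category name has a text embedding in the same space. The tiny hand-written vectors stand in for pretrained visual-language features, and the function names `index_map` and `obstacle_map` and the category lists are hypothetical. Each cell is labeled with the category of highest cosine similarity, and an embodiment-specific obstacle map is then derived from a list of obstacle categories.

```python
import numpy as np


def index_map(cell_features, text_embeddings):
    """Label each grid cell with the index of the most similar text embedding.

    cell_features: (H, W, D) array, one feature vector per map cell.
    text_embeddings: (C, D) array, one embedding per category name.
    Returns an (H, W) integer array of category indices.
    """
    # Normalize both sides so the dot product equals cosine similarity.
    f = cell_features / np.linalg.norm(cell_features, axis=-1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
    scores = f @ t.T  # shape (H, W, C)
    return scores.argmax(axis=-1)


def obstacle_map(labels, categories, obstacle_categories):
    """Boolean (H, W) mask of cells whose label is in obstacle_categories."""
    ids = [categories.index(name) for name in obstacle_categories]
    return np.isin(labels, ids)


# Toy data (assumed for illustration only): three categories with
# orthogonal "text embeddings" and a 2x2 map of noisy cell features.
categories = ["floor", "sofa", "table"]
text_embeddings = np.eye(3)
cell_features = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.2, 0.1, 0.9], [1.0, 0.0, 0.1]],
])

labels = index_map(cell_features, text_embeddings)

# Two hypothetical embodiments sharing the same map but using
# different obstacle category lists.
ground_obstacles = obstacle_map(labels, categories, ["sofa", "table"])
flying_obstacles = obstacle_map(labels, categories, ["table"])
```

Because the obstacle mask is computed from the shared label grid at query time, changing the obstacle category list is enough to produce a different map for a different robot, with no relabeling of the underlying features.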