Abstract
Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals (e.g., "in between the sofa and TV") directly localized in the map, and (ii) be shared across different robot embodiments to generate tailored obstacle maps on demand. Building upon the capabilities above, AVLMaps extend VLMaps by introducing a unified 3D spatial representation integrating audio, visual, and language cues through the fusion of features from pretrained multimodal foundation models. This enables robots to ground multimodal goal queries (e.g., text, images, or audio snippets) to spatial locations for navigation. Additionally, the incorporation of diverse sensory inputs significantly enhances goal disambiguation in ambiguous environments. Experiments in simulation and real-world settings demonstrate that our multimodal spatial language maps enable zero-shot spatial and multimodal goal navigation and improve recall by 50% in ambiguous scenarios. These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
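To make the core idea concrete, the following is a minimal sketch of fusing per-point features into a spatial grid and then localizing an open-vocabulary query by similarity. It is an illustration under stated assumptions, not the implementation described in this work: the function names `build_feature_map` and `localize`, the top-down 2D grid, the feature averaging, and the cosine-similarity lookup are all hypothetical choices, and the embeddings are assumed to come from some pretrained multimodal encoder that is not shown here.

```python
import numpy as np


def build_feature_map(points_xy, features, grid_size, cell_size):
    """Average per-point embeddings into a top-down grid (hypothetical helper).

    points_xy: (N, 2) world coordinates in meters, assumed to be already
               back-projected from depth images and camera poses.
    features:  (N, D) embeddings, one per point; a stand-in for the output
               of a pretrained multimodal encoder.
    Returns the (grid_size, grid_size, D) map and a boolean occupancy mask.
    """
    dim = features.shape[1]
    feat_sum = np.zeros((grid_size, grid_size, dim))
    counts = np.zeros((grid_size, grid_size))

    # Convert metric coordinates to integer cell indices (x -> col, y -> row).
    cells = np.floor(points_xy / cell_size).astype(int)
    inside = np.all((cells >= 0) & (cells < grid_size), axis=1)
    rows, cols = cells[inside, 1], cells[inside, 0]

    # Unbuffered accumulation so repeated cell indices are all counted.
    np.add.at(feat_sum, (rows, cols), features[inside])
    np.add.at(counts, (rows, cols), 1)

    occupied = counts > 0
    feat_sum[occupied] /= counts[occupied][:, None]
    return feat_sum, occupied


def localize(feature_map, occupied, query):
    """Return the (row, col) of the occupied cell most similar to the query."""
    norms = np.linalg.norm(feature_map, axis=-1) * np.linalg.norm(query)
    sims = np.full(occupied.shape, -np.inf)  # empty cells can never win
    sims[occupied] = (feature_map[occupied] @ query) / np.maximum(
        norms[occupied], 1e-12
    )
    row, col = np.unravel_index(np.argmax(sims), sims.shape)
    return int(row), int(col)
```

In this sketch, a text, image, or audio query would first be embedded into the same feature space as the map, after which `localize` returns the grid cell that best matches it. Averaging is only one possible fusion rule, and a full system would operate on a 3D reconstruction rather than a flat grid.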