Abstract
Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories, while still providing a high-level interface to the user. Instead of utilizing a labeled instruction following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex, outdoor environments from natural language instructions. For videos of our experiments, code release, and an interactive Colab notebook that runs in your browser, please check out our project page https://sites.google.com/view/lmnav