Abstract
We present a scalable approach for learning open-world object-goal navigation(ObjectNav) -- the task of asking a virtual robot (agent) to find any instanceof an object in an unexplored environment (e.g., "find a sink"). Our approachis entirely zero-shot -- i.e., it does not require ObjectNav rewards ordemonstrations of any kind. Instead, we train on the image-goal navigation(ImageNav) task, in which agents find the location where a picture (i.e., goalimage) was captured. Specifically, we encode goal images into a multimodal,semantic embedding space to enable training semantic-goal navigation(SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D).After training, SemanticNav agents can be instructed to find objects describedin free-form natural language (e.g., "sink", "bathroom sink", etc.) byprojecting language goals into the same multimodal, semantic embedding space.As a result, our approach enables open-world ObjectNav. We extensively evaluateour agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observeabsolute improvements in success of 4.2% - 20.0% over existing zero-shotmethods. For reference, these gains are similar or better than the 5%improvement in success between the Habitat 2020 and 2021 ObjectNav challengewinners. In an open-world setting, we discover that our agents can generalizeto compound instructions with a room explicitly mentioned (e.g., "Find akitchen sink") and when the target room can be inferred (e.g., "Find a sink anda stove").