REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

Abstract

One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language. Overcoming this challenge requires the ability to perform a wide variety of complex tasks in response to multifarious instructions from humans. In the hope that it might drive progress towards more flexible and powerful human interactions with robots, we propose a dataset of varied and complex robot tasks, described in natural language, in terms of objects visible in a large set of real images. Given an instruction, success requires navigating through a previously-unseen environment to identify an object. This represents a practical challenge, but one that closely reflects one of the core visual problems in robotics. Several state-of-the-art vision-and-language navigation, and referring-expression models are tested to verify the difficulty of this new task, but none of them show promising results because there are many fundamental differences between our task and previous ones. A novel Interactive Navigator-Pointer model is also proposed that provides a strong baseline on the task. The proposed model especially achieves the best performance on the unseen test split, but still leaves substantial room for improvement compared to the human performance.
