Object-and-Action Aware Model for Visual Language Navigation

Abstract

Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions, on the basis of the visible environment. This requires extracting value from two very different types of natural-language information. The first is object description (e.g., 'table', 'door'), each serving as a cue for the agent to determine the next action by finding the item visible in the environment, and the second is action specification (e.g., 'go straight', 'turn left'), which allows the robot to directly predict the next movements without relying on visual perception. However, most existing methods pay little attention to distinguishing these two kinds of information during instruction encoding, and they mix together the matching between textual object/action encodings and the visual perception/orientation features of candidate viewpoints. In this paper, we propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural-language instruction separately. This enables each process to flexibly match object-centered or action-centered instructions to its own counterpart, visual perception or action orientation respectively. However, one side issue caused by the above solution is that an object mentioned in the instructions may be observed in the direction of two or more candidate viewpoints, so the OAAM may not predict the viewpoint on the shortest path as the next action. To handle this problem, we design a simple but effective path loss to penalize trajectories deviating from the ground-truth path. Experimental results demonstrate the effectiveness of the proposed model and path loss, and the superiority of their combination, with a 50% SPL score on the R2R dataset and a 40% CLS score on the R4R dataset in unseen environments, outperforming the previous state of the art.
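To make the separation concrete, the following is a minimal illustrative sketch, not the OAAM implementation. It assumes that object-centered text is scored against each candidate's visual feature, that action-centered text is scored against each candidate's orientation feature, and that the two scores are blended by a fixed weight. The function names, the dot-product scoring, the `obj_weight` parameter, and the expected-distance form of the path penalty are all hypothetical choices made for illustration; the abstract does not specify the actual formulation.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_probs(obj_text, act_text, visual_feats, orient_feats, obj_weight=0.5):
    """Score candidate viewpoints with two separate branches (hypothetical).

    obj_text / act_text: object- and action-centered instruction encodings.
    visual_feats / orient_feats: per-candidate visual and orientation features.
    obj_weight: assumed fixed blend weight between the two branches.
    """
    obj_scores = [dot(obj_text, v) for v in visual_feats]
    act_scores = [dot(act_text, o) for o in orient_feats]
    combined = [
        obj_weight * o + (1.0 - obj_weight) * a
        for o, a in zip(obj_scores, act_scores)
    ]
    return softmax(combined)


def path_penalty(probs, dist_to_gt_path):
    """Assumed penalty: expected distance from the ground-truth path."""
    return sum(p * d for p, d in zip(probs, dist_to_gt_path))


# Two candidates that both show the mentioned object (identical visual
# features) but lie in opposite directions.
obj_text, act_text = [1.0, 0.0], [0.0, 1.0]
visual_feats = [[1.0, 0.0], [1.0, 0.0]]
orient_feats = [[0.0, 1.0], [0.0, -1.0]]

# Object branch alone cannot separate the two candidates.
object_only = candidate_probs(obj_text, act_text, visual_feats, orient_feats, obj_weight=1.0)
# Blending in the action branch breaks the tie.
blended = candidate_probs(obj_text, act_text, visual_feats, orient_feats, obj_weight=0.5)
# Candidate 0 is on the ground-truth path; candidate 1 is 3 units away.
penalty = path_penalty(object_only, [0.0, 3.0])
```

In this toy setting the object branch alone assigns both candidates equal probability, which mirrors the ambiguity described above, and a penalty that grows with distance from the ground-truth path gives the training signal a way to prefer the on-path candidate.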
