Speaker-Follower Models for Vision-and-Language Navigation

Abstract

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
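As a rough illustration of the pragmatic reasoning step, the sketch below reranks candidate action sequences by how well each one explains the instruction under a speaker model, balanced against the follower's own preference. The abstract does not specify how the two scores are combined; the weighted sum of log-probabilities, the `speaker_weight` parameter, the `Candidate` type, and the toy scores are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """A candidate route with scores from two (hypothetical) models."""
    actions: List[str]     # illustrative action labels
    follower_logp: float   # log P_follower(actions | instruction)
    speaker_logp: float    # log P_speaker(instruction | actions)


def pragmatic_score(c: Candidate, speaker_weight: float) -> float:
    # Assumed combination rule: interpolate the two log-probabilities.
    return speaker_weight * c.speaker_logp + (1.0 - speaker_weight) * c.follower_logp


def rerank(candidates: Sequence[Candidate], speaker_weight: float = 0.5) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if not 0.0 <= speaker_weight <= 1.0:
        raise ValueError("speaker_weight must lie in [0, 1]")
    return max(candidates, key=lambda c: pragmatic_score(c, speaker_weight))


# Toy scores, made up for this example.
candidates = [
    Candidate(["forward", "left", "stop"], follower_logp=-1.0, speaker_logp=-6.0),
    Candidate(["forward", "right", "stop"], follower_logp=-1.5, speaker_logp=-2.0),
]
best = rerank(candidates, speaker_weight=0.5)
print(best.actions)
```

In this toy case the follower alone would pick the first route, but the speaker considers the instruction much more likely given the second route, so the combined score selects the second. Setting `speaker_weight` to 0 recovers the follower-only choice.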
