Multi-modal Discriminative Model for Vision-and-Language Navigation

Abstract

Vision-and-Language Navigation (VLN) is a natural language grounding task where agents have to interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals. Successful agents must be able to parse natural language of varying linguistic styles, ground it in potentially unfamiliar scenes, and plan and react under ambiguous environmental feedback. Generalization ability is limited by the amount of human-annotated data. In particular, paired vision-language sequence data is expensive to collect. We develop a discriminator that evaluates how well an instruction explains a given path in the VLN task using multi-modal alignment. Our study reveals that only a small fraction of the high-quality augmented data from Fried et al. (2018), as scored by our discriminator, is useful for training VLN agents with similar performance on previously unseen environments. We also show that a VLN agent warm-started with pre-trained components from the discriminator outperforms the benchmark success rate of 35.5 by 10% in relative terms on previously unseen environments.
