Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule

Abstract

Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative and generative. In this paper, we design and investigate a generative language-grounded policy which computes the distribution over all possible instructions given the action and the transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach on the Room-2-Room (R2R) dataset, especially in unseen environments. We further show that the combination of the generative and discriminative policies achieves close to state-of-the-art results on the R2R dataset, demonstrating that the generative and discriminative policies capture different aspects of VLN.
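
As a rough illustration of how a generative policy can be turned into an action choice with Bayes' rule, the sketch below scores each candidate action by the log-likelihood of the instruction under that action and normalizes the result into a posterior over actions. The function name `action_posterior`, the uniform default prior, and the example log-likelihood values are assumptions made for illustration only; they are not taken from the paper, and in practice the log-likelihoods would come from a learned instruction-scoring model.

```python
import math


def action_posterior(log_likelihoods, log_prior=None):
    """Apply Bayes' rule: p(a | h, X) is proportional to p(X | a, h) * p(a | h).

    log_likelihoods[i] is log p(X | a_i, h) for candidate action a_i.
    log_prior[i] is log p(a_i | h); a uniform prior is assumed if omitted.
    """
    n = len(log_likelihoods)
    if log_prior is None:
        # Assumption for this sketch: every candidate action is equally likely a priori.
        log_prior = [-math.log(n)] * n
    scores = [ll + lp for ll, lp in zip(log_likelihoods, log_prior)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical instruction log-likelihoods for three candidate actions.
posterior = action_posterior([-12.0, -10.0, -15.0])
best_action = max(range(len(posterior)), key=posterior.__getitem__)
```

With a uniform prior, the selected action is simply the one under which the instruction is most likely; a non-uniform `log_prior` would shift that choice.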
