Abstract
Recent research efforts enable the study of natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog. However, existing methods tend to overfit the training data in seen environments and fail to generalize well to previously unseen environments. To close the gap between seen and unseen environments, we aim to learn a generalized navigation model from two novel perspectives: (1) we introduce a multitask navigation model that can be seamlessly trained on both Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks, which benefits from richer natural language guidance and effectively transfers knowledge across tasks; (2) we propose to learn environment-agnostic representations for the navigation policy that are invariant among the environments seen during training, thus generalizing better to unseen environments. Extensive experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments, and that the navigation agent trained in this way outperforms baselines on unseen environments by 16% (relative measure on success rate) on VLN and 120% (goal progress) on NDH. Our submission to the CVDN leaderboard establishes a new state-of-the-art for the NDH task on the holdout test set. Code is available at https://github.com/google-research/valan.