Abstract
Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations, or imaginations, we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues, and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of around 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at https://www.akhilperincherry.com/VLN-Imagine-website/.