Abstract
We present FAST NAVIGATOR, a general framework for action decoding, which yields state-of-the-art results on the recent Room-to-Room (R2R) Vision-and-Language navigation challenge of Anderson et al. (2018). Given a natural language instruction and photo-realistic image views of a previously unseen environment, the agent must navigate from a source to a target location as quickly as possible. While all current approaches either make local action decisions or score entire trajectories using beam search, our framework seamlessly balances local and global signals when exploring the environment. Importantly, this allows us to act greedily but use global signals to backtrack when necessary. Our FAST framework, applied to existing models, yields a 17% relative gain over the previous state of the art, an absolute 6% gain on success rate weighted by path length (SPL).