We present Vision-based Navigation with Language-based Assistance (VNLA), a grounded vision-language task where an agent with visual perception is guided via language to find objects in photorealistic indoor environments. The task emulates a real-world scenario in that (a) the requester may not know how to navigate to the target objects and thus makes requests by only specifying high-level end-goals, and (b) the agent is capable of sensing when it is lost and querying an advisor, who is more qualified at the task, to obtain language subgoals to make progress. To model language-based assistance, we develop a general framework termed Imitation Learning with Indirect Intervention (I3L), and propose a solution that is effective on the VNLA task. Empirical results show that this approach significantly improves the success rate of the learning agent over other baselines in both seen and unseen environments. Our code and data are publicly available at https://github.com/debadeepta/vnla.