Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding

Abstract

Vision-and-language navigation (VLN) is a long-standing challenge in autonomous robotics, aiming to empower agents with the ability to follow human instructions while navigating complex environments. Two key bottlenecks remain in this field: generalization to out-of-distribution environments and reliance on fixed discrete action spaces. To address these challenges, we propose Vision-Language Fly (VLFly), a framework tailored for Unmanned Aerial Vehicles (UAVs) to execute language-guided flight. Without requiring localization or active ranging sensors, VLFly outputs continuous velocity commands purely from egocentric observations captured by an onboard monocular camera. VLFly integrates three modules: an instruction encoder based on a large language model (LLM) that reformulates high-level language into structured prompts, a goal retriever powered by a vision-language model (VLM) that matches these prompts to goal images via vision-language similarity, and a waypoint planner that generates executable trajectories for real-time UAV control. VLFly is evaluated across diverse simulation environments without additional fine-tuning and consistently outperforms all baselines. Moreover, real-world VLN tasks in indoor and outdoor environments under direct and indirect instructions demonstrate that VLFly achieves robust open-vocabulary goal understanding and generalized navigation capabilities, even in the presence of abstract language input.
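
The goal-retrieval step described above, matching a structured prompt to a goal image by vision-language similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the prompt and each candidate image have already been embedded into a shared vector space, and it assumes cosine similarity as the matching score. The function names (`cosine_similarity`, `retrieve_goal`) and the toy 3-D embeddings are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def retrieve_goal(
    prompt_embedding: Sequence[float],
    image_embeddings: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return (name, score) of the candidate image most similar to the prompt."""
    if not image_embeddings:
        raise ValueError("no candidate goal images")
    scores = {
        name: cosine_similarity(prompt_embedding, emb)
        for name, emb in image_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings, made up for illustration; a real system would obtain
# these from a vision-language model's text and image encoders.
candidates = {
    "backpack": [0.9, 0.1, 0.0],
    "traffic_cone": [0.1, 0.9, 0.1],
    "chair": [0.0, 0.2, 0.9],
}
prompt = [0.8, 0.2, 0.1]  # hypothetical embedding of a structured prompt

best_name, best_score = retrieve_goal(prompt, candidates)
print(best_name, round(best_score, 3))
```

In this sketch the retrieved goal is simply the highest-scoring candidate; how the selected goal image is then passed to the waypoint planner is not specified in the abstract and is left out here.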
