UAV-VLN: End-to-End Vision Language guided Navigation for UAVs

Abstract

A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments based on natural language commands. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation (VLN) framework for Unmanned Aerial Vehicles (UAVs) that seamlessly integrates Large Language Models (LLMs) with visual perception to facilitate human-interactive navigation. Our system interprets free-form natural language instructions, grounds them into visual observations, and plans feasible aerial trajectories in diverse environments. UAV-VLN leverages the common-sense reasoning capabilities of LLMs to parse high-level semantic goals, while a vision model detects and localizes semantically relevant objects in the environment. By fusing these modalities, the UAV can reason about spatial relationships, disambiguate references in human instructions, and plan context-aware behaviors with minimal task-specific supervision. To ensure robust and interpretable decision-making, the framework includes a cross-modal grounding mechanism that aligns linguistic intent with visual context. We evaluate UAV-VLN across diverse indoor and outdoor navigation scenarios, demonstrating its ability to generalize to novel instructions and environments with minimal task-specific training. Our results show significant improvements in instruction-following accuracy and trajectory efficiency, highlighting the potential of LLM-driven vision-language interfaces for safe, intuitive, and generalizable UAV autonomy.
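
The flow described in the abstract (parse an instruction into semantic goals, detect objects, ground each goal in the detections, then plan) can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`Detection`, `SubGoal`, `parse_instruction`, `ground`, `plan_waypoints`) is hypothetical. The LLM parser is replaced by a trivial rule-based stand-in, and the vision model's output is assumed to arrive as a list of labeled detections with estimated 3D positions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """Assumed output format of a vision model (hypothetical)."""
    label: str                             # object class name
    confidence: float                      # detector score in [0, 1]
    position: Tuple[float, float, float]   # estimated (x, y, z) in metres


@dataclass
class SubGoal:
    """One step extracted from the instruction (hypothetical)."""
    action: str   # only "fly_to" is used in this sketch
    target: str   # object word taken from the instruction


def parse_instruction(instruction: str) -> List[SubGoal]:
    """Rule-based stand-in for the LLM parser.

    Splits the instruction on ' then ' and treats the last word of each
    clause as the target object. A real system would query an LLM here.
    """
    goals = []
    for clause in instruction.lower().split(" then "):
        words = clause.strip(" .").split()
        if words:
            goals.append(SubGoal(action="fly_to", target=words[-1]))
    return goals


def ground(goal: SubGoal, detections: List[Detection],
           min_confidence: float = 0.5) -> Optional[Detection]:
    """Toy grounding: pick the most confident detection whose label
    exactly matches the goal's target word, or None if nothing matches."""
    candidates = [d for d in detections
                  if d.label == goal.target and d.confidence >= min_confidence]
    return max(candidates, key=lambda d: d.confidence, default=None)


def plan_waypoints(instruction: str,
                   detections: List[Detection]) -> List[Tuple[float, float, float]]:
    """Turn each grounded sub-goal into a waypoint; ungrounded goals are skipped."""
    waypoints = []
    for goal in parse_instruction(instruction):
        match = ground(goal, detections)
        if match is not None:
            waypoints.append(match.position)
    return waypoints


if __name__ == "__main__":
    scene = [
        Detection("tree", 0.9, (5.0, 0.0, 2.0)),
        Detection("car", 0.4, (1.0, 1.0, 0.0)),
        Detection("car", 0.8, (10.0, 3.0, 1.0)),
    ]
    print(plan_waypoints("Fly to the tree then go to the car.", scene))
```

Exact label matching and a fixed confidence threshold are deliberate simplifications. The abstract's cross-modal grounding mechanism presumably handles richer references (spatial relations, ambiguous phrases) that this sketch does not attempt.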
