Abstract
Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. A likely reason is that outdoor aerial views encompass vast areas, making data collection more challenging and resulting in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. Firstly, we integrate diverse rendering engines and advanced techniques for environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). In particular, 3D GS supports real-to-sim rendering, further enhancing the realism of our environments. Secondly, we develop a highly automated toolchain for aerial VLN data collection, streamlining point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Thirdly, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. Moreover, we propose OpenFly-Agent, a keyframe-aware VLN model emphasizing key observations during flight. For benchmarking, extensive experiments and analyses are conducted, evaluating several recent VLN methods and showcasing the superiority of our OpenFly platform and agent. The toolchain, dataset, and code will be open-sourced.