AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Abstract

We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
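The accuracy figures quoted above count the fraction of image pairs whose camera rotation error falls within 5 degrees. The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of one common way such a metric is computed: the geodesic angle between a predicted and a ground-truth rotation matrix, followed by thresholding. The function names (`rotation_error_deg`, `accuracy_at_threshold`, `rot_z`) are hypothetical and chosen for illustration; they are not taken from the paper or from DUSt3R.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices.

    Assumption: this is the standard angular distance on SO(3); the
    paper's exact protocol is not given in the abstract.
    """
    R_rel = R_pred.T @ R_gt
    # For a rotation by angle theta, trace(R) = 1 + 2 * cos(theta).
    cos_angle = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    cos_angle = np.clip(cos_angle, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


def accuracy_at_threshold(errors_deg, threshold_deg=5.0):
    """Fraction of pairs whose rotation error is below the threshold."""
    errors = np.asarray(errors_deg, dtype=float)
    return float(np.mean(errors < threshold_deg))


def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


if __name__ == "__main__":
    # Toy example: four hypothetical pairs with known angular offsets.
    R_gt = rot_z(0.0)
    errors = [rotation_error_deg(rot_z(a), R_gt) for a in (3.0, 10.0, 2.0, 40.0)]
    print(accuracy_at_threshold(errors, threshold_deg=5.0))
```

In a real evaluation the inputs would be the estimated and reference relative rotations for each aerial-ground pair; the toy rotations about the z-axis are used here only so the errors are known in advance.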
