Abstract
We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the models' ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement against prior methods on three standard benchmarks, KITTI, nuScenes, and Argoverse 2, as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By not requiring fine-tuning or camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.