Abstract
Visual navigation with an image as the goal is a fundamental and challenging problem. Conventional methods rely either on end-to-end RL or on modular policies that use a topological graph or BEV map as memory, neither of which can fully model the geometric relationship between the explored 3D environment and the goal image. To localize the goal image in 3D space efficiently and accurately, we build our navigation system upon the renderable 3D Gaussian (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera poses, directly leveraging 3DGS for image localization during agent exploration is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive, using feed-forward monocular prediction. We then coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be cast as an efficient 3D convolution. When the agent is close to the goal, we finally solve for the fine target pose by optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform, using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.
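
To make the coarse localization step concrete, the following is a minimal sketch of discrete space matching written as a 3D cross-correlation. It is an illustration under strong assumptions, not the IGL-Nav implementation: the scene and the goal are reduced to small binary occupancy grids, only translation is searched (orientation is ignored), and the names `correlate3d`, `coarse_localize`, and `make_grid` are hypothetical.

```python
from itertools import product


def correlate3d(scene, template):
    """Slide `template` over `scene` (nested lists indexed [x][y][z]).

    Returns a dict mapping every valid offset (ox, oy, oz) to the
    sum of elementwise products, i.e. a 3D cross-correlation score.
    """
    sx, sy, sz = len(scene), len(scene[0]), len(scene[0][0])
    tx, ty, tz = len(template), len(template[0]), len(template[0][0])
    scores = {}
    for ox, oy, oz in product(range(sx - tx + 1),
                              range(sy - ty + 1),
                              range(sz - tz + 1)):
        score = 0
        for i, j, k in product(range(tx), range(ty), range(tz)):
            score += scene[ox + i][oy + j][oz + k] * template[i][j][k]
        scores[(ox, oy, oz)] = score
    return scores


def coarse_localize(scene, template):
    """Return the offset with the highest matching score."""
    scores = correlate3d(scene, template)
    return max(scores, key=scores.get)


def make_grid(size):
    """Create an empty size x size x size voxel grid."""
    return [[[0] * size for _ in range(size)] for _ in range(size)]


# Toy data (assumed for illustration): a 2x2x2 goal pattern
# embedded in a 6x6x6 scene at a known offset, plus clutter.
template = [[[1, 1], [1, 0]],
            [[0, 0], [0, 1]]]
scene = make_grid(6)
true_offset = (3, 1, 2)
for i, j, k in product(range(2), repeat=3):
    x, y, z = true_offset[0] + i, true_offset[1] + j, true_offset[2] + k
    scene[x][y][z] = template[i][j][k]
scene[0][0][0] = 1  # clutter voxel
scene[5][5][5] = 1  # clutter voxel

best = coarse_localize(scene, template)
print(best)
```

The exhaustive loop evaluates every candidate position, which is exactly what a 3D convolution layer computes in one batched pass; in practice one would expect learned feature channels and a GPU convolution in place of the nested Python loops.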