Abstract
We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test time to a novel object without fine-tuning, as long as its CAD model is given or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves comparable results to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/