FoundPose: Unseen Object Pose Estimation with Foundation Features

Abstract

We propose FoundPose, a method for 6D pose estimation of unseen rigid objects from a single RGB image. The method assumes that 3D models of the objects are available but does not require any object-specific training. This is achieved by building upon DINOv2, a recent vision foundation model with impressive generalization capabilities. An online pose estimation stage is supported by a minimal object representation that is built during a short onboarding stage from DINOv2 patch features extracted from rendered object templates. Given a query image with an object segmentation mask, FoundPose first rapidly retrieves a handful of similar-looking templates by a DINOv2-based bag-of-words approach. Pose hypotheses are then generated from 2D-3D correspondences established by matching DINOv2 patch features between the query image and a retrieved template, and finally optimized by featuremetric refinement. The method can handle diverse objects, including challenging ones with symmetries and without any texture, and noticeably outperforms existing RGB methods for coarse pose estimation in both accuracy and speed on the standard BOP benchmark. With the featuremetric and additional MegaPose refinement, which are shown to be complementary, the method outperforms all RGB competitors. Source code is at: evinpinar.github.io/foundpose.
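
The template retrieval step described above can be illustrated with a minimal sketch of generic bag-of-words retrieval. This is not the authors' implementation: the function names (assign_words, bow_descriptor, retrieve_templates), the hard nearest-word assignment, the L2-normalized histogram, and the cosine-similarity ranking are all assumptions made for illustration, and the patch features are assumed to be given as plain arrays rather than extracted by DINOv2.

```python
import numpy as np


def assign_words(features, codebook):
    """Return the index of the nearest visual word for each patch feature.

    features: (N, D) array of patch descriptors (hypothetical input).
    codebook: (K, D) array of visual words, e.g. k-means centroids.
    """
    # Squared Euclidean distance between every feature and every word.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def bow_descriptor(features, codebook):
    """Build an L2-normalized histogram of visual-word counts."""
    words = assign_words(features, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


def retrieve_templates(query_features, template_descriptors, codebook, k=5):
    """Rank templates by cosine similarity to the query descriptor.

    template_descriptors: (T, K) array with one bow_descriptor per template,
    assumed to be precomputed once per object.
    Returns the indices and scores of the k best-matching templates.
    """
    q = bow_descriptor(query_features, codebook)
    scores = template_descriptors @ q  # cosine similarity, rows are unit norm
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

In this sketch the per-template descriptors would be computed once and reused for every query, so each retrieval reduces to one matrix-vector product followed by a sort.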
