Abstract
Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.
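The depth-to-point-cloud step mentioned above can be illustrated with a minimal sketch of standard pinhole back-projection. This is not taken from the MonoDiff9D code: the function name `depth_to_point_cloud` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are hypothetical, and the abstract does not say how the authors handle intrinsics, depth scale, or object masking.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); these names are illustrative, not from the paper.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels with no valid (non-positive) depth.
    return points[points[:, 2] > 0]
```

With a coarse, zero-shot depth estimate the resulting points are only approximately metric, which is consistent with the abstract's description of the depth as coarse.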