Abstract
Contemporary monocular 6D pose estimation methods can only cope with a handful of object instances. This naturally limits possible applications as, for instance, robots need to work with hundreds of different objects in a real environment. In this paper, we propose the first deep learning approach for class-wise monocular 6D pose estimation, coupled with metric shape retrieval. We propose a new loss formulation which directly optimizes over all parameters, i.e. 3D orientation, translation, scale and shape, at the same time. Instead of decoupling each parameter, we transform the regressed shape, in the form of a point cloud, to 3D and directly measure its metric misalignment. We experimentally demonstrate that we can retrieve precise metric point clouds from a single image, which can be further processed, e.g. for subsequent rendering. Moreover, we show that our new 3D point cloud loss outperforms all baselines and gives overall good results despite the inherent ambiguity due to monocular data.
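
As a rough illustration of measuring misalignment directly on the transformed point cloud, rather than penalizing each parameter separately, the following NumPy sketch applies scale, rotation and translation to a canonical point set and compares the result with the ground truth. The function names, the dictionary layout, the mean Euclidean distance and the one-to-one point correspondence are assumptions made for this sketch; the abstract does not specify the exact formulation.

```python
import numpy as np


def transform_points(points, rotation, translation, scale):
    """Map canonical points of shape (N, 3) to camera space: s * R @ p + t."""
    # Right-multiplying by R^T applies R to every row vector.
    return scale * points @ rotation.T + translation


def point_cloud_alignment_loss(pred, target):
    """Mean Euclidean distance between corresponding transformed points.

    `pred` and `target` are dicts with the (hypothetical) keys
    'points', 'rotation', 'translation' and 'scale'. Points are assumed
    to be in one-to-one correspondence, which is an assumption of this sketch.
    """
    p = transform_points(pred["points"], pred["rotation"],
                         pred["translation"], pred["scale"])
    q = transform_points(target["points"], target["rotation"],
                         target["translation"], target["scale"])
    # A single scalar couples errors in shape, scale, rotation and translation.
    return float(np.mean(np.linalg.norm(p - q, axis=1)))


if __name__ == "__main__":
    pts = np.eye(3)  # three toy points on the unit axes
    gt = dict(points=pts, rotation=np.eye(3),
              translation=np.array([0.0, 0.0, 2.0]), scale=1.0)
    # Same shape and rotation, but the object is placed 1 m too far away.
    pred = dict(gt, translation=np.array([0.0, 0.0, 3.0]))
    print(point_cloud_alignment_loss(pred, gt))
```

Because every parameter enters the loss only through the transformed points, an error in any one of them (for example depth versus scale, which are ambiguous in monocular images) is weighted by its actual effect on the metric 3D alignment.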