Abstract
Estimating rigid objects' poses is one of the fundamental problems in computer vision, with a range of applications across automation and augmented reality. Most existing approaches adopt a one-network-per-object-class strategy, depend heavily on objects' 3D models and depth data, and employ time-consuming iterative refinement, which could be impractical for some applications. This paper presents a novel approach, CVAM-Pose, for multi-object monocular pose estimation that addresses these limitations. The CVAM-Pose method employs a label-embedded conditional variational autoencoder network to implicitly abstract regularised representations of multiple objects in a single low-dimensional latent space. This autoencoding process uses only images captured by a projective camera and is robust to object occlusion and scene clutter. The classes of objects are one-hot encoded and embedded throughout the network. The proposed label-embedded pose regression strategy interprets the learnt latent space representations utilising continuous pose representations. Ablation tests and systematic evaluations demonstrate the scalability and efficiency of the CVAM-Pose method for multi-object scenarios. The proposed CVAM-Pose outperforms competing latent space approaches. For example, it is respectively 25% and 20% better than the AAE and Multi-Path methods when evaluated using the $\mathrm{AR_{VSD}}$ metric on the Linemod-Occluded dataset. It also achieves results somewhat comparable to methods reliant on 3D models reported in BOP challenges. Code is available at https://github.com/JZhao12/CVAM-Pose
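The idea of one-hot encoding object classes and embedding them alongside a latent code can be illustrated with a minimal sketch. This is not the CVAM-Pose implementation: the abstract does not state how the labels are embedded, so simple concatenation is assumed here, and the function names (`one_hot`, `reparameterise`, `embed_label`), the latent size, and the number of classes are all hypothetical.

```python
import math
import random


def one_hot(class_id, num_classes):
    """Return a one-hot list encoding class_id among num_classes classes."""
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id out of range")
    return [1.0 if i == class_id else 0.0 for i in range(num_classes)]


def reparameterise(mu, log_var, rng):
    """Standard VAE sampling: z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def embed_label(features, class_id, num_classes):
    """Assumed embedding scheme: append the one-hot label to a feature vector."""
    return list(features) + one_hot(class_id, num_classes)


# Illustrative values only: a 3-D latent code and 4 object classes.
rng = random.Random(0)
mu = [0.2, -0.1, 0.5]
log_var = [-2.0, -2.0, -2.0]

z = reparameterise(mu, log_var, rng)
decoder_input = embed_label(z, class_id=2, num_classes=4)
```

Under this assumption, the same conditioning step would be repeated wherever the label is injected (encoder, decoder, and pose regressor), which is what allows a single network and a single latent space to serve several object classes.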