Abstract
We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" predicts the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation to coordinate the generation of multiple target views simultaneously. We construct a predictable distribution of geometric features per target view to enable learnability across examples and generalization to arbitrary input images. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") beats competing baselines such as CAT3D, EscherNet, Free3D, and One-2-3-45 on held-out objects from ObjaverseXL, as well as unseen real-world objects from Google Scanned Objects, Amazon Berkeley Objects, and the Digital Twin Catalog.
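
One way to read the hierarchical prior/decoder structure is as a two-stage factorization of the novel-view distribution. The notation below is an illustrative assumption and is not taken from the text: $x_0$ denotes the input image, $g_{1:N}$ the geometric features (pointmaps) for the $N$ target views, and $x_{1:N}$ the generated target views. The sketch also assumes that the decoder is conditioned on the input image in addition to the predicted geometry.

```latex
% Assumed notation: x_0 = input image, g_{1:N} = per-view geometry, x_{1:N} = target views.
\begin{align}
  p(x_{1:N} \mid x_0)
    &= \int p_{\text{decoder}}(x_{1:N} \mid g_{1:N}, x_0)\,
            p_{\text{prior}}(g_{1:N} \mid x_0)\, \mathrm{d}g_{1:N}, \\
  \text{sampling:}\quad
  g_{1:N} &\sim p_{\text{prior}}(\,\cdot \mid x_0), \qquad
  x_{1:N} \sim p_{\text{decoder}}(\,\cdot \mid g_{1:N}, x_0).
\end{align}
```

Under this reading, sampling is ancestral: geometry for all target views is drawn jointly first, and the views are then drawn given that shared geometry, which is what would keep the generated views consistent with one another.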