Abstract
Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.
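
As a rough illustration of the data flow described above (segment the views, complete each part in the context of the whole object, then reconstruct each part in 3D), the following Python sketch chains the three stages. It is not the PartGen implementation: the names `partgen_pipeline`, `segment_views`, `complete_part`, and `reconstruct_3d` are hypothetical, the "views" are toy integer grids, and the three `toy_*` functions are trivial placeholders standing in for the learned diffusion and reconstruction models.

```python
from typing import Callable, Dict, List

# Toy stand-ins (assumptions): a view is a 2D grid of integer pixel values,
# a mask is a 2D grid of booleans with the same shape.
View = List[List[int]]
Mask = List[List[bool]]


def partgen_pipeline(
    views: List[View],
    segment_views: Callable[[List[View]], Dict[int, List[Mask]]],
    complete_part: Callable[[List[View], List[Mask]], List[View]],
    reconstruct_3d: Callable[[List[View]], object],
) -> Dict[int, object]:
    """Hypothetical data flow: segment, complete each part, reconstruct."""
    # Stage 1: one mask per view for every part, shared part ids across views.
    part_masks = segment_views(views)
    parts: Dict[int, object] = {}
    for part_id, masks in part_masks.items():
        # Stage 2: complete the part; the full views are passed as context
        # so the completed part can stay coherent with the whole object.
        completed = complete_part(views, masks)
        # Stage 3: lift the completed multi-view images to a 3D part.
        parts[part_id] = reconstruct_3d(completed)
    return parts


def toy_segment(views: List[View]) -> Dict[int, List[Mask]]:
    # Placeholder: treat every distinct nonzero pixel value as a part id.
    ids = sorted({p for v in views for row in v for p in row if p != 0})
    return {i: [[[p == i for p in row] for row in v] for v in views] for i in ids}


def toy_complete(views: List[View], masks: List[Mask]) -> List[View]:
    # Placeholder: no inpainting, just keep the pixels inside the part mask.
    return [
        [[p if m else 0 for p, m in zip(row, mrow)] for row, mrow in zip(v, mv)]
        for v, mv in zip(views, masks)
    ]


def toy_reconstruct(completed: List[View]) -> int:
    # Placeholder "3D part": the number of part pixels seen across all views.
    return sum(1 for v in completed for row in v for p in row if p != 0)


views = [
    [[1, 1, 0], [2, 2, 0]],  # view 0
    [[1, 0, 0], [2, 2, 2]],  # view 1
]
parts = partgen_pipeline(views, toy_segment, toy_complete, toy_reconstruct)
```

Passing the stages as callables keeps the sketch independent of any particular model; in the setting described above, each placeholder would be replaced by the corresponding multi-view diffusion or reconstruction network.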