Abstract
We present AvatarPopUp, a method for fast, high-quality 3D human avatar generation from different input modalities, such as images and text prompts, and with control over the generated pose and shape. The common theme is the use of diffusion-based image generation networks that are specialized for each particular task, followed by a 3D lifting network. We purposefully decouple the generation from the 3D modeling, which allows us to leverage powerful image synthesis priors trained on billions of text-image pairs. We fine-tune latent diffusion networks with additional image conditioning to solve tasks such as image generation and back-view prediction, and to support multiple qualitatively different 3D hypotheses. Our partial fine-tuning approach allows us to adapt the networks for each task without inducing catastrophic forgetting. In our experiments, we demonstrate that our method produces accurate, high-quality 3D avatars with diverse appearance that respect the multimodal text, image, and body control signals. Our approach can produce a 3D model in as few as 2 seconds, a four orders of magnitude speedup w.r.t. the vast majority of existing methods, most of which solve only a subset of our tasks and with fewer controls, thus enabling applications that require the controlled 3D generation of human avatars at scale. The project website can be found at https://www.nikoskolot.com/avatarpopup/.