AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars

Abstract

3D avatar creation plays a crucial role in the digital age. However, the whole production process is prohibitively time-consuming and labor-intensive. To democratize this technology to a larger audience, we propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solely natural languages. Our key insight is to take advantage of the powerful vision-language model CLIP for supervising neural human generation, in terms of 3D geometry, texture and animation. Specifically, driven by natural language descriptions, we initialize 3D human geometry generation with a shape VAE network. Based on the generated 3D human shapes, a volume rendering model is utilized to further facilitate geometry sculpting and texture generation. Moreover, by leveraging the priors learned in the motion VAE, a CLIP-guided reference-based motion synthesis method is proposed for the animation of the generated 3D avatar. Extensive qualitative and quantitative experiments validate the effectiveness and generalizability of AvatarCLIP on a wide range of avatars. Remarkably, AvatarCLIP can generate unseen 3D avatars with novel animations, achieving superior zero-shot capability.
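
The abstract does not spell out the losses or encoders used, so the following is only a toy sketch of the general idea behind CLIP-style supervision: a candidate is scored by how well its embedding aligns with the embedding of the text prompt. The function names (`cosine_similarity`, `clip_style_loss`, `pick_best_candidate`) and the 3-dimensional embeddings are hypothetical placeholders, not the paper's implementation; a real system would obtain embeddings from CLIP's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_style_loss(image_embedding, text_embedding):
    """Hypothetical loss: 1 - cosine similarity, so it shrinks as the
    two embeddings point in the same direction."""
    return 1.0 - cosine_similarity(image_embedding, text_embedding)

def pick_best_candidate(candidate_embeddings, text_embedding):
    """Return the index of the candidate with the lowest loss."""
    losses = [clip_style_loss(e, text_embedding) for e in candidate_embeddings]
    return min(range(len(losses)), key=losses.__getitem__)

# Made-up embeddings standing in for encoder outputs (assumption).
text_embedding = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],   # orthogonal to the prompt
    [0.9, 0.1, 0.0],   # nearly aligned with the prompt
    [-1.0, 0.0, 0.0],  # opposite to the prompt
]
best = pick_best_candidate(candidates, text_embedding)
print("best candidate index:", best)
```

In an optimization setting the same loss would be minimized by gradient descent on the parameters that produce the rendering, rather than by picking among fixed candidates as done here.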
