InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Abstract

Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, equipped with a novel two-branch diffusion-based generator to predict avatars with audio and text instructions at the same time. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/.
