Abstract
We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose, all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In dyadic conversations, AV-Flow produces an always-on avatar that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/