Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

Abstract

Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and visually appealing videos. However, existing methods primarily focus on single-person animation and struggle with multi-stream audio inputs, where audio is often bound to the wrong person. They also show limited instruction-following capability. To solve these problems, we propose a novel task, Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges of multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
