Abstract
Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Existing approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent disparities between face/hand-centric and whole-body performance, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.