Abstract
We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework, featuring an autoregressive transformer model that synthesizes extended and music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism. Code and model will be available for research purposes.