HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

Abstract

Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on the dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, we introduce a novel LAMA loss. Together, these contribute to a significant improvement in FID of 62.4%, along with respective enhancements in R-precision for top1, top2, and top3 of 41.8%, 26.3%, and 18.3%, thereby advancing both Text-to-Pose control accuracy and FID metrics. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.
