Abstract
Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.
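
To give an intuition for the two jerk-based metrics, the following is a minimal sketch rather than the exact definitions used in our evaluation. It assumes that jerk is approximated by the third finite difference of joint positions, that the per-frame value is the maximum over joints, and that the area is accumulated relative to a baseline jerk level. The function names, the `baseline` parameter, and the input layout are illustrative assumptions.

```python
import math


def jerk_magnitudes(positions, fps):
    """Per-frame jerk magnitude, taking the maximum over joints.

    positions: list of frames; each frame is a list of (x, y, z) joint positions.
    Jerk is approximated with a third forward finite difference (an assumption
    of this sketch), so the result has len(positions) - 3 entries.
    """
    dt = 1.0 / fps
    out = []
    for t in range(len(positions) - 3):
        frame_max = 0.0
        for j in range(len(positions[t])):
            # Third finite difference: p[t+3] - 3 p[t+2] + 3 p[t+1] - p[t]
            d = [
                (positions[t + 3][j][k] - 3 * positions[t + 2][j][k]
                 + 3 * positions[t + 1][j][k] - positions[t][j][k]) / dt ** 3
                for k in range(3)
            ]
            frame_max = max(frame_max, math.sqrt(sum(c * c for c in d)))
        out.append(frame_max)
    return out


def peak_jerk(positions, fps):
    """Largest jerk magnitude in the window (e.g. around a transition)."""
    return max(jerk_magnitudes(positions, fps))


def area_under_jerk(positions, fps, baseline=0.0):
    """Accumulated absolute deviation of the jerk from a baseline level.

    `baseline` is a hypothetical reference value (for instance, an average
    jerk measured on real motion); 0.0 gives the plain area under the curve.
    """
    dt = 1.0 / fps
    return sum(abs(j - baseline) for j in jerk_magnitudes(positions, fps)) * dt


# Usage: a single joint following x(t) = t^3 has a constant jerk of 6.
fps = 10
cubic = [[((t / fps) ** 3, 0.0, 0.0)] for t in range(33)]
print(peak_jerk(cubic, fps), area_under_jerk(cubic, fps))
```

Under these assumptions, an abrupt transition appears as a sharp spike in the jerk curve, which raises the peak value, while a longer stretch of unnatural motion mainly increases the accumulated area.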