ControlMM: Controllable Masked Motion Generation

Abstract

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, despite achieving acceptable control precision, these models suffer from generation speed and fidelity limitations. To address these challenges, we propose ControlMM, a novel approach incorporating spatial control signals into the generative masked motion model. ControlMM achieves real-time, high-fidelity, and high-precision controllable motion generation simultaneously. Our approach introduces two key innovations. First, we propose masked consistency modeling, which ensures high-fidelity motion generation via random masking and reconstruction, while minimizing the inconsistency between the input control signals and the extracted control signals from the generated motion. Second, to further enhance control precision, we introduce inference-time logit editing, which manipulates the predicted conditional motion distribution so that the generated motion, sampled from the adjusted distribution, closely adheres to the input control signals. During inference, ControlMM enables parallel and iterative decoding of multiple motion tokens, allowing for high-speed motion generation. Extensive experiments show that, compared to the state of the art, ControlMM delivers superior results in motion quality, with better FID scores (0.061 vs 0.271), and higher control precision (average error 0.0091 vs 0.0108). ControlMM generates motions 20 times faster than diffusion-based methods. Additionally, ControlMM unlocks diverse applications such as any joint any frame control, body part timeline control, and obstacle avoidance. Video visualization can be found at https://exitudio.github.io/ControlMM-page
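To make the idea of parallel and iterative decoding of multiple motion tokens concrete, the following is a minimal sketch of generic confidence-based masked decoding. It is not the authors' implementation: the cosine unmasking schedule, the greedy token choice, and the names `MASK`, `toy_predict`, and `iterative_decode` are all assumptions introduced for illustration, and the toy predictor stands in for a trained masked motion model. Control signals and logit editing are not modeled here.

```python
import math

MASK = -1  # hypothetical sentinel marking a token that is still masked


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_predict(tokens, vocab_size):
    """Hypothetical stand-in for a masked motion model.

    Returns one logit vector per position. Position i prefers token
    (i * 3) % vocab_size, with a position-dependent sharpness so that
    confidences differ across positions.
    """
    logits = []
    for i in range(len(tokens)):
        row = [0.0] * vocab_size
        row[(i * 3) % vocab_size] = 1.0 + (i % 4)
        logits.append(row)
    return logits


def iterative_decode(length, vocab_size, steps, predict):
    """Fill a fully masked sequence in `steps` parallel passes.

    Each pass predicts every masked position at once, then commits only
    the most confident predictions; the rest stay masked for later passes.
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = predict(tokens, vocab_size)
        candidates = {}
        for i in masked:
            probs = softmax(logits[i])
            best = max(range(vocab_size), key=probs.__getitem__)
            candidates[i] = (best, probs[best])
        # Assumed cosine schedule: fraction of tokens left masked after this pass.
        ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        keep_masked = max(0, math.floor(ratio * length))
        n_commit = max(1, len(masked) - keep_masked)
        by_confidence = sorted(masked, key=lambda i: candidates[i][1], reverse=True)
        for i in by_confidence[:n_commit]:
            tokens[i] = candidates[i][0]
    return tokens


decoded = iterative_decode(length=8, vocab_size=5, steps=4, predict=toy_predict)
print(decoded)
```

Because several tokens are committed per pass, the number of model calls is set by `steps` rather than by the sequence length, which is the general reason this style of decoding can be fast.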
