GENMO: A GENeralist Model for Human MOtion

Abstract

Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.
