Abstract
Stochastic Human Motion Prediction (HMP) aims to predict multiple possible future human pose sequences from observed ones. Most prior works learn motion distributions through encoding-decoding in the latent space, which does not preserve the spatial-temporal structure of motion. While effective, these methods often require complex, multi-stage training and yield predictions that are inconsistent with the provided history and can be physically unrealistic. To address these issues, we propose CoMusion, a single-stage, end-to-end diffusion-based stochastic HMP framework. CoMusion is motivated by the insight that a smooth future pose initialization improves prediction performance, a strategy not previously utilized in stochastic models but evidenced in deterministic works. To generate such an initialization, CoMusion's motion predictor starts with a Transformer-based network for initial reconstruction of corrupted motion. Then, a graph convolutional network (GCN) is employed to refine the prediction, taking past observations into account, in the discrete cosine transform (DCT) space. Our method, facilitated by the Transformer-GCN module design and a proposed variance scheduler, excels in predicting accurate, realistic, and consistent motions while maintaining appropriate diversity. Experimental results on benchmark datasets demonstrate that CoMusion surpasses prior methods across metrics while demonstrating superior generation quality. Our code is released at https://github.com/jsun57/CoMusion/.
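
To make the notion of working in DCT space concrete, the following is a minimal NumPy sketch of a temporal DCT applied to a pose sequence. It is an illustration only and is not taken from the CoMusion code: the function names (`dct_matrix`, `to_dct`, `from_dct`, `low_pass`), the array layout (frames by pose features), and the `keep` parameter are all assumptions made for this example. It shows that a motion sequence can be moved to and from DCT coefficients without loss, and that dropping high-frequency coefficients gives a smoother trajectory.

```python
import numpy as np


def dct_matrix(n_frames: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (n_frames, n_frames).

    Row k holds the k-th cosine basis function sampled over time.
    """
    k = np.arange(n_frames)[:, None]  # frequency index
    t = np.arange(n_frames)[None, :]  # time (frame) index
    basis = np.sqrt(2.0 / n_frames) * np.cos(
        np.pi * (2 * t + 1) * k / (2 * n_frames)
    )
    basis[0] /= np.sqrt(2.0)  # rescale the constant row so the basis is orthonormal
    return basis


def to_dct(motion: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project a (frames, features) motion array onto the temporal DCT basis."""
    return basis @ motion


def from_dct(coeffs: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Invert to_dct; the transpose is the inverse of an orthonormal basis."""
    return basis.T @ coeffs


def low_pass(motion: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the `keep` lowest temporal frequencies (hypothetical helper)."""
    basis = dct_matrix(motion.shape[0])
    coeffs = to_dct(motion, basis)
    coeffs[keep:] = 0.0  # discard high-frequency components
    return from_dct(coeffs, basis)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed toy layout: 20 frames, 6 pose features (e.g. 2 joints x 3 coords).
    motion = rng.standard_normal((20, 6))
    smoothed = low_pass(motion, keep=5)
    # Frame-to-frame jitter, measured as summed squared differences.
    print(np.sum(np.diff(motion, axis=0) ** 2))
    print(np.sum(np.diff(smoothed, axis=0) ** 2))
```

Because the DCT-II basis vectors are eigenvectors of the discrete second-difference operator with reflecting boundaries, zeroing high-frequency coefficients can never increase the summed squared frame-to-frame differences, which is one way to see why a DCT representation lends itself to smooth motion.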