Abstract
This paper addresses the challenge of text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate the information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/