Abstract
Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion due to CLIP's pretraining on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model, which transitions from a language-vision to a more suitable language-motion latent space. It addresses key limitations by generating motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. With LaMP, we advance three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning. For generation, we utilize LaMP to provide the text condition instead of CLIP, and an autoregressive masked prediction is designed to achieve mask modeling without rank collapse in transformers. For retrieval, motion features from LaMP's motion transformer interact with query tokens to retrieve text features from the text transformer, and vice versa. For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model. In addition, we introduce the LaMP-BertScore metric to assess the alignment of generated motions with textual descriptions. Extensive experimental results on multiple datasets demonstrate substantial improvements over previous methods across all three tasks. The code of our method will be made public.