Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Abstract

With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance in terms of conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with a shuffled sequence of events as negative samples during training to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the need to consider temporal elements in motion-language alignment.
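The CAR protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each description has already been decomposed into an ordered list of events, that events are rejoined with a hypothetical ", then " connector, and that the motion-language model is exposed as a hypothetical `score(motion, text)` function returning a similarity. A sample counts as correct only when the ground-truth text scores strictly higher than its chronologically shuffled version.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical connector used to rebuild a sentence from events (an assumption).
CONNECTOR = ", then "


def join_events(events: Sequence[str]) -> str:
    """Rebuild a compound description from its ordered events."""
    return CONNECTOR.join(events)


def shuffled_negative(events: Sequence[str], rng: random.Random) -> List[str]:
    """Return the same events in an order that differs from the original."""
    if len(set(events)) < 2:
        raise ValueError("need at least two distinct events to build a negative")
    original = list(events)
    negative = list(events)
    # Reshuffle until the order actually changes.
    while negative == original:
        rng.shuffle(negative)
    return negative


def car_accuracy(
    samples: Sequence[Tuple[Any, Sequence[str]]],
    score: Callable[[Any, str], float],
    seed: int = 0,
) -> float:
    """Fraction of samples where the ground truth outscores its shuffled version."""
    rng = random.Random(seed)
    correct = 0
    for motion, events in samples:
        positive = join_events(events)
        negative = join_events(shuffled_negative(events, rng))
        if score(motion, positive) > score(motion, negative):
            correct += 1
    return correct / len(samples)


def toy_score(motion: Sequence[str], text: str) -> float:
    """Toy stand-in for a model: counts events that appear at the right position.

    Here a "motion" is just the true ordered list of event labels, which a real
    model would not have access to.
    """
    text_events = text.split(CONNECTOR)
    return float(sum(m == t for m, t in zip(motion, text_events)))


if __name__ == "__main__":
    demo = [
        (["walk forward", "sit down"], ["walk forward", "sit down"]),
        (["jump", "turn left", "wave"], ["jump", "turn left", "wave"]),
    ]
    print(car_accuracy(demo, toy_score))
```

The same `shuffled_negative` texts could in principle be reused as hard negatives during training, as the abstract proposes; how they enter the loss is not specified here and is left out of the sketch.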
