TEACH: Temporal Action Composition for 3D Humans

Abstract

Given a series of natural language descriptions, our task is to generate 3D human motions that correspond semantically to the text, and follow the temporal order of the instructions. In particular, our goal is to enable the synthesis of a series of actions, which we refer to as temporal action composition. The current state of the art in text-conditioned motion synthesis only takes a single action or a single sentence as input. This is partially due to lack of suitable training data containing action sequences, but also due to the computational complexity of their non-autoregressive model formulation, which does not scale well to long sequences. In this work, we address both issues. First, we exploit the recent BABEL motion-text collection, which has a wide range of labeled actions, many of which occur in a sequence with transitions between them. Next, we design a Transformer-based approach that operates non-autoregressively within an action, but autoregressively within the sequence of actions. This hierarchical formulation proves effective in our experiments when compared with multiple baselines. Our approach, called TEACH for "TEmporal Action Compositions for Human motions", produces realistic human motions for a wide variety of actions and temporal compositions from language descriptions. To encourage work on this new task, we make our code available for research purposes at teach.is.tue.mpg.de.
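The hierarchical formulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: there is no Transformer and no learned model, and every name in it (`generate_action`, `compose`, `POSE_DIM`, `CONTEXT_LEN`) is hypothetical. The stand-in generator simply interpolates toward a text-dependent toy pose. What the sketch is meant to show is the control flow only: all frames of one action are produced in a single call (non-autoregressive within the action), while each action is conditioned on the last few frames of the previous one (autoregressive across the sequence of actions).

```python
import random
from typing import List, Optional, Sequence, Tuple

Frame = List[float]  # one pose, reduced here to a toy feature vector

POSE_DIM = 4     # assumption: toy pose dimensionality, not a real body model
CONTEXT_LEN = 2  # assumption: number of past frames handed to the next action


def generate_action(text: str, num_frames: int,
                    context: Optional[Sequence[Frame]]) -> List[Frame]:
    """Hypothetical stand-in for a learned per-action generator.

    Every frame is computed directly from (text, context, frame index) and
    never from other frames generated in this call, which mirrors
    non-autoregressive generation within a single action.
    """
    rng = random.Random(text)  # deterministic toy replacement for a text encoder
    target = [rng.uniform(-1.0, 1.0) for _ in range(POSE_DIM)]
    start = list(context[-1]) if context else [0.0] * POSE_DIM
    frames = []
    for t in range(1, num_frames + 1):
        alpha = t / num_frames
        frames.append([(1 - alpha) * s + alpha * g for s, g in zip(start, target)])
    return frames


def compose(instructions: Sequence[Tuple[str, int]]) -> List[Frame]:
    """Generate a sequence of actions, one action at a time.

    Each action is conditioned on the last CONTEXT_LEN frames of the previous
    action, which mirrors autoregression across the sequence of actions.
    """
    motion: List[Frame] = []
    context: Optional[List[Frame]] = None
    for text, num_frames in instructions:
        segment = generate_action(text, num_frames, context)
        motion.extend(segment)
        context = segment[-CONTEXT_LEN:]
    return motion


if __name__ == "__main__":
    motion = compose([("walk forward", 5), ("sit down", 3)])
    print(len(motion), "frames of dimension", len(motion[0]))
```

Because each call handles only one action plus a short context, the cost per call depends on the length of that action rather than on the length of the whole sequence, which is the scaling argument the abstract makes against a fully non-autoregressive formulation.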
