Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks

Abstract

Linking human whole-body motion and natural language is of great interest for the generation of semantic representations of observed human behaviors as well as for the generation of robot behaviors based on natural language input. While there has been a large body of research in this area, most approaches that exist today require a symbolic representation of motions (e.g. in the form of motion primitives), which have to be defined a priori or require complex segmentation algorithms. In contrast, recent advances in the field of neural networks and especially deep learning have demonstrated that sub-symbolic representations that can be learned end-to-end usually outperform more traditional approaches, for applications such as machine translation. In this paper we propose a generative model that learns a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks (RNNs) and sequence-to-sequence learning. Our approach does not require any segmentation or manual feature engineering and learns a distributed representation, which is shared for all motions and descriptions. We evaluate our approach on 2,846 human whole-body motions and 6,187 natural language descriptions thereof from the KIT Motion-Language Dataset. Our results clearly demonstrate the effectiveness of the proposed model: we show that our model generates a wide variety of realistic motions only from descriptions thereof in the form of a single sentence. Conversely, our model is also capable of generating correct and detailed natural language descriptions from human motions.
