Abstract
Recent advances in 3D human motion and language integration have primarily focused on text-to-motion generation, leaving the task of motion understanding relatively unexplored. We introduce Dense Motion Captioning, a novel task that aims to temporally localize and caption actions within 3D human motion sequences. Current datasets fall short in providing detailed temporal annotations and predominantly consist of short sequences featuring few actions. To overcome these limitations, we present the Complex Motion Dataset (CompMo), the first large-scale dataset featuring richly annotated, complex motion sequences with precise temporal boundaries. Built through a carefully designed data generation pipeline, CompMo includes 60,000 motion sequences, each composed of two to ten actions accurately annotated with their temporal extents. We further present DEMO, a model that integrates a large language model with a simple motion adapter, trained to generate dense, temporally grounded captions. Our experiments show that DEMO substantially outperforms existing methods on CompMo as well as on adapted benchmarks, establishing a robust baseline for future research in 3D motion understanding and captioning.