Learning 3D Human Dynamics from Video

Abstract

From an image of a person in action, we can easily guess the 3D motion of the person in the immediate past and future. This is because we have a mental model of 3D human dynamics that we have acquired from observing visual sequences of humans in motion. We present a framework that can similarly learn a representation of 3D dynamics of humans from video via a simple but effective temporal encoding of image features. At test time, from video, the learned temporal representation can recover smooth 3D mesh predictions. From a single image, our model can recover the current 3D mesh as well as its 3D past and future motion. Our approach is designed so it can learn from videos with 2D pose annotations in a semi-supervised manner. However, annotated data is always limited. On the other hand, there are millions of videos uploaded daily on the Internet. In this work, we harvest this Internet-scale source of unlabeled data by training our model on them with pseudo-ground truth 2D pose obtained from an off-the-shelf 2D pose detector. Our experiments show that adding more videos with pseudo-ground truth 2D pose monotonically improves 3D prediction performance. We evaluate our model on the recent challenging dataset of 3D Poses in the Wild and obtain state-of-the-art performance on the 3D prediction task without any fine-tuning. The project website with video can be found at https://akanazawa.github.io/human_dynamics/.
