A generalizable approach for multi-view 3D human pose regression

Abstract

Despite the significant improvement in the performance of monocular pose estimation approaches and their ability to generalize to unseen environments, multi-view (MV) approaches are often lagging behind in terms of accuracy and are specific to certain datasets. This is mainly due to the fact that (1) contrary to real world single-view (SV) datasets, MV datasets are often captured in controlled environments to collect precise 3D annotations, which do not cover all real world challenges, and (2) the model parameters are learned for specific camera setups. To alleviate these problems, we propose a two-stage approach to detect and estimate 3D human poses, which separates SV pose detection from MV 3D pose estimation. This separation enables us to utilize each dataset for the right task, i.e. SV datasets for constructing robust pose detection models and MV datasets for constructing precise MV 3D regression models. In addition, our 3D regression approach only requires 3D pose data and its projections to the views for building the model, hence removing the need for collecting annotated data from the test setup. Our approach can therefore be easily generalized to a new environment by simply projecting 3D poses into 2D during training according to the camera setup used at test time. As 2D poses are collected at test time using a SV pose detector, which might generate inaccurate detections, we model its characteristics and incorporate this information during training. We demonstrate that incorporating the detector's characteristics is important to build a robust 3D regression model and that the resulting regression model generalizes well to new MV environments. Our evaluation results show that our approach achieves competitive results on the Human3.6M dataset and significantly improves results on a MV clinical dataset that is the first MV dataset generated from live surgery recordings.
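
To make the training-data idea concrete, the following is a minimal sketch of how 3D poses might be projected into each view of a target camera setup and then perturbed to imitate an imperfect SV detector. It is not the authors' implementation: the function names (`project_point`, `make_training_pair`), the skew-free pinhole camera model, and the use of independent Gaussian pixel jitter as the detector model are all assumptions made for illustration, since the abstract does not specify how the detector's characteristics are modeled.

```python
import random

def project_point(point, K, R, t):
    """Project one 3D joint into a view with a pinhole camera (assumed model).

    K is a 3x3 intrinsic matrix (skew ignored), R a 3x3 rotation,
    t a translation; all are plain nested lists or tuples.
    """
    # World coordinates -> camera coordinates: cam = R * point + t
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective divide, then focal length and principal point
    u = K[0][0] * cam[0] / cam[2] + K[0][2]
    v = K[1][1] * cam[1] / cam[2] + K[1][2]
    return (u, v)

def make_training_pair(pose_3d, cameras, noise_std=2.0, rng=None):
    """Build one (multi-view 2D input, 3D target) training sample.

    Assumption: detector error is approximated by Gaussian jitter with
    standard deviation noise_std (in pixels) on every 2D joint.
    """
    rng = rng or random.Random()
    views = []
    for K, R, t in cameras:
        view = []
        for joint in pose_3d:
            u, v = project_point(joint, K, R, t)
            view.append((u + rng.gauss(0.0, noise_std),
                         v + rng.gauss(0.0, noise_std)))
        views.append(view)
    return views, pose_3d

# Hypothetical two-camera setup and a two-joint "pose", for illustration only.
K = [[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(K, I3, (0.0, 0.0, 0.0)), (K, I3, (1.0, 0.0, 0.0))]
pose = [(0.0, 0.0, 5.0), (0.5, 0.0, 5.0)]
inputs_2d, target_3d = make_training_pair(pose, cameras, rng=random.Random(0))
```

Under this sketch, adapting to a new environment only requires replacing `cameras` with the calibration of the test setup; the 3D poses themselves can come from any existing source of 3D pose data.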
