A generalizable approach for multiview 3D human pose regression
Abstract
Despite the significant improvement in the performance of monocular pose estimation approaches and their ability to generalize to unseen environments, multiview approaches often lag behind in accuracy and are specific to certain datasets. This is mainly because (1) contrary to real-world singleview datasets, multiview datasets are often captured in controlled environments to collect precise 3D annotations, and therefore do not cover all real-world challenges, and (2) the model parameters are learned for specific camera setups. To alleviate these problems, we propose a two-stage approach to detect and estimate 3D human poses, which separates singleview pose detection from multiview 3D pose estimation. This separation enables us to utilize each dataset for the right task, i.e. singleview datasets for constructing robust pose detection models and multiview datasets for constructing precise multiview 3D regression models. In addition, our 3D regression approach only requires 3D pose data and its projections to the views for building the model, hence removing the need for collecting annotated data from the test setup. Our approach can therefore be easily generalized to a new environment by simply projecting 3D poses into 2D during training according to the camera setup used at test time. As 2D poses are collected at test time using a singleview pose detector, which might generate inaccurate detections, we model its characteristics and incorporate this information during training. We demonstrate that incorporating the detector’s characteristics is important to build a robust 3D regression model and that the resulting regression model generalizes well to new multiview environments. Our evaluation results show that our approach achieves competitive results on the Human3.6M dataset and significantly improves results on a multiview clinical dataset that is the first multiview dataset generated from live surgery recordings.
Keywords:
Multiview human pose estimation, 3D pose regression, neural networks, generalizability
1 Introduction
Singleview human detection and body pose estimation have enjoyed a great deal of attention over the last decades in the field of computer vision because of their importance for various applications, ranging from activity recognition to human-computer interaction. More recently, the emergence of deep learning has pushed the boundaries in many fields, including computer vision. The combination of deep learning with the availability of large datasets, such as MPII Pose andriluka_cvpr2014 and MS COCO MSCOCO, has spawned many promising approaches for singleview human detection and pose estimation wei_cvpr2016; newell_nips2017; cao_cvpr2017. However, the presence of clutter and occlusions degrades their performance. Capturing an environment from complementary views reduces the risk of occlusions, especially in busy environments, as shown in Figure 1. In addition, the availability of calibrated multiview data greatly facilitates the process of lifting 2D scenes into 3D, which is important for many applications such as augmented reality.
Despite the inherent benefits of capturing an environment from multiple views, multiview approaches have not achieved the same level of maturity as singleview approaches, mostly for two reasons. Firstly, multiview datasets are generally recorded in controlled environments in order to use motion capture systems to acquire precise 3D ground truth location data. This removes the need for the tedious and error-prone manual annotation of the abundant number of frames coming from all views for generating ground truth 3D poses. Even though there are large multiview datasets such as Human3.6M ionescu_pami2014 and HumanEva sigal_ijcv2009, the simple backgrounds and tight clothes required by motion capture systems make these datasets trivial for 2D pose estimation methods: monocular pose estimation approaches report low 2D body part localization errors even without fine-tuning chen_cvpr2017; martinez_iccv2017. For these reasons, single and multiview pose estimation models trained on datasets captured in such controlled laboratory environments do not generalize well to real-world data, which is often visually much more complex due to occlusions, clutter and the presence of multiple persons in the scene. Secondly, current multiview approaches sigal_ijcv2012; dogan_iet2017; pavlakos_cvpr2017_2 learn model parameters that are specific to each multiview camera setup. In other words, applying these approaches to a new multiview scenario requires collecting new annotated data that includes both multiview images and their corresponding 3D ground truth poses for the same camera setup. On the one hand, generating synthetic datasets for these approaches would require not only the generation of 3D body poses, but also photorealistic rendering of humans with different shapes, textures and backgrounds to allow generalization to the real world, which is not a trivial task.
On the other hand, generating such training data using either motion capture systems or manual annotations is very tedious and not always feasible in uncontrolled environments, especially in the case of data-hungry deep learning methods. We therefore propose an approach that benefits from existing multiview datasets to perform multiview 3D pose estimation in new multiview setups.
Our approach formulates the problem of multiview 3D pose estimation in a two-step framework: (1) singleview pose detection and (2) multiview 3D pose regression. We separate these two steps for two reasons. First, we can better exploit available singleview and multiview datasets for the right task. Singleview datasets, such as MPII Pose andriluka_cvpr2014 and MS COCO MSCOCO, include diverse and challenging frames from everyday activities or movies originating from amateur to professional recordings. Therefore, models trained on these datasets can better cope with real-world challenges and generalize to new environments. However, these singleview datasets lack 3D annotations, contrary to multiview datasets, which often come with accurate 3D body poses. As multiview datasets are however much simpler for the task of 2D pose estimation chen_cvpr2017; martinez_iccv2017, researchers have proposed methods to jointly use both single and multiview datasets in order to construct more robust 3D pose estimation models from multiple views amin_pr2014; belagiannis_mva2016. Changes in camera setups however require retraining the model on training data from the same camera setup. This strictly limits the deployment of the models to environments where such training data exists. The second reason for our two-step approach is that we can better generalize to new multiview environments by assuming that lifting 2D body poses into 3D is independent of the images given the 2D pose detections. This assumption implies that we do not need to collect 2D image data for training the 3D regression function and that any set of plausible 3D body poses can be used instead by computing body pose projections into 2D.
To learn a multiview 3D regression function, we propose a method that relies on a multistage neural network. The input of this network is a set of corresponding multiview 2D detections for each individual person. At test time, they are collected using a state-of-the-art singleview detector. We assume that the camera system is fully calibrated and can therefore use epipolar geometry to establish the multiview correspondences per person. This process also allows us to detect the number of persons per multiview frame, which we define as the set of all images captured from all views at the same time step. This is in contrast to current multiview RGB approaches, which tackle either singleperson scenarios gall_ijcv2010; hofmann_ijcv2011 or multiperson scenarios where the number of persons is known a priori luo_icpr2010; belagiannis_eccv2014.
The proposed network consists of a series of blocks of fully connected layers with intermediate supervision at each block. The input to each block is the raw network input, i.e. the concatenated 2D poses, and the output of the previous block if it exists. The network can therefore build a high-dimensional function and refine the output of the previous block to achieve a more reliable regression function. In order to generalize to new multiview setups, we do not use images during training but construct training data solely by projecting Human3.6M’s 3D poses. We use Human3.6M because it is the largest publicly available multiview dataset and it includes men and women of different sizes. The projected 2D poses are generated according to the camera parameters used at test time. In practice, 2D poses are detected at test time using a 2D pose detector that may be noisy and inaccurate. In order to cope with these inaccuracies, we propose to perturb the 2D locations of the body joints during training with random noise that is generated based on the characteristics of the 2D detector. We also propose to incorporate a detection confidence for each body joint, computed based on the amount of noise added during training. This provides a representation for the detection confidence generated by the detector at test time. Therefore, the approach can take into account not only joint locations, but also detection precision to build a robust regression function.
We use two datasets to perform quantitative and qualitative evaluations and compare with state-of-the-art results on these datasets. We first report results on the Human3.6M dataset ionescu_pami2014 to characterize the properties and the performance of our approach. This dataset includes recordings of several actions performed by professional actors of different genders. It has been recorded by a fully calibrated four-view camera system and a motion capture system to collect ground truth 3D positions of the body joints. We also evaluate our approach on a challenging multiview dataset kadkhoda_wacv2017 to show the generalization ability of our approach. This dataset is generated from real surgery recordings obtained in an operating room (OR) using a three-view camera system and is hence called Multiview OR (MVOR) in the following. Our approach improves 3D body part localization on Human3.6M and significantly reduces the localization error on the MVOR dataset without using any training data from this dataset.
The main contributions of the paper are twofold. First, we present a simple and yet accurate multiview 3D pose estimation approach that can generalize well to new multiview environments. In contrast to current state-of-the-art methods, the approach exploits an existing multiview dataset to build models for new multiview environments without any need for new annotation. Second, this is the first multiview RGB approach that has been quantitatively evaluated on data captured in an unconstrained environment.
2 Related Work
Multiview segmentation-based 3D pose estimation. hofmann_ijcv2011 use foreground segmentation to estimate body silhouettes per view. Then, 3D pose candidates are obtained by matching a library of exemplars. Texture information and shape similarity across all views combined with temporal information are used to compute the final 3D poses. Similarly, gall_ijcv2010 propose a two-layer framework that iteratively improves foreground segmentation and retrieved body poses by incorporating both multiview and temporal information. Other approaches have deployed optical flow estimation chen_spcs2008, 2D as well as 3D motion cues Sundaresan_tip2009 and low-rank multiview feature fusion combined with sparse spectral embedding yu_nc2017 to estimate 3D poses. In contrast to our work, these approaches are only evaluated on singleperson datasets. More importantly, it is not always possible to compute the foreground in cluttered environments, such as operating rooms. Therefore, these approaches can only be evaluated on data recorded in environments with simple backgrounds.
Multiview part-based 3D pose estimation. Several multiview 3D pose estimation approaches burenius_CVPR2013; amin_bmvc2013; amin_pr2014; belagiannis_CVPR2014; belagiannis_mva2016; kadkhoda_wacv2017 have been proposed that rely on a part-based framework felzenszwalb_pami2010. This part-based framework provides an elegant formalism to optimize over different potential functions for incorporating image features, multiview cues, temporal information and body physical constraints. burenius_CVPR2013 propose an approach that extends pictorial structures fischler_TC1973; felzenszwalb_ijcv2005 to multiview and performs exact 3D inference by using simple binary pairwise potential functions. Instead, Amin et al. amin_bmvc2013; amin_pr2014 use 2D inference with more complex pairwise potentials, multiview cues and triangulation to estimate 3D poses. belagiannis_CVPR2014 have also deployed different pairwise potentials for incorporating both body physical constraints and multiview features. Their approach performs approximate 3D inference by selecting a limited number of hypotheses per individual. It has further been extended to incorporate temporal information belagiannis_eccv2014 and to use a deep neural network based body part detector belagiannis_mva2016. Recently, pavlakos_cvpr2017_2 used a deep neural network to predict body part score maps across all views and then estimated body poses by using a 3D pictorial structures approach.
In contrast to our work, all these approaches have only been evaluated on datasets recorded in constrained laboratory environments and also require the number of persons to be known a priori. MVDeep3DPS, presented in kadkhoda_wacv2017, is an exception, but this approach relies on multiview RGB-D input to estimate 3D body poses. Additionally, all these approaches generally need to learn model parameters on data from the same camera setup. Moreover, optimizing these energy functions is computationally demanding, especially in 3D, which makes these approaches unsuitable for real-time applications. In our work, we do not require images with pose annotations from the camera setup used at test time and learn model parameters by using existing datasets. Furthermore, our approach performs both human detection and pose estimation. As our regression function uses a multilayer neural network, it runs in super real-time on a single consumer GPU card.
Singleview 3D pose estimation. Recently, many deep learning based approaches have been proposed to directly regress body poses in 3D from a monocular image or an image sequence. pavlakos_cvpr2017 use a stack of fully convolutional networks newell_eccv2016 to iteratively compute 3D heatmaps per body part. tekin_bmvc2016 propose to learn an autoencoder that maps 3D body joints into a high-dimensional latent space for discovering joint dependencies and then to learn a convolutional network that maps an image into this high-dimensional pose space. In tekin_cvpr2016, motion compensation is used to align several consecutive frames and construct a rectified spatiotemporal volume that is then fed into a 3D regression function. Other approaches have built deep pose grammar representations fang_arXiv2017, skeleton maps wan_arXiv2017 and multitask objectives rogez_cvpr2017; luvizon_cvpr2018 to enforce more constraints and obtain a more accurate 3D regression function. These approaches are trained on images with accurate 3D ground truth poses. The main issue is that to generate such accurate 3D annotations, motion capture systems are used in controlled laboratory environments with simple backgrounds. Models trained on such image data do not generalize well to real-world scenes.
Another line of work relies on two-stage methods, where 2D body parts are first predicted using 2D pose detectors wei_cvpr2016; newell_eccv2016; cao_cvpr2017 and then 3D body part locations are computed by relying on these predictions moreno_cvpr2017; chen_cvpr2017; martinez_iccv2017. In comparison with direct 3D regression approaches, these approaches benefit from diverse, challenging and real-world datasets, e.g. MS COCO and MPII Pose, to train reliable 2D pose detector models that generalize well. To compute 3D body locations, exemplar-based approaches match lower and upper body parts separately jiang_icpr2010 or match the whole skeleton chen_cvpr2017. More recently, moreno_cvpr2017 proposed to regress from 2D Euclidean distance matrices (EDM) to 3D EDM instead of using traditional 2D-to-3D regression in the Cartesian coordinate system radwan_iccv2013; ionescu_pami2014. The regression is performed using a fully convolutional network and 3D poses are recovered via a multidimensional scaling algorithm biswas_tase2006. martinez_iccv2017 showed that a simple fully connected network to regress from 2D to 3D outperforms moreno_cvpr2017 and achieves state-of-the-art results on Human3.6M. We also adopt a two-stage framework in our multiview approach and use a fully connected network as a 2D-to-3D regression function. The singleview model in martinez_iccv2017 was however trained on the output of the 2D detector used during test time. In contrast, our approach relies solely on ground truth during training and instead generates training samples that comply with the behavior of the 2D detector used at test time. This is an interesting property of our approach, which enables us to train our network on Human3.6M and test on a completely different multiview dataset.
3 Methodology
In this section, we present our proposed approach for multiview 3D pose estimation. We assume that we have a calibrated multiview system recording an environment from a set of complementary views. Our objective is to detect and predict human body poses in 3D given images captured from all views. In a probabilistic formulation, we want to compute $p(Y,\mathbb{X},\mathbb{I})$, the joint distribution over the following three random variables: (1) the 3D body poses $Y=({y}_{1},{y}_{2},\mathrm{\dots},{y}_{P})$, where $P$ is the number of body joints and ${y}_{i}\in {\mathbb{R}}^{3}$ is a body joint location in 3D; (2) the 2D body poses $\mathbb{X}=({X}_{1},{X}_{2},\mathrm{\dots},{X}_{V})$, where $V$ is the number of viewpoints and ${X}_{j}$ is the tuple of pixel coordinates indicating the body joints of a 2D pose in view $j$; and (3) all 2D images $\mathbb{I}=({I}_{1},{I}_{2},\mathrm{\dots},{I}_{V})$, where ${I}_{j}$ is the image taken from the ${j}^{th}$ viewpoint. Such a formulation makes no limiting assumption and indicates that a 3D body pose is jointly dependent on its appearance in all individual views. However, learning such a model requires collecting training data from the same multiview setup that we want to apply the model to.
Without loss of generality, we can rewrite the joint probability distribution as:
$$p(Y,\mathbb{X},\mathbb{I})=p(Y\mid\mathbb{X},\mathbb{I})\,p(\mathbb{X}\mid\mathbb{I})\,p(\mathbb{I}).$$  (1)
To build a multiview pose estimation approach that can generalize to new environments, we make two conditional independence assumptions. Firstly, the 3D pose $Y$ is assumed conditionally independent of images $\mathbb{I}$ given 2D poses $\mathbb{X}$. Obviously, this is not always correct, as one can find different 3D skeletons that have similar 2D projections due to the 3D-2D perspective effect. The likelihood of such cases however decreases dramatically in a multiview setup, where a working volume is captured from complementary views.
Secondly, we assume that given an image observation for a view $j$, 2D poses in this view are conditionally independent of detections in the other views and other image observations. One can see that this assumption does not hold in case of occlusions. However, we believe that this assumption is reasonable for three reasons: (1) there exist challenging singleview datasets, e.g. MS COCO and MPII Pose, which can be used to train robust singleview pose detection models; (2) recent deep neural network based approaches have achieved very promising results on unseen data and reliably discriminate occluded joints from visible ones cao_cvpr2017; newell_eccv2016; newell_nips2017; and (3) it yields a convenient model that allows us to train a 2D pose detector independently. Considering these two assumptions, we can rewrite the joint probability as:
$$p(Y,\mathbb{X},\mathbb{I})=p(Y\mid\mathbb{X})\,\prod _{j=1}^{V}\left(p({X}_{j}\mid{I}_{j})\,p({I}_{j})\right).$$  (2)
This equation indicates that a 2D pose detector is applied in each view independently and that the 3D pose regression function is solely dependent upon 2D pose detections. We model the first term using a multiview 3D regression function, described in Section 3.4. The input for this function is provided by concatenating 2D detections for each individual person across all views, which is presented in Section 3.2. The second term is the singleview pose detector explained next.
3.1 Singleview 2D Pose Detector
The relaxation assumption mentioned above allows us to use arbitrarily complex models to detect and localize 2D body poses given singleview images. We therefore use the deep convolutional network of cao_cvpr2017 as singleview pose detector. This approach is currently the state-of-the-art approach for multiperson 2D pose estimation. In addition to its reliable multiperson pose estimation performance, the approach runs in nearly real-time. Given an image, the model generates a set of 2D poses, where each body pose is specified by a collection of 18 body parts. For each body part, the model provides its pixel coordinates and a detection confidence. The confidence values are in the range $[0,1]$, where zero indicates undetected body parts.
3.2 Concatenating Detections Across all Views
Given the detected poses per view, we need to find correspondences across the views. As we assume that the camera system is fully calibrated (i.e. both camera intrinsic and extrinsic parameters are available), we use epipolar geometry to find correspondences Hartley00. Let us assume that for each pair of cameras $(C,{C}^{{}^{\prime}})$ the camera parameters are given with respect to the first one:
$$C=K[I\mid\mathbf{0}]\quad\text{and}\quad{C}^{\prime}={K}^{\prime}[R\mid\mathbf{t}],$$  (3)
where $K$ and ${K}^{\prime}$ are camera intrinsic parameters and $[A\mid\mathbf{b}]$ indicates extrinsic parameters. We can compute the fundamental matrix $F$ by:
$$F={K}^{\prime -T}R{K}^{T}{[K{R}^{T}\mathbf{t}]}_{\times},$$  (4)
where ${[\mathbf{b}]}_{\times}$ is the skew-symmetric matrix operator. The fundamental matrix encapsulates all camera parameters and allows us to compute the corresponding epipolar line for a point in the other view, as illustrated in Figure 2.
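As a rough illustration of Equation (4), the following sketch builds $F$ from the calibration of one camera pair and uses it to obtain the epipolar line of a pixel. It assumes NumPy and writes the first factor as the inverse transpose of ${K}^{\prime}$, as in the standard two-view formula. The helper names `skew`, `fundamental_matrix` and `epipolar_line` are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def skew(b):
    """Return the 3x3 matrix [b]_x such that skew(b) @ v == np.cross(b, v)."""
    return np.array([[0.0, -b[2], b[1]],
                     [b[2], 0.0, -b[0]],
                     [-b[1], b[0], 0.0]])

def fundamental_matrix(K, K_prime, R, t):
    """F for the camera pair C = K[I | 0] and C' = K'[R | t]."""
    return np.linalg.inv(K_prime).T @ R @ K.T @ skew(K @ R.T @ t)

def epipolar_line(F, x):
    """Line (a, b, c) in the second view, a*u + b*v + c = 0, for a pixel
    x = (u, v) observed in the first view."""
    return F @ np.array([x[0], x[1], 1.0])
```

A point seen at `x` in the first view should lie on `epipolar_line(F, x)` in the second view, up to detection noise, which is the property the correspondence step relies on.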
Here, we use the fundamental matrix to compute average distances between detected skeletons for all pairs of views. For each possible pair of detections from two distinct views, this distance is the average, over the subset of body joints detected in both skeletons, of the distance between a joint in one view and the epipolar line of the corresponding joint from the other view. We then collect the 2D skeletons of each person across two views by finding disjoint pairs of skeletons with the lowest average distance. We exclude pairs for which the average distance is larger than 20 pixels. We then use the matched skeletons to establish multiview correspondences per individual person. Note that despite the availability of the correspondences, we cannot use triangulation because inaccurate detections lead to high errors in 3D and, more importantly, joints might be detected in fewer than two views, especially in cluttered environments. We therefore use a regression function to compute the 3D positions of the body joints.
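The pairwise matching between two views can be sketched as follows. This is only one plausible reading of the procedure: the distance is measured in a single direction (joints of the second view against epipolar lines from the first view), and the disjoint pairs are chosen greedily in order of increasing distance, neither of which is specified in the text. Skeletons are assumed to be NumPy arrays of shape (P, 3) holding pixel coordinates and a confidence, and the function names are hypothetical.

```python
import numpy as np

MAX_AVG_DIST = 20.0  # pixels, the rejection threshold stated in the text

def avg_epipolar_distance(skel_a, skel_b, F):
    """Mean distance from the joints of skel_b (second view) to the epipolar
    lines of the same joints of skel_a (first view). Only joints detected in
    both skeletons (confidence > 0) contribute."""
    both = (skel_a[:, 2] > 0) & (skel_b[:, 2] > 0)
    if not both.any():
        return np.inf
    ones = np.ones(int(both.sum()))
    xa = np.column_stack([skel_a[both, :2], ones])
    xb = np.column_stack([skel_b[both, :2], ones])
    lines = xa @ F.T                           # row i equals F @ xa[i]
    num = np.abs(np.sum(lines * xb, axis=1))   # |l . x'|
    den = np.hypot(lines[:, 0], lines[:, 1])   # line normal length
    return float(np.mean(num / den))

def match_skeletons(skels_a, skels_b, F, max_dist=MAX_AVG_DIST):
    """Greedy assumption: accept disjoint (i, j) pairs by increasing distance."""
    candidates = sorted(
        (avg_epipolar_distance(a, b, F), i, j)
        for i, a in enumerate(skels_a)
        for j, b in enumerate(skels_b)
    )
    used_a, used_b, pairs = set(), set(), []
    for dist, i, j in candidates:
        if dist > max_dist:
            break  # all remaining candidates are farther away
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return pairs
```

An optimal assignment solver could replace the greedy loop; the greedy variant is used here only because it is short and easy to follow.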
To prepare the input for the regression function, we concatenate skeletons across all views. If a person is not detected in a view, we fill the corresponding entry with zeros. Each body part is represented by three channels: two channels indicating pixel location and the third channel indicating the detection confidence.
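A minimal sketch of this input layout is given below, assuming NumPy and one (num_joints, 3) array per view. The function name `build_network_input` and the use of `None` to mark a view without a detection are illustrative choices, not details from the original implementation.

```python
import numpy as np

def build_network_input(skeletons_per_view, num_joints):
    """Concatenate one person's 2D skeletons over all views into a flat vector.

    skeletons_per_view holds one entry per view: a (num_joints, 3) array of
    (u, v, confidence), or None when the person is not detected in that view.
    The result has length num_views * num_joints * 3.
    """
    parts = []
    for skel in skeletons_per_view:
        if skel is None:
            skel = np.zeros((num_joints, 3))  # undetected view is filled with zeros
        parts.append(np.asarray(skel, dtype=float).reshape(-1))
    return np.concatenate(parts)
```

Keeping the view order fixed matters, since the regression function learns a mapping that is tied to the position of each camera in the input vector.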
3.3 Training Data Generation
As mentioned in the introduction, we generate training samples by projecting 3D skeletons into 2D. The model can therefore be trained on data generated from existing datasets or any set of valid 3D poses. The projected 2D skeletons are computed based on the camera setup used at test time. Since the singleview 2D pose detector used at test time can provide noisy detections, the model needs to be trained on similar noisy detection data to be able to generalize. We therefore evaluate our 2D pose detector on the Human3.6M dataset, which contains both images and ground truth 2D poses, to characterize its performance. We use these evaluation results to design a normally distributed noise model for each body joint. This noise is used to perturb training data. We then compute the confidence for the joint as:
$$conf=\mathrm{max}\left(1-\frac{|w|}{\lambda\,\sigma},0\right),$$  (5)
where $w$ is the amount of additive noise, which is sampled from a normal distribution with zero mean and standard deviation $\sigma $, and $\lambda $ is a coefficient. We use this coefficient to set the confidence of a joint to zero, i.e. undetected, based on the relative amount of added noise with respect to the standard deviation. We use the evaluation results of cao_cvpr2017 on Human3.6M, presented in Section 4.3, to set these parameters. As shown by the experiments, perturbing the training data and incorporating the confidence value are important for the method to generalize well to unseen data.
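The perturbation and the confidence of Equation (5) could be sketched as below. Several details are assumptions rather than facts from the text: the magnitude of $w$ is used so that the confidence stays in $[0,1]$, the scalar noise is applied along a uniformly random image direction, and the values of $\sigma$ and $\lambda$ are left to the caller because the source does not report them. The function names are hypothetical.

```python
import math
import random

def joint_confidence(w, sigma, lam):
    """Confidence decreases linearly with the noise magnitude |w| and is
    clipped to zero once |w| reaches lam * sigma."""
    return max(1.0 - abs(w) / (lam * sigma), 0.0)

def perturb_joint(u, v, sigma, lam, rng=random):
    """Return a noisy (u, v, confidence) triple for one projected joint.

    Assumption: w ~ N(0, sigma) is a scalar offset applied along a random
    direction in the image; the text does not say how w maps to a 2D shift.
    """
    w = rng.gauss(0.0, sigma)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    noisy_u = u + w * math.cos(angle)
    noisy_v = v + w * math.sin(angle)
    return noisy_u, noisy_v, joint_confidence(w, sigma, lam)
```

With this form, a joint displaced by more than $\lambda\sigma$ pixels receives a confidence of zero and is therefore treated like an undetected joint at test time.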
3.4 Multiview 3D Regression Function
As mentioned earlier, the regression function relies solely on the detections provided by the singleview 2D pose detector. In contrast to tekin_bmvc2016; fang_arXiv2017; luvizon_cvpr2018, we do not need to model a complex function to directly map image pixel intensities into body part locations in 3D. Similar to martinez_iccv2017, we model the 3D regression function using a simple multistage multilayer neural network.
The network architecture is illustrated in Figure 3. The network consists of several stages, where each stage is made of four fully connected (FC) layers. The first stage takes the multiview 2D detections as input, described in Section 3.2. Every stage in this network is trained to regress the desired output. This provides intermediate supervision at each stage and alleviates the problem of vanishing gradients that occurs when there are many intermediate layers between the network input and output layers cao_cvpr2017. We can therefore build deep neural networks by stacking several stages. The stagewise supervision is provided by computing the $L_2$ loss between the output of the last layer in each stage and the desired output (${y}^{*}$):
$${\mathcal{L}}_{s}=\frac{1}{N}\sum _{n=1}^{N}{\left\|{y}_{n}^{s}-{y}_{n}^{*}\right\|}_{2}^{2},$$  (6)
where ${\mathcal{L}}_{s}$ is the average loss computed over all $N$ training samples used in this iteration and ${y}_{n}^{s}$ is the output of the last layer at stage $s$ for sample $n$. The network is optimized by computing the overall network loss as a sum of the losses from all $S$ stages that is defined as:
$$\mathcal{L}=\sum _{s=1}^{S}{\mathcal{L}}_{s}.$$  (7) 
Since we need to retrain the model for new multiview setups, we use batch normalization in order to reduce sensitivity to network initialization and learning rate ioffe_icml2015. We also use dropout to avoid overfitting srivastava_jmlr2014 and rectified linear units to achieve nonlinearity nair_icml2010.
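The stagewise structure and the loss of Equations (6) and (7) can be sketched as a plain NumPy forward pass. This is a simplified illustration and not the TensorFlow model used in the paper: batch normalization and dropout are omitted, the weight initialization is an arbitrary choice, and the layer layout (input, three hidden widths, output) is one reading of "four fully connected layers". Each stage after the first receives the raw input concatenated with the previous stage's output, as described in the introduction.

```python
import numpy as np

def init_stage(in_dim, hidden_dim, out_dim, rng):
    """Four fully connected layers: in -> hidden -> hidden -> hidden -> out."""
    dims = [in_dim, hidden_dim, hidden_dim, hidden_dim, out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def run_stage(layers, x):
    """Apply ReLU after every layer except the last, which outputs the 3D pose."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def forward(stages, x):
    """Return the output of every stage for a batch x of shape (N, in_dim)."""
    outputs = []
    for layers in stages:
        stage_in = x if not outputs else np.concatenate([x, outputs[-1]], axis=1)
        outputs.append(run_stage(layers, stage_in))
    return outputs

def total_loss(outputs, y_true):
    """Per-stage mean squared L2 error, summed over all stages."""
    return sum(np.mean(np.sum((y - y_true) ** 2, axis=1)) for y in outputs)
```

Because every stage contributes its own term to the loss, gradients reach the early layers directly through the first stage's output rather than only through the full stack.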
4 Experiments
In this section, we present the evaluation on two multiview datasets and compare with state-of-the-art results.
4.1 Implementation Details
We implement our approach using TensorFlow tensorflow. In each stage of the network, the sizes of the first and last layers are set based on the input and output dimensions and the size of the intermediate layers is set to 1024. Our network is trained using the Adam optimizer. We set the starting learning rate to 0.001 and use exponential decay. The batch size is set to 512 and we train our network for 200 epochs. We observe that the performance of the network reaches a plateau when more than three stages are used. We therefore use three-stage networks throughout our experiments. A forward pass takes less than 1 ms on a 1080Ti GPU. The computation time of our multiview regression model is therefore almost negligible compared to that of the 2D detector.
4.2 Datasets
Human3.6M. Human3.6M is currently the largest multiview human pose estimation dataset. The dataset includes around 3.6 million images collected from 15 actions performed by seven professional actors in a laboratory environment ionescu_pami2014. The actions have been recorded by a four-view RGB camera system, and the camera parameters, both intrinsic and extrinsic, are available. Full-body 3D ground truth annotations are generated using a motion capture system. Following the standard evaluation protocol used in the literature, five subjects (S1, S5, S6, S7, S8) are used for training and two subjects (S9, S11) for testing chen_cvpr2017; pavlakos_cvpr2017; martinez_iccv2017. Mean per joint position error (MPJPE) in millimeters is used as the evaluation metric and test results are collected per action.
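For reference, MPJPE is the Euclidean distance between predicted and ground truth joints, averaged over joints and samples. A minimal NumPy sketch follows; the function name `mpjpe` and the array layout are assumptions made for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error.

    pred and gt have shape (..., P, D): D = 3 with millimeters for 3D poses,
    or D = 2 with pixels for 2D poses. The result is in the same unit.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```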
Multiview OR. The multiview OR (MVOR) dataset is, to the best of our knowledge, the first multiview pose estimation dataset that is generated from recordings in an uncontrolled environment. All activities in an operating room have been recorded for four days using a three-view camera system kadkhoda_wacv2017. We have selected one multiview frame out of every 1500, provided that at least one person appears in one of the views. The dataset has been manually annotated to provide both 2D and 3D upper-body poses. The dataset includes around 700 multiview frames and 1100 persons. The presence of multiple persons and clutter makes this dataset much more challenging than Human3.6M, as can be seen in Figure 1. To report 2D body part localization on this dataset, we use the probability of correct keypoints (PCK) metric that is commonly used for evaluating multiperson pose estimation kadkhoda_wacv2017; cao_cvpr2017. MPJPE is used to report 3D body part localization.
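PCK counts a keypoint as correct when its error falls below a fraction of a per-person reference length. The sketch below shows the general form only: the threshold factor `alpha` and the choice of reference length are hypothetical parameters, since the text does not state which normalization is used on MVOR.

```python
import numpy as np

def pck(pred, gt, ref_length, alpha=0.2):
    """Fraction of keypoints whose error is at most alpha * ref_length.

    pred, gt: (N, P, 2) pixel coordinates for N persons and P keypoints.
    ref_length: (N,) per-person normalization length in pixels
    (hypothetical choice, for example a bounding box size).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    err = np.linalg.norm(pred - gt, axis=-1)                     # (N, P)
    thresh = alpha * np.asarray(ref_length, dtype=float)[:, None]
    return float(np.mean(err <= thresh))
```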
4.3 2D Detection Results
Table 1: 2D detection error (MPJPE in pixels) of cao_cvpr2017 on the Human3.6M train set, per camera.
Camera ID   Hip  Knee  Foot  Shoulder  Elbow  Wrist  Avg
54138969     16    15    14         7     11     16   13
55011271     13     8    10         6      8     10    9
58860488     15    13    12         7     11     18   13
60457274     16     9    10         7     10     11   11
Avg          15    11    11         7     10     14   11

Table 2: 2D detection results (PCK in %) on the MVOR dataset.
Model                        Head  Shoulder  Elbow  Wrist   Hip   Avg
Deep3DPS kadkhoda_wacv2017   93.4      77.0   71.5   73.7  69.1  76.9
cao_cvpr2017                 92.8      90.1   75.6   75.9  58.9  78.6

In this section, we evaluate the 2D detection model of cao_cvpr2017 on both datasets to assess its performance on such unseen data. In addition, we use the results on Human3.6M to model the characteristics of the 2D detector, which are required by our data generation model presented in Section 3.3.
In Table 1, we present the results of the singleview 2D pose detector cao_cvpr2017 on the Human3.6M train set. We should note that the detector has not seen any data from this dataset during training. We use MPJPE in pixels to compute body part localization errors. The results for each body part are reported per camera. The results for head and neck localization are not presented, as the annotations for these body parts differ between Human3.6M and MS COCO, which is used to train the detector. Note that the detector is applied on the whole image, i.e. no bounding box is provided, in contrast to previous work that relies either on ground truth ionescu_pami2014; moreno_cvpr2017; martinez_iccv2017 or on person detectors tekin_cvpr2016 to obtain bounding boxes. In total, $3\%$ of the joints are not detected and the detector achieves an average MPJPE of 11 pixels. It is worth mentioning that the detector performs similarly on the test set. Table 2 presents the results of the 2D detector on the MVOR dataset. The model attains an average PCK of 78.9% on this dataset. We also report the performance of Deep3DPS kadkhoda_wacv2017, which is the state-of-the-art model on this dataset. In contrast to cao_cvpr2017, which is trained on the RGB images of MS COCO, the Deep3DPS model uses both color and depth images and has been trained on MPII Pose and then fine-tuned on a singleview OR dataset. The 2D pose detector of cao_cvpr2017 outperforms Deep3DPS. These results show that the detector achieves fairly promising results on both datasets even without fine-tuning. Comparing the performance of the 2D detector on these two datasets also indicates that the MVOR dataset is much more complex, as the number of undetected joints is much higher ($21\%$ vs. $3\%$).
For generating the training data, the evaluation results on the train set of Human3.6M, which are reported in Table 1, are used to set the parameters of the noise model. The train set from Human3.6M is chosen to avoid any overlap between train and test sets. The coefficient $\lambda $ in (5) is set to two. As a result, $5\%$ of the joints will be labeled as undetected, which is on par with the percentage of undetected joints in Human3.6M.
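Equation (5) is not reproduced in this section, so the following is only one possible reading of why $\lambda = 2$ leads to roughly $5\%$ undetected joints. Assuming, purely for illustration, that a joint is labeled as undetected whenever a standard normal draw exceeds $\lambda$ in absolute value, the expected fraction of undetected joints is the two-sided Gaussian tail probability:

```python
import math


def two_sided_tail(k):
    """P(|z| > k) for a standard normal variable z."""
    return math.erfc(k / math.sqrt(2.0))


# Assumption: lambda acts as a cutoff in units of standard deviations.
print(round(two_sided_tail(2.0), 4))  # 0.0455
```

Under this assumption the cutoff rejects about $4.6\%$ of the samples, which is close to the $5\%$ quoted above; the actual rule defined in (5) may differ.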
4.4 3D Localization Results
Table 3: 3D localization error (MPJPE in mm) on Human3.6M, per action.
Setting                       Direc. Discuss     Eat   Greet   Phone   Photo    Pose  Purch.     Sit    SitD   Smoke    Wait    Walk   WalkD   WalkT     Avg
tekin_cvpr2016                 102.4   147.2    88.8   125.3   118.0   182.7   112.4   129.2   138.9   224.9   118.4   138.8   126.3    55.1    65.8   125.0
chen_cvpr2017                   89.9    97.6    89.9   107.9   107.3   139.2    93.6   136.0   133.1   240.1   106.6   106.2    87.0   114.0    90.5   114.1
pavlakos_cvpr2017               67.4    71.9    66.7    69.1    72.0    77.0    65.0    68.3    83.7    96.5    71.7    65.8    74.9    59.1    63.2    71.9
martinez_iccv2017               51.8    56.2    58.1    59.0    69.5    78.4    55.2    58.1    74.0    94.6    62.3    59.1    65.1    49.5    52.4    62.9
------------------------------------------------------------------------------------------------------------------------------------------------------------
$<$SV, newell_eccv2016$>$       53.4    58.6    62.1    63.2    86.2    83.3      56    58.1    81.2   101.2    68.4    64.1    67.4      51    54.2    67.2
$<$SV, cao_cvpr2017$>$          69.5    75.5    67.6    76.8    84.6    94.9    69.8    68.4    92.2   113.7    77.1    75.1    77.2    59.0    64.2    77.7
------------------------------------------------------------------------------------------------------------------------------------------------------------
$<$SV, GT$>$                    94.2   113.7    96.9   106.5   119.8   127.6    86.5   149.9   145.6   222.3   113.5   111.2   120.9    92.8    92.4   119.6
$<$SV, Noisy GT$>$              69.7    78.8    69.8    77.5    84.4    97.6    64.9    86.5   103.3   125.8    81.8    80.4    83.3    59.9    62.6    81.8
------------------------------------------------------------------------------------------------------------------------------------------------------------
pavlakos_cvpr2017_2             41.2   49.27    42.8    43.5    55.6    46.9    40.3    63.7    97.6   119.9    52.1    42.7    41.8    51.9    39.4    56.9
$<$MV, cao_cvpr2017$>$          39.4    46.9    41.0    42.7    53.6    54.8    41.4    50.0    59.9    78.8    49.8    46.2    51.1    40.5    41.0    49.1
$<$MV, GT$>$                    92.1   105.8   110.1    94.0   128.2   117.0    77.0   152.2   152.0   227.5   122.9   104.3   125.1    88.7    80.9   118.5
$<$MV, Noisy GT$>$              47.1    60.5    48.7    53.5    63.5    71.1    48.7    57.8    72.2    81.7    59.0    55.9    60.6    43.4    44.3    57.9
Human3.6M. As Human3.6M is a fairly new dataset and state-of-the-art results are mainly reported using singleview models, we compare our approach with recent state-of-the-art singleview and multiview models for 3D pose estimation on Human3.6M. For the sake of comparison, we have therefore trained a variant of our proposed regression function that relies solely on singleview input. Table 3 reports evaluation results of our approach with different configurations. Models that rely on singleview input are denoted by SV and multiview ones by MV. These models are trained either on ground truth (GT) 2D poses, on Noisy GT 2D poses as described in Section 3.3, or on 2D detections provided by either newell_eccv2016 or cao_cvpr2017 for comparison. Even though Human3.6M is a single-person dataset, note that in tekin_cvpr2016; pavlakos_cvpr2017 the input images are cropped using bounding boxes around the persons and that the 2D pose detector models of newell_eccv2016 and wei_cvpr2016 used in chen_cvpr2017 and martinez_iccv2017 are applied on bounding boxes around the persons obtained from ground truth.
Our singleview 3D pose regression model trained on 2D detections provided by newell_eccv2016 achieves an average localization error of $67.2$ mm. We should note that our results for this model improve slightly over the results reported by martinez_iccv2017 on the same experimental setup ($67.5$), where the same 2D pose detector trained on MPII Pose is used without any fine-tuning on Human3.6M. martinez_iccv2017 showed that the results can be improved by fine-tuning the model on Human3.6M (62.9 vs. 67.5), which is in line with the results reported in chen_cvpr2017. However, in order to easily generalize to new environments, we do not fine-tune 2D pose detectors, as this would require annotated data. Except for the model $<$SV, newell_eccv2016$>$, which uses the same 2D pose detector during both training and testing for the sake of fair comparison with martinez_iccv2017, all our models use 2D detections provided by cao_cvpr2017 during testing. (Please note that at test time 2D poses are detected using cao_cvpr2017 even for models trained on GT poses, which is different from martinez_iccv2017.) We should note that even though our singleview 3D regression model trained on the 2D detections provided by newell_eccv2016 performs better than the other variants of our singleview model, we decide to use the model of cao_cvpr2017 instead, as it is not restricted to bounding boxes and allows us to detect and estimate 2D body poses in multi-person scenarios, e.g. the MVOR dataset.
The evaluation results show that our singleview model trained on ground truth 2D poses and the model of chen_cvpr2017 perform similarly. This indicates that our regression function, when trained on perfect GT data, eventually behaves similarly to the lookup table used in chen_cvpr2017. One can therefore conclude that if perfect 2D detections are obtained, a 2D-to-3D regression function and a lookup table would work similarly. However, the 2D detections are not perfect in practice. Therefore, by incorporating detection noise during training as described in Section 3.3, we have constructed a model $<$SV, Noisy GT$>$ that copes better with noisy detections (81.8 vs. 119.6). We observe that if we train the model on 2D detections from the same 2D detector used during testing, i.e. cao_cvpr2017, the average MPJPE is improved by only four millimeters. These results indicate that our data generation model presented in Section 3.3 has properly incorporated the detector’s characteristics and that our approach generalizes well to test data.
We also present the evaluation results of our multiview regression function in Table 3. Training the model $<$MV, cao_cvpr2017$>$ on 2D pose detections from the same detector as the one used at test time achieves an average MPJPE of 49 millimeters, which outperforms pavlakos_cvpr2017_2. This is the lower limit for MPJPE on Human3.6M that can be obtained by our MV regression model using this singleview pose detector. During our experiments, we observe that even though our multiview regression models have generally converged to lower training losses than the singleview ones, both singleview and multiview models trained on ground truth poses achieve similar performance (119.6 vs. 118.5). We believe that as the multiview model is only trained on perfect ground truth 2D poses, it always expects the exact projections of a 3D pose in all views. However, since the 2D pose detector provides noisy detections, this is not always possible at test time. The last row shows the results of our multiview regression model trained using 2D poses generated from 3D ground truth by incorporating the 2D detector’s characteristics. We should note that even without fine-tuning the detector on Human3.6M, this model performs similarly to pavlakos_cvpr2017_2, which has been trained on Human3.6M. This model also reduces the error by more than $50\%$ compared to the same model trained on ground truth data only. Furthermore, the model improves the localization results by $\sim 30\%$ compared to the singleview model $<$SV, Noisy GT$>$, indicating that it has properly incorporated 2D body part locations across all views to regress their 3D positions. These results also confirm our hypothesis that incorporating the characteristics of the detector during training enables developing models that are robust to the inaccuracies and failures of the detector at test time.
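The two relative improvements quoted above can be checked directly from the average MPJPE values in Table 3. The short sketch below does the arithmetic; the helper name is illustrative.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error with respect to `baseline`."""
    return (baseline - improved) / baseline


# Average MPJPE values (mm) taken from Table 3.
mv_gt, mv_noisy_gt, sv_noisy_gt = 118.5, 57.9, 81.8

print(f"{relative_reduction(mv_gt, mv_noisy_gt):.1%}")        # 51.1%
print(f"{relative_reduction(sv_noisy_gt, mv_noisy_gt):.1%}")  # 29.2%
```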

Table 4: 3D localization error (MPJPE in cm) on MVOR, per number of supporting views.
            One view           Two views          Three views
Body part   MVDeep3DPS  Ours   MVDeep3DPS  Ours   MVDeep3DPS  Ours
Shoulder            19    13           15     8           10     5
Hip                 27    20           23    15           17    11
Elbow               27    25           23    19           16    12
Wrist               32    34           25    28           18    16
Average             26    23           22    18           15    11
Multiview OR. In order to assess the ability of our approach to generalize to new multiview environments, we evaluate its performance on the multiview OR dataset. We use the 3D poses from Human3.6M, the camera calibration parameters of MVOR and the data generation model described in Section 3.2 to train a multiview 3D regression model. The evaluation results of this model on MVOR are presented in Table 4. We use 3D MPJPE in centimeters as the evaluation metric. Following the convention in MVDeep3DPS kadkhoda_wacv2017, MPJPE is computed for the same set of body parts and is reported per number of supporting views. Our model achieves an average MPJPE of 17 cm on this dataset. The results show a significant improvement in the localization of the body parts as the number of supporting views increases. The average MPJPE is improved by 12 cm for persons who are detected in three views compared to those who are only detected in one view. This clearly indicates the benefit of observing an environment from multiple complementary views and the ability of our regression model to leverage such data for predicting 3D body poses even when some body parts are invisible.
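Generating training inputs for a new setup relies on projecting 3D poses into each view with the calibration parameters of that setup. A minimal sketch of this step, assuming an ideal pinhole camera without lens distortion, is given below. The function names, the parameter layout (a row-major world-to-camera rotation, a translation vector and the intrinsics fx, fy, cx, cy) and the toy values are assumptions made for illustration and do not describe the actual MVOR calibration format.

```python
def project_point(point_3d, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point in the reference frame to pixel coordinates (pinhole model).

    rotation is a 3x3 row-major world-to-camera matrix, translation a 3-vector.
    Returns None when the point lies behind the camera.
    """
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point_3d)) + t
        for row, t in zip(rotation, translation)
    )
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def project_pose(pose_3d, camera):
    """Project every joint of a 3D pose into one view."""
    return [project_point(joint, **camera) for joint in pose_3d]


# Toy camera placed at the origin of the reference frame (made-up parameters).
camera = {
    "rotation": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "translation": [0.0, 0.0, 0.0],
    "fx": 1000.0, "fy": 1000.0, "cx": 500.0, "cy": 400.0,
}
print(project_pose([(0.5, -0.25, 2.0)], camera))  # [(750.0, 275.0)]
```

Repeating `project_pose` with the parameters of each camera yields the set of 2D poses that would then be perturbed by the noise model before training.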
Table 4 also compares the performance of our model with the MVDeep3DPS model kadkhoda_wacv2017. We should note that MVDeep3DPS requires both color and depth images, in contrast to our approach, which relies solely on color images. Our approach, which only uses Human3.6M data, improves the results over MVDeep3DPS, even though MVDeep3DPS is trained on an annotated dataset recorded in the same OR as the one used to capture MVOR. These evaluation results demonstrate that our approach can exploit existing datasets to easily generalize to new multiview setups without any need for new annotations.
4.5 Qualitative Results
In Figures 4 and 5, we show qualitative results on both Human3.6M and MVOR. For generating the qualitative images, the predicted 3D poses are transferred to the room reference frame using an offset computed as the relative difference between the neck location in the ground truth and the neck location in the predicted skeleton. Each row shows a multiview frame. The predicted 3D poses are shown in the last column and the overlaid 2D poses are obtained by projecting the 3D poses into the views. Figure 4 demonstrates the high quality of the predicted 3D body poses. For example, the frame presented in the last row shows that our approach can successfully incorporate evidence across all views to localize the occluded body parts.
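The neck-based offset used for visualization amounts to a rigid translation of the predicted skeleton. A minimal sketch, with an illustrative function name and made-up coordinates, could look as follows:

```python
def align_to_room(predicted_pose, predicted_neck, ground_truth_neck):
    """Translate a predicted 3D pose so that its neck coincides with the ground truth neck."""
    offset = tuple(g - p for g, p in zip(ground_truth_neck, predicted_neck))
    return [tuple(c + o for c, o in zip(joint, offset)) for joint in predicted_pose]


# Toy pose whose first joint is the neck (made-up values).
pose = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
print(align_to_room(pose, predicted_neck=pose[0], ground_truth_neck=(5.0, 5.0, 5.0)))
# [(5.0, 5.0, 5.0), (6.0, 6.0, 6.0)]
```

Since only a translation is applied, the relative configuration of the joints is left unchanged.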
We also show some frames from the multiview OR dataset in Figure 5. As can be seen in this figure, this dataset is much more complex due to the similar appearance of the objects and the people, and due to the presence of many objects and multiple persons in the scenes. Our approach predicts fairly accurate 3D body poses and always correctly detects the left and right side labels, even though it has not seen any data from this dataset or any other data collected in such an OR environment at the training stage. More qualitative results generated by our model on both datasets are available at https://youtu.be/Cx_kTRzqqzA.
The complexity of this dataset also allows us to identify some of the limitations of the proposed approach. For example, we observe that the elbow and wrist localizations are less accurate than those of other body parts, which is in line with the results presented in Tables 1 and 2. We envision that enforcing appearance consistency among the projections of a body part across all views can be used to update and improve the 2D body joint detections. The improved 2D detections could then be fed into our multiview regression model to obtain more accurate localizations of the body parts in 3D. In the last row of Figure 5, we have highlighted a 3D body pose in which the right arm configuration is infeasible because of the physical constraints of the body. We believe that since our training data generation model described in Section 3.2 perturbs 3D poses randomly and does not take the body constraints into account, it may have generated such a training sample. Therefore, it would be interesting to combine our data generation model with a model like the one used in vondrak_cvpr2008 to enforce and verify the physical plausibility of the generated 3D poses.
4.6 Ablation Study
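Table 5: MPJPE (mm) on Human3.6M for different levels of Gaussian noise added to ground truth 2D poses at test time.
Test input                 EDM  SimpBase  Ours
GT                        62.2      37.1  47.2
GT+$\mathcal{N}(0,5)$     67.1      46.7  48.4
GT+$\mathcal{N}(0,10)$    79.1      52.8  50.8
GT+$\mathcal{N}(0,15)$    96.1      60.0  56.4
GT+$\mathcal{N}(0,20)$   115.6      70.2  65.7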
We performed several experiments on Human3.6M to study the impact of each of the components of our approach. We first observe that removing the stage-wise supervision always degrades performance. For example, the average MPJPE increases from 57.9 to 77.2 for our $<$MV, Noisy GT$>$ model. Removing batch normalization leads to a substantial increase in the error (from 57.9 to 175). We also observe that the use of dropout during the training of singleview models and of multiview models on perfect ground truth data is important to obtain more robust models, as it reduces the errors by 20 to 50 mm. However, deactivating dropout for our multiview models trained on cao_cvpr2017’s detections or Noisy GT decreases localization errors by $2$ and $9$ mm, respectively. We believe that this is due to the fact that the 2D detection inputs are constructed from singleview poses that have been independently affected by noise in each view, either by the detector inaccuracy or by our data generation model. This independent noise can therefore work as a regularizer that forces neurons to detect the most relevant information across all views, thereby removing the need for dropout.
Following moreno_cvpr2017 and martinez_iccv2017, we perform a series of experiments to evaluate the performance of our approach under different levels of noise at test time. For a fair comparison, we evaluate our singleview model trained on Noisy GT and add different levels of Gaussian noise to ground truth 2D poses at test time. The evaluation results are presented in Table 5 and are compared with EDM moreno_cvpr2017 and SimpBase martinez_iccv2017. Even though the average localization error of SimpBase is lower than our model’s error by one centimeter when tested on perfect ground truth 2D poses, our model achieves lower localization errors at the higher noise levels, from $\mathcal{N}(0,10)$ onward. This indicates that incorporating the detector’s characteristics during training allows our model to better cope with noise at test time.
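The test-time perturbation used in this experiment can be sketched as adding independent zero-mean Gaussian noise to every 2D joint coordinate. The sketch below assumes that the second argument of $\mathcal{N}(0,\cdot)$ is a standard deviation expressed in pixels, which is not stated explicitly in this section; the function name and the toy pose are illustrative.

```python
import random


def add_gaussian_noise(pose_2d, sigma, rng):
    """Return a copy of a 2D pose with N(0, sigma) noise added to each coordinate."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in pose_2d]


rng = random.Random(0)  # fixed seed so that the perturbation is reproducible
gt_pose = [(320.0, 240.0), (330.0, 300.0), (310.0, 300.0)]  # made-up joints in pixels
for sigma in (5.0, 10.0, 15.0, 20.0):
    print(sigma, add_gaussian_noise(gt_pose, sigma, rng))
```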
In a multiview setup, a 3D body pose can have completely different projections to the views depending on the orientation of the person with respect to the reference coordinate system. We therefore need to construct our multiview regression model in a way that is robust to these changes in the orientation of the person, as our model only relies on these 2D projections to compute 3D body poses. For this reason, we propose to augment the training data by rotating each 3D pose in Human3.6M w.r.t. the reference frame. Figure 6 shows the effect of this data augmentation. We report the results of our multiview model $<$MV, Noisy GT$>$ on the MVOR dataset as a function of the number of rotations applied to each 3D pose in Human3.6M. The results show that applying up to three random rotations decreases the error, but applying more random rotations does not lead to any further improvement. Apart from the evaluation results reported in Figure 6, for all the other evaluations on MVOR we always use our multiview model trained on the train set of Human3.6M, which is augmented by applying three random rotations to each 3D pose.
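The rotation augmentation can be sketched as follows. The text does not specify the rotation axis or the sampling distribution, so this sketch assumes rotations about the vertical (z) axis of the reference frame with angles drawn uniformly from $[0, 2\pi)$; both choices, as well as the function names, are assumptions.

```python
import math
import random


def rotate_about_vertical(pose_3d, angle):
    """Rotate every joint of a 3D pose by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in pose_3d]


def augment_with_rotations(pose_3d, num_rotations, rng):
    """Return `num_rotations` randomly rotated copies of a 3D pose."""
    return [
        rotate_about_vertical(pose_3d, rng.uniform(0.0, 2.0 * math.pi))
        for _ in range(num_rotations)
    ]


rng = random.Random(0)
pose = [(1.0, 0.0, 2.0), (0.0, 1.0, 1.5)]  # made-up joints
augmented = augment_with_rotations(pose, num_rotations=3, rng=rng)
print(len(augmented))  # 3 rotated copies, as used for the MVOR experiments
```

Each rotated copy would then be projected into the views and perturbed by the noise model, exactly like the original poses.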
5 Conclusions
We present an easily generalizable approach for estimating 3D body poses using multiview data. We propose a two-step framework to tackle this problem, which separates singleview pose detection from multiview 3D pose regression. The proposed approach makes it possible to effectively exploit existing datasets to generalize to new multiview environments. We use a multistage neural network as the regression function to estimate 3D poses. Our model is trained on data generated from a set of valid 3D poses by projecting the 3D poses using the camera parameters used at test time and by incorporating the characteristics of the singleview pose detector. Our evaluation results indicate the effectiveness and importance of incorporating the detector’s characteristics during training, as it significantly reduces the localization error and achieves results on par with models trained on the output of the detector. We have also evaluated the generalization of our approach on the multi-person MVOR dataset by using only the camera configuration parameters from this dataset during training, but no image data. Our approach yields fairly accurate results and outperforms the state-of-the-art model on this dataset. The results also show that the localization error decreases dramatically as the number of supporting views increases. This highlights the benefit of our approach in leveraging multiview data to obtain a reliable model for crowded and cluttered environments. To the best of our knowledge, this is also the first multiview RGB approach that has been quantitatively evaluated on a real world dataset for the task of 3D body part localization.