Epipolar Transformers

Abstract

A common approach to localize 3D human joints in a synchronized and calibrated multi-view setup consists of two steps: (1) apply a 2D detector separately on each view to localize joints in 2D, and (2) perform robust triangulation on 2D detections from each view to acquire the 3D joint locations. However, in step 1, the 2D detector is limited to solving challenging cases, such as occlusions and oblique viewing angles, purely in 2D without leveraging any 3D information, even though these cases could potentially be better resolved in 3D. Therefore, we propose the differentiable "epipolar transformer", which enables the 2D detector to leverage 3D-aware features to improve 2D pose estimation. The intuition is: given a 2D location p in the current view, we would like to first find its corresponding point p' in a neighboring view, and then combine the features at p' with the features at p, thus leading to a 3D-aware feature at p. Inspired by stereo matching, the epipolar transformer leverages epipolar constraints and feature matching to approximate the features at p'. Experiments on InterHand and Human3.6M show that our approach has consistent improvements over the baselines. Specifically, in the condition where no external data is used, our Human3.6M model trained with ResNet-50 backbone and image size 256 x 256 outperforms state-of-the-art by 4.23 mm and achieves MPJPE 26.9 mm.
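The intuition in the abstract (find p' on the epipolar line of p, then combine the features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a known fundamental matrix `F` between the two views, nearest-pixel sampling along the line, dot-product similarity with a softmax to approximate the features at p', and additive fusion. The function names `epipolar_line`, `sample_along_line`, and `epipolar_fuse` are hypothetical and chosen only for this example.

```python
import numpy as np

def epipolar_line(F, p):
    """Line (a, b, c) in the neighboring view for pixel p = (x, y): l = F @ [x, y, 1]."""
    return F @ np.array([p[0], p[1], 1.0])

def sample_along_line(line, width, height, num_samples=64):
    """Nearest-pixel samples on a*x + b*y + c = 0 that fall inside the image."""
    a, b, c = line
    if max(abs(a), abs(b)) < 1e-12:
        # Degenerate line: nothing to sample.
        return np.array([], dtype=int), np.array([], dtype=int)
    if abs(b) >= abs(a):
        # Line is closer to horizontal: step along x, solve for y.
        xs = np.linspace(0, width - 1, num_samples)
        ys = -(a * xs + c) / b
    else:
        # Line is closer to vertical: step along y, solve for x.
        ys = np.linspace(0, height - 1, num_samples)
        xs = -(b * ys + c) / a
    xs = np.rint(xs).astype(int)
    ys = np.rint(ys).astype(int)
    inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
    return xs[inside], ys[inside]

def epipolar_fuse(feat_ref, feat_src, F, p, num_samples=64):
    """Fuse the feature at p in the current view with a soft match from the neighbor.

    feat_ref, feat_src: feature maps of shape (H, W, C).
    Returns a vector of shape (C,).
    """
    height, width, _ = feat_src.shape
    query = feat_ref[p[1], p[0]]                      # feature at p, shape (C,)
    xs, ys = sample_along_line(epipolar_line(F, p), width, height, num_samples)
    if xs.size == 0:
        return query.copy()                           # no visible epipolar line
    candidates = feat_src[ys, xs]                     # (K, C) features on the line
    scores = candidates @ query                       # (K,) dot-product similarity
    weights = np.exp(scores - scores.max())           # numerically stable softmax
    weights /= weights.sum()
    matched = weights @ candidates                    # soft estimate of feature at p'
    return query + matched                            # assumed fusion: addition
```

Because the match is a softmax-weighted sum rather than a hard argmax, every step is differentiable with respect to the features, which is the property the abstract emphasizes. For a rectified pair with pure horizontal translation, `F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]` maps p = (x, y) to the line y' = y, so the search reduces to the same image row.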
