Abstract
We present EgoPoseFormer, a simple yet effective transformer-based model for stereo egocentric human pose estimation. The main challenge in egocentric pose estimation is overcoming joint invisibility, which is caused by self-occlusion or the limited field of view (FOV) of head-mounted cameras. Our approach overcomes this challenge with a two-stage pose estimation paradigm: in the first stage, our model leverages global information to estimate each joint's coarse location; in the second stage, it employs a DETR-style transformer to refine the coarse locations by exploiting fine-grained stereo visual features. In addition, we present a Deformable Stereo Attention operation that enables our transformer to effectively process multi-view features and thereby accurately localize each joint in the 3D world. We evaluate our method on the stereo UnrealEgo dataset and show that it significantly outperforms previous approaches while being computationally efficient: it improves MPJPE by 27.4 mm (a 45% improvement) with only 7.9% of the model parameters and 13.1% of the FLOPs of the state-of-the-art method. Surprisingly, with proper training settings, we find that even our first-stage pose proposal network can achieve superior performance compared to prior work. We also show that our method can be seamlessly extended to monocular settings, where it achieves state-of-the-art performance on the SceneEgo dataset, improving MPJPE by 25.5 mm (a 21% improvement) over the best existing method with only 60.7% of the model parameters and 36.4% of the FLOPs. Code is available at: https://github.com/ChenhongyiYang/egoposeformer
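
The results above are reported in MPJPE (mean per-joint position error), which is conventionally the average Euclidean distance between predicted and ground-truth 3D joint positions. The following is a minimal sketch of that standard definition only; the function name `mpjpe` and the toy coordinates are illustrative assumptions, not taken from the EgoPoseFormer code base or its evaluation protocol.

```python
import math


def mpjpe(pred, gt):
    """Average Euclidean distance between matching 3D joints.

    pred, gt: sequences of (x, y, z) tuples in the same unit (e.g. mm).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    # math.dist gives the Euclidean distance between two points.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy joints in millimetres (hypothetical values, not from the paper).
pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
gt = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
error_mm = mpjpe(pred, gt)
```

In this toy case the two per-joint errors are 5 mm and 12 mm, so the metric averages them; lower values indicate more accurate pose estimates.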