Abstract
Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos that involves no convolutional architectures. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames, and then output an accurate 3D human pose for the center frame. We quantitatively and qualitatively evaluate our method on two popular, standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at \url{https://github.com/zczcwh/PoseFormer}.