3D human shape and pose estimation is an essential task for human motion analysis, which is widely used in many 3D applications. However, existing methods cannot simultaneously capture relations at multiple levels, including the spatial-temporal level and the human joint level. They therefore fail to make accurate predictions in hard scenarios involving cluttered backgrounds, occlusion, or extreme poses. To this end, we propose the Multi-level Attention Encoder-Decoder Network (MAED), which includes a Spatial-Temporal Encoder (STE) and a Kinematic Topology Decoder (KTD) to model multi-level attention in a unified framework. STE consists of a series of cascaded blocks based on Multi-Head Self-Attention, and each block uses two parallel branches to learn spatial and temporal attention, respectively. Meanwhile, KTD models joint-level attention. It regards pose estimation as a top-down hierarchical process similar to the SMPL kinematic tree. With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE on the three widely used benchmarks 3DPW, MPI-INF-3DHP, and Human3.6M, respectively. Our code is available at https://github.com/ziniuwan/maed.
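
The top-down hierarchical view of pose estimation can be illustrated with a minimal sketch. This is not the released KTD implementation. It only shows the traversal idea, in which each joint is predicted after its ancestors and receives their predictions as context. The skeleton `PARENTS` is a hypothetical, simplified subset of a body skeleton rather than the full SMPL kinematic tree, and `top_down_order`, `ancestor_chain`, `decode_top_down`, and the `predict_joint` callback are illustrative names introduced here.

```python
from collections import deque

# Hypothetical, simplified skeleton (NOT the full SMPL tree):
# each joint maps to its parent joint; None marks the root.
PARENTS = {
    "pelvis": None,
    "spine": "pelvis",
    "left_hip": "pelvis",
    "right_hip": "pelvis",
    "head": "spine",
    "left_knee": "left_hip",
    "right_knee": "right_hip",
}


def top_down_order(parents):
    """Return joints in breadth-first order so every parent precedes its children."""
    children = {joint: [] for joint in parents}
    roots = []
    for joint, parent in parents.items():
        if parent is None:
            roots.append(joint)
        else:
            children[parent].append(joint)
    order = []
    queue = deque(roots)
    while queue:
        joint = queue.popleft()
        order.append(joint)
        queue.extend(children[joint])
    return order


def ancestor_chain(joint, parents):
    """Return the ancestors of `joint`, from its direct parent up to the root."""
    chain = []
    parent = parents[joint]
    while parent is not None:
        chain.append(parent)
        parent = parents[parent]
    return chain


def decode_top_down(parents, predict_joint):
    """Predict each joint after its ancestors, passing their predictions as context.

    `predict_joint(joint, context)` is a placeholder for a learned per-joint
    predictor; `context` holds the ancestors' predictions, nearest parent first.
    """
    predictions = {}
    for joint in top_down_order(parents):
        context = [predictions[a] for a in ancestor_chain(joint, parents)]
        predictions[joint] = predict_joint(joint, context)
    return predictions
```

As a usage example, `decode_top_down(PARENTS, lambda joint, context: len(context))` returns each joint's depth in the tree, which confirms that a joint's ancestors are always available when that joint is processed. In a learned model, the placeholder predictor would be replaced by a network that maps image features and ancestor predictions to a joint rotation.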