Abstract
Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, which suffer from the quadratic complexity of self-attention, leading to substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, the Efficient Multi-person One-stage model for multi-person EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. EMO-X leverages the superior global modeling capability of Mamba and introduces a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time than state-of-the-art (SOTA) methods while outperforming most of them in accuracy.