FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

Abstract

The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods, they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3 mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.
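The abstract reports its result in MPJPE (mean per-joint position error) but does not define it. As commonly used in the 3D HPE literature, MPJPE is the mean Euclidean distance between each predicted joint and its ground-truth counterpart, usually in millimetres. The following is a minimal sketch of that standard definition, not the authors' evaluation code: the function name `mpjpe`, the `Joint` alias, and the two-joint toy pose are illustrative assumptions, and any alignment step an evaluation protocol may apply beforehand (for example, root-relative centring) is omitted.

```python
import math
from typing import Sequence

# A joint is an (x, y, z) coordinate; units are assumed to be millimetres.
Joint = Sequence[float]


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean Euclidean distance between matching predicted and ground-truth joints."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # math.dist computes the Euclidean distance between two points.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Hypothetical two-joint pose: the first joint is exact, the second is off by 5 mm.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # 2.5
```

For a sequence of frames, the per-frame values would typically be averaged again over all frames and subjects; the exact protocol used for the reported 34.3 mm is not stated in the abstract.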
