Abstract
The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.