Abstract
Most model-based 3D hand pose and shape estimation methods directly regress the parametric model parameters from an image to obtain 3D joints under weak supervision. However, these methods involve solving a complex optimization problem with many local minima, making training difficult. To address this challenge, we propose learning direction-aware hybrid features (DaHyF) that fuse implicit image features and explicit 2D joint coordinate features. This fusion is enhanced by the pixel direction information in the camera coordinate system to estimate pose, shape, and camera viewpoint. Our method directly predicts 3D hand poses with DaHyF representation and reduces jittering during motion capture using prediction confidence based on contrastive learning. We evaluate our method on the FreiHAND dataset and show that it outperforms existing state-of-the-art methods by more than 33% in accuracy. DaHyF also achieves the top ranking on both the HO3Dv2 and HO3Dv3 leaderboards for the metric of Mean Joint Error (after scale and translation alignment). Compared to the second-best results, the largest improvement observed is 10%. We also demonstrate its effectiveness in real-time motion capture scenarios with hand position variability, occlusion, and motion blur.