Abstract
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, which make plausible interaction alignment difficult to achieve. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models with diffusion-based generative 3D interaction refinement to achieve occlusion-robust two-hand reconstruction. First, we introduce a lightweight fusion alignment encoder that aligns fused multimodal 2D priors, such as keypoints, segmentation maps, and depth cues, from vision foundation models during training. This provides robust structured guidance, further enabling efficient inference without heavy foundation model encoders at test time while maintaining high reconstruction accuracy. Second, we implement a two-hand diffusion model explicitly trained to convert interpenetrated 3D poses into plausible, penetration-free counterparts. Through collision gradient-guided denoising, the model rectifies artifacts while preserving natural spatial relationships between hands. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on the InterHand2.6M, HIC, and FreiHAND datasets, significantly advancing occlusion handling and interaction robustness. Our code will be publicly released.
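To make the idea of collision gradient guidance concrete, the following is a minimal illustrative sketch and not the implementation described in this work. It assumes each hand is approximated by sphere proxies of equal radius centered on 3D points, defines a penetration energy as the sum of squared overlap depths, and takes one gradient step that pushes overlapping points apart. The names `penetration_energy_and_grad` and `guided_step`, the sphere approximation, and the `radius` and `step` values are all assumptions introduced for illustration; in a diffusion model such a step would typically be applied to the intermediate pose estimate at each denoising iteration.

```python
import numpy as np

def penetration_energy_and_grad(left, right, radius=0.01):
    """Penetration energy between two point sets treated as spheres.

    left: (N, 3) array, right: (M, 3) array, radius in the same units.
    Returns (energy, grad_left, grad_right). Hypothetical helper.
    """
    diff = left[:, None, :] - right[None, :, :]        # (N, M, 3)
    dist = np.linalg.norm(diff, axis=-1)               # (N, M)
    depth = np.maximum(0.0, 2.0 * radius - dist)       # overlap depth per pair
    energy = float(np.sum(depth ** 2))

    # d(depth^2)/d(left_i) = -2 * depth * (left_i - right_j) / dist
    direction = diff / np.maximum(dist, 1e-9)[..., None]
    pair_grad = -2.0 * depth[..., None] * direction    # (N, M, 3)
    grad_left = pair_grad.sum(axis=1)                  # (N, 3)
    grad_right = -pair_grad.sum(axis=0)                # (M, 3)
    return energy, grad_left, grad_right

def guided_step(left, right, step=0.25, radius=0.01):
    """One gradient-descent step on the penetration energy (assumed step size)."""
    _, g_left, g_right = penetration_energy_and_grad(left, right, radius)
    return left - step * g_left, right - step * g_right

if __name__ == "__main__":
    # Two overlapping spheres, one per hand (toy data, not from any dataset).
    left = np.array([[0.0, 0.0, 0.0]])
    right = np.array([[0.01, 0.0, 0.0]])
    e_before, _, _ = penetration_energy_and_grad(left, right)
    left, right = guided_step(left, right)
    e_after, _, _ = penetration_energy_and_grad(left, right)
    print(e_before > e_after)
```

Non-overlapping pairs contribute zero gradient, so the step only moves points that are in contact, which is one simple way to leave the rest of the pose untouched.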