Abstract
Understanding how two hands interact with each other is a key component of accurate 3D interacting hand mesh recovery. However, recent Transformer-based methods struggle to learn the interaction between two hands because they directly use the two hand features as input tokens, which results in the distant token problem. The distant token problem means that the input tokens lie in heterogeneous spaces, causing the Transformer to fail to capture correlations between them. Previous Transformer-based methods suffer from this problem especially when the poses of the two hands are very different, as they project features from a backbone into separate left- and right-hand-dedicated features. We present EANet, an extract-and-adaptation network, with EABlock as its main component. Rather than directly using the two hand features as input tokens, our EABlock uses two complementary types of novel tokens, SimToken and JoinToken, as input tokens. Both tokens are built from a combination of the separated two-hand features; hence, they are much more robust to the distant token problem. Using the two types of tokens, our EABlock effectively extracts the interaction feature and adapts it to each hand. The proposed EANet achieves state-of-the-art performance on 3D interacting hands benchmarks. The code is available at https://github.com/jkpark0825/EANet.