CorrNet+: Sign Language Recognition and Translation via Spatial-Temporal Correlation

Abstract

In sign language, the conveyance of human body trajectories predominantly relies upon the coordinated movements of hands and facial expressions across successive frames. Despite recent advances, sign language understanding methods often focus solely on individual frames, inevitably overlooking the inter-frame correlations that are essential for effectively modeling human body trajectories. To address this limitation, this paper introduces a spatial-temporal correlation network, denoted as CorrNet+, which explicitly identifies body trajectories across multiple frames. Specifically, CorrNet+ employs a correlation module and an identification module to build human body trajectories. A temporal attention module then adaptively evaluates the contributions of different frames. The resultant features offer a holistic perspective on human body movements, facilitating a deeper understanding of sign language. As a unified model, CorrNet+ achieves new state-of-the-art performance on two sign language understanding tasks: continuous sign language recognition (CSLR) and sign language translation (SLT). Notably, CorrNet+ surpasses previous methods equipped with resource-intensive pose-estimation networks or pre-extracted heatmaps for hand and facial feature extraction. Compared with CorrNet, CorrNet+ achieves a significant performance boost across all benchmarks while halving the computational overhead. A comprehensive comparison with previous spatial-temporal reasoning methods verifies the superiority of CorrNet+. Code is available at https://github.com/hulianyuyy/CorrNet_Plus.
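
As a rough illustration of two ideas named in the abstract, inter-frame correlation and temporal attention over frames, the following minimal sketch may help. It is not the CorrNet+ implementation: the function names, the list-of-vectors feature layout, the plain dot-product affinity, and the externally supplied frame scores are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def frame_correlation(frame_a, frame_b):
    """Affinity between every location in frame_a and every location in frame_b.

    Each frame is a list of per-location feature vectors (hypothetical layout).
    Row i is a softmax over the locations of frame_b, so each row sums to 1.
    """
    return [softmax([dot(p, q) for q in frame_b]) for p in frame_a]


def temporal_attention(frame_vectors, frame_scores):
    """Pool per-frame vectors with softmax weights derived from frame_scores.

    In a trained model the scores would be learned; here they are passed in
    directly (an assumption made for illustration).
    """
    weights = softmax(frame_scores)
    dim = len(frame_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, frame_vectors))
        for d in range(dim)
    ]
    return weights, pooled


# Toy input: two locations whose features swap places between adjacent frames,
# loosely mimicking a body part that moves from one location to the other.
frame_t = [[1.0, 0.0], [0.0, 1.0]]
frame_next = [[0.0, 1.0], [1.0, 0.0]]
corr = frame_correlation(frame_t, frame_next)

# Each location in frame_t is matched most strongly to the location in
# frame_next that carries the same feature.
best_match = [row.index(max(row)) for row in corr]
print(best_match)
```

In this toy setting, the row-wise maxima of the correlation matrix trace where each feature moved between frames, which is the intuition behind using inter-frame correlations to follow body trajectories; the attention weights then decide how much each frame contributes to the pooled representation.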
