Abstract
Sign language is commonly used by deaf or mute people to communicate but requires extensive effort to master. It is usually performed with the fast yet delicate movement of hand gestures, body posture, and even facial expressions. Current Sign Language Recognition (SLR) methods usually extract features via deep neural networks and suffer from overfitting due to limited and noisy data. Recently, skeleton-based action recognition has attracted increasing attention due to its subject-invariant and background-invariant nature, whereas skeleton-based SLR is still under exploration due to the lack of hand annotations. Some researchers have tried to use off-line hand pose trackers to obtain hand keypoints and aid in recognizing sign language via recurrent neural networks. Nevertheless, none of them outperforms RGB-based approaches yet. To this end, we propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multi-modal feature representations towards a higher recognition rate. Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The skeleton-based predictions are fused with other RGB-based and depth-based modalities by the proposed late-fusion GEM to provide global information and make a faithful SLR prediction. Experiments on three isolated SLR datasets demonstrate that our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins. Our code will be available at https://github.com/jackyjsy/SAM-SLR-v2