Abstract
Sign language is used by deaf or speech-impaired people to communicate and requires great effort to master. Sign Language Recognition (SLR) aims to bridge the gap between sign language users and others by recognizing words from given videos. It is an important yet challenging task, since sign language is performed with fast and complex hand gestures, body posture, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention because it is independent of subject and background variation. Furthermore, it can be a strong complement to RGB/D modalities to boost the overall recognition rate. However, skeleton-based SLR is still under exploration due to the lack of annotations on hand keypoints. Some efforts have been made to use hand detectors with pose estimators to extract hand keypoints and learn to recognize sign language via a Recurrent Neural Network, but none of them outperforms RGB-based methods. To this end, we propose a novel Skeleton Aware Multi-modal SLR framework (SAM-SLR) to further improve the recognition rate. Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics and a novel Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. Our skeleton-based method achieves a higher recognition rate than all other single modalities. Moreover, our proposed SAM-SLR framework can further enhance the performance by ensembling our skeleton-based method with other RGB and depth modalities. As a result, SAM-SLR achieves the highest performance in both the RGB (98.42%) and RGB-D (98.53%) tracks of the 2021 Looking at People Large Scale Signer Independent Isolated SLR Challenge. Our code is available at https://github.com/jackyjsy/CVPR21Chal-SLR
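
The multi-modal combination mentioned above can be illustrated with a minimal late-fusion sketch. It assumes each modality has already produced a vector of per-class scores for one video and combines them with a weighted sum. The function names, the modality keys, the toy scores, and the weights are all hypothetical and chosen for illustration only; the abstract does not specify the fusion rule or its weights, so this should not be read as the SAM-SLR implementation.

```python
from typing import Dict, List


def fuse_scores(scores: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted sum of per-class scores across modalities (late fusion).

    `scores` maps a modality name to its per-class score vector;
    `weights` maps the same names to scalar weights (hypothetical values).
    """
    num_classes = len(next(iter(scores.values())))
    fused = [0.0] * num_classes
    for name, per_class in scores.items():
        if len(per_class) != num_classes:
            raise ValueError(f"modality {name!r} has a different class count")
        weight = weights[name]
        for i, score in enumerate(per_class):
            fused[i] += weight * score
    return fused


def predict(scores: Dict[str, List[float]],
            weights: Dict[str, float]) -> int:
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(scores, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy example with three classes; all numbers are made up.
example_scores = {
    "skeleton": [0.2, 0.7, 0.1],
    "rgb": [0.5, 0.4, 0.1],
    "depth": [0.3, 0.3, 0.4],
}
example_weights = {"skeleton": 1.0, "rgb": 0.5, "depth": 0.5}
predicted_class = predict(example_scores, example_weights)
```

In this sketch the skeleton stream is given the largest weight purely as an example; in practice such weights would be tuned on a validation set.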