Skeleton Aware Multi-modal Sign Language Recognition

Abstract

Sign language is commonly used by deaf or speech-impaired people to communicate, but it requires significant effort to master. Sign Language Recognition (SLR) aims to bridge the gap between sign language users and others by recognizing signs from given videos. It is an essential yet challenging task, since sign language is performed with fast and complex movements of hand gestures, body posture, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention because it is independent of subject and background variation. However, skeleton-based SLR is still under exploration due to the lack of annotations on hand keypoints. Some efforts have been made to use hand detectors with pose estimators to extract hand keypoints and learn to recognize sign language via neural networks, but none of them outperforms RGB-based methods. To this end, we propose a novel Skeleton Aware Multi-modal SLR framework (SAM-SLR) to take advantage of multi-modal information towards a higher recognition rate. Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics and a novel Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. RGB and depth modalities are also incorporated and assembled into our framework to provide global information that is complementary to the skeleton-based methods SL-GCN and SSTCN. As a result, SAM-SLR achieves the highest performance in both the RGB (98.42%) and RGB-D (98.53%) tracks of the 2021 Looking at People Large Scale Signer Independent Isolated SLR Challenge. Our code is available at https://github.com/jackyjsy/CVPR21Chal-SLR
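
The abstract does not state how the per-modality predictions are assembled. As an illustration only, the sketch below assumes a simple weighted late fusion: each modality produces one score per sign class, the scores are summed with per-modality weights, and the class with the highest fused score is predicted. The function names, modality keys, weights, and score values are all hypothetical and are not taken from the SAM-SLR code.

```python
from typing import Dict, List


def fuse_scores(scores: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted sum of per-class scores across modalities (late fusion).

    `scores` maps a modality name to its list of per-class scores.
    `weights` maps a modality name to its fusion weight; a missing
    entry defaults to 1.0. Both are illustrative assumptions.
    """
    num_classes = len(next(iter(scores.values())))
    fused = [0.0] * num_classes
    for name, per_class in scores.items():
        if len(per_class) != num_classes:
            raise ValueError(f"modality {name!r} has a different number of classes")
        weight = weights.get(name, 1.0)
        for i, score in enumerate(per_class):
            fused[i] += weight * score
    return fused


def predict(scores: Dict[str, List[float]],
            weights: Dict[str, float]) -> int:
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(scores, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for a 3-class toy problem; the keys are hypothetical labels.
example_scores = {
    "skeleton_gcn": [0.1, 0.7, 0.2],
    "skeleton_cnn": [0.2, 0.5, 0.3],
    "rgb": [0.6, 0.3, 0.1],
    "depth": [0.3, 0.4, 0.3],
}
equal_weights = {name: 1.0 for name in example_scores}
predicted_class = predict(example_scores, equal_weights)
```

With equal weights the skeleton streams dominate the toy example; raising the weight of one modality can change the fused prediction, which is why such weights are usually tuned on a validation set.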
