Neural Sign Language Translation based on Human Keypoint Estimation

Abstract

We propose a sign language translation system based on human keypoint estimation. It is well known that many problems in computer vision require a massive amount of data to train deep neural network models. The situation is even worse for sign language translation, as it is far more difficult to collect high-quality training data. In this paper, we introduce the KETI sign language dataset, which consists of 11,578 high-resolution, high-quality videos. Considering that each country has its own unique sign language, the KETI sign language dataset can be a starting point for further research on Korean sign language translation. Using the KETI sign language dataset, we develop a neural network model for translating sign videos into natural language sentences by utilizing the human keypoints extracted from the face, hands, and body parts. The obtained human keypoint vector is normalized by the mean and standard deviation of the keypoints and used as input to our translation model, which is based on the sequence-to-sequence architecture. We show that our approach is robust even when the size of the training data is not sufficient. Our translation model achieves 94.6% translation accuracy on the validation set and 60.6% on the test set for 105 sentences that can be used in emergency situations. We compare several variants of our neural sign translation model based on different attention mechanisms in terms of classical metrics for measuring translation performance.
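The normalization step mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that the keypoint vector is normalized by the mean and standard deviation of the keypoints; the choices below (normalizing each frame independently, treating the x and y coordinates separately, using the population standard deviation, and adding a small `eps` term) are assumptions made for illustration, and the function names `normalize_keypoints` and `frame_to_vector` are hypothetical.

```python
from statistics import mean, pstdev


def normalize_keypoints(points, eps=1e-8):
    """Standardize one frame of (x, y) keypoints per axis.

    Assumption: x and y are normalized separately within a single frame.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    # eps avoids division by zero when all coordinates on an axis coincide
    return [((x - mx) / (sx + eps), (y - my) / (sy + eps)) for x, y in points]


def frame_to_vector(points):
    """Flatten normalized keypoints into [x_1..x_n, y_1..y_n] (assumed layout)."""
    normed = normalize_keypoints(points)
    return [x for x, _ in normed] + [y for _, y in normed]


# Toy frame with three keypoints in pixel coordinates (made-up values)
frame = [(100.0, 200.0), (110.0, 220.0), (120.0, 240.0)]
vec = frame_to_vector(frame)

# The same pose shifted elsewhere in the image
shifted = [(x + 50.0, y - 30.0) for x, y in frame]
vec_shifted = frame_to_vector(shifted)
```

Under these assumptions the resulting vector does not depend on where the signer stands in the frame or how large they appear, which is one plausible reason such normalization helps when training data is limited.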
