MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language

  • 2019-11-20 22:42:52
  • Hamid Reza Vaezi Joze, Oscar Koller

Abstract

Sign language recognition is a challenging and often underestimated problem comprising multi-modal articulators (handshape, orientation, movement, upper body and face) that integrate asynchronously on multiple streams. Learning powerful statistical models in such a scenario requires much data, particularly to apply recent advances of the field. However, labeled data is a scarce resource for sign language due to the enormous cost of transcribing these unwritten languages.

We propose the first real-life large-scale sign language data set comprising over 25,000 annotated videos, which we thoroughly evaluate with state-of-the-art methods from sign and related action recognition. Unlike the current state-of-the-art, the data set allows to investigate the generalization to unseen individuals (signer-independent test) in a realistic setting with over 200 signers. Previous work mostly deals with limited vocabulary tasks, while here, we cover a large class count of 1000 signs in challenging and unconstrained real-life recording conditions. We further propose I3D, known from video classifications, as a powerful and suitable architecture for sign language recognition, outperforming the current state-of-the-art by a large margin. The data set is publicly available to the community.

 


Hamid Reza Vaezi Joze, Microsoft, Redmond, WA, USA
Oscar Koller, Microsoft, Munich, Germany

 


© 2019. The copyright of this document resides with its authors.
It may be distributed unchanged freely in print or electronic forms.

1 Introduction

In the US, approximately 500,000 people use American Sign Language (ASL) as their primary means of communication [? ]. ASL is also used in Canada, Mexico and 20 other countries. Just like any other natural language, it has its own vocabulary as well as a grammar that differs from spoken English. We are intrigued by sign language and accessibility for the Deaf and believe sign recognition is an exciting field that offers many challenges for computer vision research.

For decades, researchers from different fields have tried to solve the challenging problem of sign language recognition. Most of the proposed approaches rely on external devices such as additional RGB [? ] or depth cameras [? ? ], sensors [? ? ] or colored gloves [? ]. However, such requirements limit the applicability to specific settings where such resources are available. In contrast, non-intrusive and purely vision-based sign recognition will allow for general usage. With the appearance of deep learning based methods and their powerful performance on computer vision tasks, the requirements on training data have changed dramatically, from a few hundred to thousands of samples being needed to train strong models. Unfortunately, public large-scale sign language resources suitable for machine learning are very limited and there is currently no public ASL data set big enough to evaluate recent deep learning approaches. This prevents recent computer vision trends from being applied to this field. As such, our goal is to advance the sign language recognition community and the related state-of-the-art by releasing a new large-scale data set, establishing thorough baselines and carrying over recent computer vision trends. With this work, we make the following contributions:

(1) We release the first large-scale ASL data set, called MS-ASL, that covers over 200 signers, signer-independent sets, challenging and unconstrained recording conditions and a large class count of 1000 signs (instructions and download links: https://www.microsoft.com/en-us/research/project/ms-asl/).
(2) We evaluate current state-of-the-art approaches as baselines: 2D-CNN-LSTM, body key-point, CNN-LSTM-HMM and 3D-CNN.
(3) We propose I3D (known from action recognition) as a powerful and suitable architecture for sign language recognition that outperforms the previous state-of-the-art by a large margin, and we provide a new pre-trained model for it.
(4) We estimate the effect of the number of classes and the number of samples on the performance.

2 Previous Works

Recognition methods: Researchers have tried to solve the challenges of sign language recognition in different ways. The first work, in 1983, was a glove-based device that recognized ASL fingerspelling using a hardwired circuit [? ]. Since then, many related approaches have relied on hand movements tracked with sensor gloves for sign recognition [? ? ? ? ? ]. Some works extended this by adding a camera as a new source of information [? ] and showed that adding video information improves the detection accuracy, although the method still relies mainly on the glove sensors.

In 1988, Tamura et al. [? ] were the first to pursue vision-based sign language recognition. They built a system to recognize 10 isolated signs of Japanese sign language using simple color thresholding. Because signs are performed in three dimensions, many vision-based approaches use depth information [? ? ? ] or multiple cameras [? ]. Some rely on colored gloves to ease hand and finger tracking [? ? ]. In this paper we focus on non-intrusive sign language recognition using only a single RGB camera, as we believe this will allow the design of tools for general usage that empower everybody to communicate with a deaf person using ASL. The sole use of RGB for sign detection is not new: traditional computer vision techniques, particularly with Hidden Markov Models [? ? ? ? ] and mainly inspired by improvements in speech recognition, have been in use for the past two decades. With the advances of deep learning and convolutional networks for image processing, the field has evolved tremendously. Koller et al. showed large improvements by embedding 2D-CNNs in HMMs [? ? ], related works with 3D-CNNs exist [? ? ] and weakly supervised multi-stream modeling is presented in [? ]. However, sign language recognition still lags behind related fields in the adoption of trending deep learning architectures. To the best of our knowledge, no prior work exists that leverages the latest findings from action recognition with I3D networks or complete body key-points, which we address in this paper.

Sign language data sets: Some outdated reviews of sign language corpora exist [? ]. Below, we review sign language data sets with explicit setups intended for reproducible pattern recognition research. The Purdue RVL-SLLL ASL database [? ? ] contains 10 short stories with a vocabulary of 104 signs and a total sign count of 1834 produced by 14 native signers in a lab environment under controlled lighting. The RWTH-BOSTON corpora were originally created for linguistic research [? ] and packaged for pattern recognition purposes later. The RWTH-BOSTON-50 [? ] and the RWTH-BOSTON-104 corpora contain isolated sign language with a vocabulary of 50 and 104 signs, respectively. The RWTH-BOSTON-400 corpus contains a vocabulary of 483 signs and consists of continuous signing by 5 signers. The SIGNUM corpus [? ] provides two evaluation sets. The first is a multi-signer set with 25 signers, each producing 603 predefined sentences with 3703 running gloss annotations and a vocabulary of 455 different signs. The second is a single-signer setup where the signer produces three repetitions of the given sentences. In the scope of the DictaSign project, multi-lingual sign language resources have been created [? ? ? ]. However, the produced corpora are neither well curated nor made available for reproducible research. The Greek Sign Language (GSL) Lemmas Corpus [? ] constitutes such a data collection. It provides a subset with isolated sign language (single signs) that contains 5 repetitions of the signs produced by two native signers. However, different versions of it have been used in the literature, which prevents fair comparisons and its use as a benchmark corpus. The corpus has been referred to with 1046 signs [? ? ], with 984 signs [? ] and with 981 signs [? ]. Additionally, a continuous 100 sign version of the data set has been used in [? ]. The reason for all these circulating subsets is that the data has not been made publicly available.
DEVISIGN is a Chinese sign language data set featuring isolated single signs performed by 8 non-natives [? ] in a laboratory environment (controlled background). The data set is organized in 3 subsets, covers a vocabulary of up to 2000 isolated signs and provides RGB with depth information in 24000 recordings. A more recent data set [? ] covers 100 continuous Chinese sign language sentences, each produced five times by 50 signers. It has a vocabulary of 178 signs and can be considered a staged recording. The Finnish S-pot sign spotting task [? ] is based on the controlled recordings from the Finnish sign language lexicon [? ]. It covers 1211 isolated citation form signs that need to be spotted in 4328 continuous sign language videos. However, the task has not been widely adopted by the field. The RWTH-PHOENIX-Weather 2014 [? ? ] and RWTH-PHOENIX-Weather 2014 T [? ] are large-scale real-life sign language corpora that feature professional interpreters recorded from broadcast news. They cover continuous German sign language with a vocabulary of over 1000 signs and about 9 hours of training data, but feature only 9 signers and limited computer vision challenges. Several groups have experimented with their own private data collections, resulting in corpora of quite limited size in terms of total number of annotations and vocabulary, such as the UWB-07-SLR-P corpus of Czech sign language [? ], the data set by Barbara Loeding's group [? ? ] and other small-scale corpora [? ? ].

Data set | Classes | Signers | Signer independent | Videos | Real-life
Purdue ASL [? ] | 104 | 14 | no | 1834 | no
Video-Based CSL [? ] | 178 | 50 | yes | 25000 | no
Signum [? ] | 465 | 25 | yes | 15075 | no
RWTH-Boston [? ] | 483 | 5 | no | 2207 | no
RWTH-Phoenix [? ] | 1080 | 9 | no | 6841 | yes
Devisign [? ] | 2000 | 8 | no | 24000 | no
This work | 1000 | 222 | yes | 25513 | yes
Table 1: Comparison of public sign language data sets.

Table 1 summarizes the mentioned data sets. To the best of our knowledge RWTH-PHOENIX-Weather 2014 and DEVISIGN are currently the only publicly available data sets that are large enough to cover recent deep learning approaches. However, both data sets are lacking the variety and number of signers to advance the state-of-the-art with respect to the important issue of signer independence and computer vision challenges from natural unconstrained recordings. In the scope of this work, we propose the first ASL data set that covers over 200 signers, signer independent sets, challenging and unconstrained recording conditions and a large class count of 1000 gloss level signs.

3 Proposed ASL Data Set

Since there is no public ASL data set suitable for large-scale sign language recognition, we looked for realistic data sources. The deaf community actively uses public video sharing platforms for communication and study of ASL. Many of those videos are captured and uploaded by ASL students and teachers. They constitute challenging material with large variation in view, background, lighting and positioning. Also from a language point of view, we encounter regional, dialectal and inter-signer variation. This seems very appealing from a machine learning point of view as it may further close the gap in learning signer independent recognition systems that can perform well in realistic circumstances. Besides having access to well suited data, the main issue remains labeling which requires skilled ASL natives.

We noticed that many of the public videos have manual subtitles, captions, descriptions or a video title that indicates which signs are being performed. We therefore decided to access the public ASL videos and obtain the text from all those sources. We processed these video clips automatically in three distinct ways: (1) For longer videos, we used Optical Character Recognition (OCR) to find printed labels and their time of occurrence. (2) For longer videos with video captions, we took the sign descriptor and the temporal segmentation from the captions. (3) For short videos, we obtained the label directly from the title.

In the next step, we detected bounding boxes and used face recognition to find and track the signer. This allowed us to identify descriptions that refer to a static image rather than an actual signer. If we identified multiple signers performing one after the other, we split the video into smaller samples.

In total we accessed more than 45,000 video samples that include words or phrases in their descriptions. We sorted the words by frequency to find the most frequently used ones while removing misspellings and OCR mistakes. Since many of the publicly accessible ASL vocabulary videos come from teachers presenting a lesson vocabulary or students doing their homework, all of the top hundred words belong to the vocabulary units of ASL tutorial books [? ? ]. Some of the videos referenced in MS-ASL originate from ASL-LEX [? ].

3.1 Manual Touch-up

Although many of the sample videos are good for training purposes, some of them include instructions for the sign or several repeated performances with long pauses in between. Therefore, we decided to manually trim all video samples with a duration of more than 8 seconds. For higher accuracy on the test set, we chose a threshold of 6 seconds there. Although our annotators were not native ASL signers, they could easily trim these video samples by considering other samples of the same label. We also decided to review video samples shorter than 20 frames. In this way, around 25% of the data set was manually reviewed. Figure 1 shows a histogram of the duration of the 25,513 video samples of signs after the manual touch-up. There are unusual peaks at multiples of 10 frames, which seem to be caused by video editing software that favors such durations when cutting and adding captions. Despite that, the histogram resembles a Poisson distribution with a mean of 60 frames. Combined, the video samples are just over 24 hours long.

Figure 1: Histogram of frame numbers for ASL1000 video samples.
Figure 2: Number of video samples for each of the 222 signers and the train/test/validation split of the proposed data set.

3.2 ASL synonyms

Sign languages all over the world are independent, fully fledged languages with their own grammar and word inventory, distinct from the related spoken language. Sign languages can be characterized as unwritten languages and have no commonly used written form. Therefore, a written word will usually represent a semantic description and refer to the meaning of a sign, not to the way it is executed. This is fundamentally different from most writing schemes of spoken languages. As an example, consider the two English words Clean and Nice. While they are clearly distinct in English, they have similar signs in ASL which share the same hand gesture and movement. On the other hand, the English word Run has a distinct sign for each of its meanings, such as "walk and run", "run for office", "run away" and "run a business" [? ]. With respect to the ASL videos we accessed from the internet and their descriptions, we needed to make sure that similar ASL signs were merged into one class for training even if they have distinct English descriptors. This process was based on reference ASL tutorial books [? ]. This mapping of sign classes will be released as part of the MS-ASL data set.
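The merging of English descriptors into sign classes can be pictured with a small sketch. This is only an illustration under assumptions: the function name `merge_synonyms`, the lowercase labels and the choice of the alphabetically first word as the class name are hypothetical, and the actual mapping is the one released with the data set.

```python
def merge_synonyms(labels, synonym_groups):
    """Map each English descriptor to one class name per group of similar signs.

    labels: list of English descriptors, one per video sample.
    synonym_groups: iterable of sets of descriptors that share one sign.
    The class name is the alphabetically first word of the group
    (an arbitrary, hypothetical convention for this sketch).
    """
    canonical = {}
    for group in synonym_groups:
        representative = sorted(group)[0]
        for word in group:
            canonical[word] = representative
    # Descriptors that belong to no group keep their own label.
    return [canonical.get(label, label) for label in labels]


# "clean" and "nice" share a sign, so both end up in the same class.
merged = merge_synonyms(["nice", "clean", "run"], [{"clean", "nice"}])
```

Note that the opposite case (one English word such as Run with several distinct signs) needs the labels to be split rather than merged, which this sketch does not attempt.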

3.3 Signer Identification

Signer dependency is one of the biggest obstacles for current non-intrusive sign recognition approaches. To address this issue, our goal is to create a recognition corpus with signer-independent sets. We want to ensure that the signers occurring in train, validation and test are distinct. Therefore, we aimed at identifying the signer in each sample video. To achieve this, we computed 5 face embeddings [? ] for each video sample. Based on these, the video samples were clustered into 457 clusters. Some of these clusters were merged later by using the prior knowledge that two consecutive samples from a video tend to have the same signer. Additionally, we manually labeled the low-confidence clusters. We ended up with 222 distinct signers. The identified individuals occur in the corpus with very diverse frequency: 3 signers have more than one thousand video samples each and 10 signers have a single video sample each. We then solved an optimization problem to distribute the signers into train, validation and test sets, aiming for partitions of 80%, 10% and 10% of the data, respectively. However, due to the signer independence constraint and the unbalanced samples, an exact division into these sizes was impossible. We relaxed this condition while maintaining at least one sample in each set for each class. The final number of signers in each set was 165, 37 and 20 for train, validation and test, respectively. Figure 2 shows the frequency of samples for all 222 signers and the train/validation/test split.
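To make the signer-independent split more concrete, the following is a minimal sketch of one way to assign whole signers to partitions. It is a simple greedy heuristic chosen for illustration, not the optimization actually solved for MS-ASL, and it ignores the additional constraint of keeping at least one sample per class in each set. The function name `split_signers` and its arguments are hypothetical.

```python
from collections import Counter


def split_signers(samples_per_signer, targets=(0.8, 0.1, 0.1)):
    """Greedily assign whole signers to train/validation/test.

    samples_per_signer: dict mapping signer id -> number of video samples.
    targets: desired fraction of all samples in each partition.
    Every signer lands in exactly one partition, so the sets are
    signer independent by construction.
    """
    names = ("train", "validation", "test")
    total = sum(samples_per_signer.values())
    counts = Counter()
    assignment = {}
    # Place the largest signers first; each goes to the partition that is
    # currently furthest below its target number of samples.
    ordered = sorted(samples_per_signer.items(), key=lambda kv: (-kv[1], kv[0]))
    for signer, n in ordered:
        deficit = {name: t * total - counts[name] for name, t in zip(names, targets)}
        best = max(names, key=lambda name: deficit[name])
        assignment[signer] = best
        counts[best] += n
    return assignment, dict(counts)


assignment, counts = split_signers({"a": 80, "b": 10, "c": 10})
```

With a few very large signers and many tiny ones, as in this corpus, such a heuristic can only approximate the 80/10/10 target, which matches the observation that an exact division was impossible.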

3.4 MS-ASL Data Set with 4 Subsets

To provide a good understanding of the ASL vocabulary and a comprehensive benchmark for classifying signs with diverse training samples, we release 4 subsets covering the 100, 200, 500 and 1000 most frequent words. Each includes its own train, test and validation sets. All these sets are signer independent, and the signers for train (165), test (20) and validation (37) are the same as shown in Figure 2; therefore each smaller set is a subset of the larger ones. We call these subsets ASL100, ASL200, ASL500 and ASL1000 for the rest of this paper. Table 2 shows the characteristics of each of these sets. In ASL100, there are at least 45 samples for each class, while in ASL1000 there are at least 11 samples for each class.

Data Set Challenges: There are challenges in this data set which make it unique compared to other sign language data sets and more challenging compared to video classification data sets: (1) One sample video may include repeated performances of a single sign. (2) One word can be signed differently in different dialects based on geographical region. As an example, there are 5 common signs for the word Computer. (3) It includes a large number of signers and is a signer-independent data set. (4) There is large visual variability in the videos, such as background, lighting, clothing and camera viewpoint.

Data set | Classes | Subjects | Train videos | Validation videos | Test videos | Total videos | Duration [hours:min] | Min videos per class | Mean videos per class
ASL100 | 100 | 189 | 3789 | 1190 | 757 | 5736 | 5:33 | 47 | 57.4
ASL200 | 200 | 196 | 6319 | 2041 | 1359 | 9719 | 9:31 | 34 | 48.6
ASL500 | 500 | 222 | 11401 | 3702 | 2720 | 17823 | 17:19 | 20 | 35.6
ASL1000 | 1000 | 222 | 16054 | 5287 | 4172 | 25513 | 24:39 | 11 | 25.5
Table 2: Statistics of the 4 proposed subsets of the MS-ASL data set.

3.5 Evaluation Scheme

We suggest two metrics for evaluating algorithms run on these data sets: 1) average per class accuracy, 2) average per class top-five accuracy. We prefer per class accuracy over plain accuracy because it better reflects performance given the unbalanced test set, which is inherited from the unbalanced nature of the data set. To be more precise, we compute the accuracy of each class and report the average value. For top-five accuracy, we count a prediction as correct if the ground-truth label appears in the top five guesses of the method being evaluated. We compute top-five accuracy for each class and report the average value. ASL, just like any other language, can have ambiguity which can be resolved in context. Therefore, we picked top-five accuracy as the second metric.
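Both metrics can be computed with one small routine. The sketch below assumes that each method returns, for every test sample, a list of class labels ranked from most to least likely; the function name `per_class_topk_accuracy` is hypothetical. Setting `k=1` gives the average per class accuracy and `k=5` the average per class top-five accuracy.

```python
from collections import defaultdict


def per_class_topk_accuracy(labels, ranked_predictions, k=1):
    """Average over classes of the per-class top-k accuracy.

    labels: ground-truth label of each test sample.
    ranked_predictions: for each sample, labels ranked best first.
    A sample counts as correct if its label is among the first k guesses.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, ranked in zip(labels, ranked_predictions):
        totals[label] += 1
        if label in ranked[:k]:
            hits[label] += 1
    # Each class contributes equally, regardless of how many samples it has.
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


labels = ["a", "a", "a", "b"]
ranked = [["a", "b"], ["a", "b"], ["b", "a"], ["a", "b"]]
top1 = per_class_topk_accuracy(labels, ranked, k=1)
```

In this toy example plain accuracy would be 2/4, while the per class average is (2/3 + 0) / 2, which shows how a rare, poorly recognized class weighs more heavily under the per class metric.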

4 Baseline Methods

Although it is much more challenging, isolated sign language recognition can be considered similar to action recognition or gesture detection, as it is a video classification task centered on a human being. Current action recognition and gesture detection methods fall into three major categories, or combinations of them: 1) applying 2D convolutions to each image with a recurrent network on top [? ? ], 2) extracting the subject's body joints in the form of a skeleton and using the skeleton data for recognition [? ? ], 3) using 3D convolutions [? ? ? ]. In order to have baselines from each category of human action recognition, we implemented at least one method for each of them.

For all of the methods, we use the bounding box covering the signer as the input image. We extract the person bounding box with the SSD network [? ] and release it for each video sample as part of the MS-ASL data set. For spatial augmentation, body bounding boxes are randomly scaled or translated by 10%, fit into a square and resized to a fixed 224×224 pixels. We picked 64 as our temporal window, which is the average number of frames across all sample videos. In addition, the resulting video is randomly but consistently (per video) flipped horizontally, because ASL is symmetrical and can be performed with either hand. We used a fixed number of frames as well as a fixed resolution for the 2D and 3D convolution methods. For temporal augmentation, 64 consecutive frames are picked randomly from the videos, and shorter videos are randomly elongated by repeating their first or last frame. We train for 40 epochs. In this paper, we focused on RGB-only algorithms and did not use optical flow for any of the implementations. Using optical flow as a second stream in both the train and test stages [? ? ? ], or in the train stage only [? ], has been shown to boost prediction performance. Below, we describe the methods used for determining baselines.
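The temporal augmentation can be sketched as a function that returns the frame indices of one training clip. This is an illustration under assumptions: the name `sample_clip_indices` is hypothetical, and for short videos the sketch splits the padding randomly between copies of the first frame and copies of the last frame, which is one possible reading of "repeating their first or last frame".

```python
import random


def sample_clip_indices(num_frames, window=64, rng=random):
    """Return `window` frame indices for one training clip.

    Long videos: a random run of `window` consecutive frames.
    Short videos: all frames, padded by repeating the first and/or last
    frame (the random split of the padding is an assumption of this sketch).
    """
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if num_frames >= window:
        start = rng.randint(0, num_frames - window)  # inclusive bounds
        return list(range(start, start + window))
    missing = window - num_frames
    front = rng.randint(0, missing)  # copies of the first frame to prepend
    back = missing - front           # copies of the last frame to append
    return [0] * front + list(range(num_frames)) + [num_frames - 1] * back


clip = sample_clip_indices(100, rng=random.Random(0))
padded = sample_clip_indices(10, rng=random.Random(0))
```

The horizontal flip would be decided once per clip and applied to all 64 selected frames, so that the motion stays consistent within a sample.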

2D-CNN : The high performance of 2D convolutional networks on image classification makes them the first candidate for video processing. This is achieved by extracting features from each frame of the video independently. The first approach was to combine these features by simply pooling the predictions, but this ignores frame ordering and timing. The next approach, which proved more successful, was to use recurrent layers on top of 2D convolutional networks. Motivated by [? ], we picked LSTM [? ] as our recurrent layer, which captures the temporal ordering and long-range dependencies by encoding the states. We used a VGG16 [? ] network followed by average pooling and an LSTM layer of size 256 with batch normalization. The final layers are a layer with 512 hidden units followed by a fully connected layer for classification. We considered the output on the final frame for testing. We have also implemented [? ], the state-of-the-art on the PHOENIX2014 data set [? ? ]. This method uses GoogLeNet [? ] as the 2D-CNN with 2 bi-directional LSTM layers and a 3-state HMM. We report it as Re-Sign in the experimental results.

Body Key-Points : With the introduction of robust body key-point (so-called skeleton) detection [? ], some studies try to solve human action recognition with body joints only [? ? ] or use body joints along with the image stream [? ]. Since most body key-point techniques did not cover hand details, it was not reasonable to use them for sign language recognition, which relies heavily on the movement of the fingers. However, a recent work covers hand and face key-points along with the classical skeleton [? ]. We leveraged this technique, which extracts 137 key-points in total, to establish a body key-point baseline on our data set. We extracted the key-points for all samples using [? ? ]. Using 64 frames as the time window, the input to the network is of size 64×137×3, representing the x and y coordinates and confidence values of the 137 body key-points for 64 consecutive frames. Figure 3 illustrates the extracted 137 body key-points for a video sample from MS-ASL. The hand key-points are not as robust as those of the body and face.

Figure 3: Extracted 137 body key-points for a video sample from MS-ASL by [? ? ].

We implemented the hierarchical co-occurrence network (HCN) [? ], which originally used 15 joints. We extended this work by using 137 body key-points including hand and face key-points. The input to this network consists of the original 137 body key-points as well as their per-frame differences. The network includes three layers of 2D convolution on top of each input as well as two extra 2D convolution layers after the concatenation of the two paths.
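The second input of the key-point network, the per-frame difference, can be sketched as follows. The details are assumptions made for illustration: the function name `frame_differences` is hypothetical, only the x and y coordinates are differenced (the confidence value is dropped), and the first frame is given a zero difference so that the output has as many frames as the input.

```python
def frame_differences(keypoints):
    """Per-frame motion of body key-points.

    keypoints: list of T frames, each a list of K [x, y, confidence] triples
               (T = 64 and K = 137 in the setting described above).
    Returns T frames of K [dx, dy] pairs, where frame t holds the change
    relative to frame t - 1 and frame 0 is all zeros.
    """
    diffs = []
    for t, frame in enumerate(keypoints):
        previous = keypoints[t - 1] if t > 0 else frame
        diffs.append([
            [x - px, y - py]
            for (x, y, _), (px, py, _) in zip(frame, previous)
        ])
    return diffs


# Two frames with two key-points each.
toy = [
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.5]],
    [[1.0, 2.0, 1.0], [1.0, 0.0, 0.5]],
]
motion = frame_differences(toy)
```

The original key-points and this motion stream would then be fed to the two convolutional paths that are concatenated later in the network.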

3D-CNN : Recently, 3D convolutional networks have shown promising performance for video classification and action recognition, including the C3D network [? ] and the I3D network [? ]. We applied C3D [? ], using both the code released by the authors and our own implementation, to our proposed data sets with and without a model pre-trained on Sport-1M [? ]. The model did not converge in any of our experiments. We adopted the architecture of the I3D networks proposed in [? ] and employed its suggested implementation details. This network is an inflated version of Inception-V1 [? ], which contains several 3D convolutional layers followed by 3D max-pooling layers and inflated Inception-V1 submodules. We started from a network pre-trained on ImageNet [? ] and Kinetics [? ]. We optimized the objective function using standard SGD with momentum set to 0.9. We started with a base learning rate of 10^-2 and reduced it by a factor of 10 at epoch 20, when the validation loss saturated.
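The optimization settings can be summarized in a few lines. This is a sketch with assumptions: the helper names are hypothetical, epochs are counted from zero, and the update rule is the common "velocity" form of SGD with momentum rather than the exact formulation of any particular framework.

```python
def learning_rate(epoch, base_lr=1e-2, drop_epoch=20, factor=10.0):
    """Step schedule: base rate, divided by `factor` from `drop_epoch` on."""
    return base_lr if epoch < drop_epoch else base_lr / factor


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter.

    The velocity accumulates past gradients; the weight moves along it.
    """
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


# First update of a parameter with value 1.0 and gradient 0.5.
w, v = sgd_momentum_step(1.0, 0.5, 0.0, lr=learning_rate(0))
```

With a 40 epoch budget, this schedule spends the first half of training at 10^-2 and the second half at 10^-3.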

5 Experimental Result

We trained all of the methods described in section 4 on the four MS-ASL subsets (ASL100, ASL200, ASL500 and ASL1000) and computed the accuracy on the test set, which includes only subjects that are not part of the training data. As described in subsection 3.5, we report two evaluation metrics: average per class accuracy and average per class top-five accuracy, both in percent. The results are reported in Table 3 and Table 4, respectively. We did not over-optimize the training parameters. Hence, these results constitute baselines for 2D-CNN, 3D-CNN, CNN-LSTM-HMM and body key-point based approaches. The experimental results suggest that this data set is very difficult for 2D-CNNs, or at least that the LSTM could not propagate the recurrent information well. In video classification data sets such as UCF101 [? ] or HMDB51 [? ], the image itself carries context information relevant to the classification, while in MS-ASL there is minimal context information in a single image. Re-Sign [? ], which is reported as state-of-the-art on several sign language data sets, did not perform well on the challenging MS-ASL data set (this method does not produce top-five predictions). The body key-point based approach (HCN) does relatively better than 2D-CNN, but there is considerable room for improvement given the simplicity of the network and expected future improvements in hand key-point extraction. On the other hand, our 3D-CNN baseline achieved good results on this challenging, uncontrolled data set, and we propose it as a powerful network for sign language recognition.

Method | ASL100 | ASL200 | ASL500 | ASL1000
Naive Classifier | 0.99 | 0.50 | 0.21 | 0.11
VGG+LSTM [? ? ] | 13.33 | 7.56 | 1.47 | -
HCN [? ] | 46.08 | 35.85 | 21.45 | 15.49
Re-Sign [? ] | 45.45 | 43.22 | 27.94 | 14.69
I3D [? ] | 81.76 | 81.97 | 72.50 | 57.69
Table 3: The average per class accuracy (in percent) for baseline methods on the proposed ASL data sets.

Method | ASL100 | ASL200 | ASL500 | ASL1000
Naive Classifier | 4.86 | 2.49 | 1.05 | 0.58
VGG+LSTM [? ? ] | 33.42 | 21.21 | 5.86 | -
HCN [? ] | 73.98 | 60.29 | 43.83 | 32.50
I3D [? ] | 95.16 | 93.79 | 89.80 | 81.08
Table 4: The average per class top-five accuracy (in percent) for baseline methods on the proposed data sets.

5.1 Qualitative Discussion

Figure 5 illustrates the confusion matrix obtained by comparing the ground-truth labels with the labels predicted by the I3D model trained on the ASL200 data set. As expected, most of the mass lies on the diagonal. Below are the brightest off-diagonal points, with values above 0.25, which represent the worst per class predictions:

- Good labeled as Thanks (.4): the sign Good is often done without the base hand, in which case the sign can mean Thanks or Good.

- Water labeled as Mother (.33): both are signed by placing the dominant hand around the chin area, while the details differ.

- Today labeled as Now (.33): there are two versions of Today, one of which is signing Now twice.

- Not labeled as Nice (.33), Aunt labeled as Nephew (.33), Tea labeled as Nurse (.33)

- Start labeled as Finish (.3)

- My labeled as Please (.28): both are signed by placing the dominant hand on the chest, with a clockwise motion for Please and a gentle slap for My.

Figure 4: The accuracy of trained models based on frequency of training samples.
Figure 5: The confusion matrix for I3D on the ASL200 data set.

We carried out similar investigations for the other data sets and found interesting evidence of language ambiguity that could be resolved within context. Therefore, the error of the model is a combination of language ambiguity and prediction error. Our observations show that with smaller training sets the model error mainly comes from prediction errors, while for classes with more samples the error can come from language ambiguity. This motivated our use of top-five accuracy as the second metric, since these predictions eventually need to be fed into a language model with context.

5.2 The Effect of Pre-Trained Model

The fact that I3D trained on ASL200 outperformed I3D trained on ASL100 was counterintuitive, as ASL200 contains twice as many classes as ASL100. We verified this result with further experiments. To study the effect of the data, we evaluated the I3D model trained on ASL200 on the ASL100 test set. The average per class accuracy reached 83.36%, which made the earlier result even more puzzling. The only explanation we could propose is the lack of adequate training video samples in ASL100, which has fewer than four thousand. This prompted a new experiment: we trained I3D on ASL100 using the same settings as in the previous experiments, except that we used the model trained on ASL200 as the pre-trained model instead of the ImageNet+Kinetics pre-trained model. The result was 85.32% average per class accuracy and 96.53% average per class top-five accuracy, a boost of more than 3.5 percentage points in average per class accuracy. This is a valid experimental approach, as the test and train sets are still separated due to signer independence. This supports our reasoning and suggests that the existing out-of-domain pre-trained models can easily be outperformed by in-domain pre-trained models specific to sign language recognition. We propose the model trained on MS-ASL as an I3D pre-trained model for sign language recognition tasks.

5.3 The Effect of Number of Classes

In order to evaluate the effect of the number of classes on model prediction, we tested the I3D model trained on the ASL1000 training set on the ASL500, ASL200 and ASL100 test sets. This allowed a comparison between the model trained on 100 classes and the one trained on 1000 classes on the same test set. We ran similar experiments for all possible pairs and report the average per class accuracy in Table 5. In this table, each row corresponds to the subset the model was trained on and each column to the subset it was tested on. Increasing the number of classes, in either training or testing, decreased the accuracy. Doubling the number of test classes led to a small change from 83.36% to 81.97%, and doubling the number of training classes to a change from 85.32% to 83.36%. This suggests that the observed effect is significantly smaller when we have more video samples per class.

I3D trained on | Tested on ASL100 | Tested on ASL200 | Tested on ASL500 | Tested on ASL1000
ASL100 | 85.32% | - | - | -
ASL200 | 83.36% | 81.97% | - | -
ASL500 | 80.61% | 78.73% | 72.50% | -
ASL1000 | 75.38% | 74.78% | 68.49% | 57.69%
Table 5: The average per class accuracy of I3D models trained on different subsets of the MS-ASL data set (rows) and tested on different subsets (columns).

5.4 The Effect of Number of Video Samples

In order to determine the number of video samples per word needed for a good model, we analyzed the accuracy as a function of the number of training samples, as illustrated in Figure 4. It shows the test accuracy of the models in our experiments based on the frequency of training data. The curve is somewhat similar for all four experiments, suggesting that the accuracy correlates directly with the number of training video samples for classes with fewer than 40 video samples. However, for classes with more than 40 video samples, the difficulty of the signs may be more important. Although we have an average accuracy of 80% for classes with more than 40 training video samples, this does not suggest that 40 is the sweet spot. A direct comparison cannot be made, as this data set lacks classes with significantly more than 40 video samples. The dip in the curve at x=54 for all networks belongs to the class Good, which is the only class with 54 training samples. We have discussed this class in subsection 5.1.

6 Conclusion

In this paper, we proposed the first large-scale ASL data set with 222 signers and signer independent sets. Our dataset contains a large class count of 1000 signs recorded in challenging and unconstrained conditions. We evaluated the state-of-the-art network architectures and approaches as the baselines on our data set and demonstrated that I3D outperforms current state-of-the-art methods by a large margin. We also estimated the effect of number of classes on the recognition accuracy.

For future work, we propose applying optical flow to the videos, as it is a strong information extraction tool. We can also try leveraging body key-points and segmentation in the training phase only. We believe that the introduction of this large-scale data set will encourage and enable the sign language recognition community to catch up with the latest computer vision trends.
