SF-Net: Structured Feature Network for Continuous Sign Language Recognition

  • 2019-08-04 13:34:41
  • Zhaoyang Yang, Zhenmei Shi, Xiaoyong Shen, Yu-Wing Tai

Abstract

Continuous sign language recognition (SLR) aims to translate a signing sequence into a sentence. It is very challenging as sign language is rich in vocabulary, while many signs contain similar gestures and motions. Moreover, it is weakly supervised as the alignment of signing glosses is not available. In this paper, we propose Structured Feature Network (SF-Net) to address these challenges by effectively learning multiple levels of semantic information in the data. The proposed SF-Net extracts features in a structured manner and gradually encodes information at the frame level, the gloss level and the sentence level into the feature representation. The proposed SF-Net can be trained end-to-end without the help of other models or pre-training. We tested the proposed SF-Net on two large scale public SLR datasets collected from different continuous SLR scenarios. Results show that the proposed SF-Net clearly outperforms previous sequence level supervision based methods in terms of both accuracy and adaptability.

 

Zhaoyang Yang (1, *), Zhenmei Shi (2, *), Xiaoyong Shen (1), Yu-Wing Tai (1)
(1) Tencent, (2) Hong Kong University of Science and Technology
(*) Equal contribution

1 Introduction

Sign language is considered to be the most structured form of gestural communication. It is commonly used by deaf people as their primary means of daily communication but is difficult for people unfamiliar with it to understand. Gloss, which represents the closest meaning of a sign in the natural language, is generally defined to be the unit of sign language [26]. A gloss is typically made up of one or more hand gestures, motions, facial emotions and the transitions between them. A single change in one of these components can result in another sign that has a very different meaning (see Figure 1 for examples).

Figure 1: Samples of glosses that look similar in the Chinese Sign Language Dataset [14]. Yellow and blue boxes represent hand locations in the two frames. In each gloss pair, the gloss on the top differs from the one on the bottom only in either motion or gesture. However, their meanings are very different.

Continuous sign language recognition (SLR) aims to recognize glosses in a signing sequence. It is different from isolated SLR, in which each sign has been segmented and annotated independently. It is also different from sign language translation (SLT) [6], which involves an additional step to translate recognized glosses into a grammatical sentence. In continuous SLR, no segmentation or alignment is given, only the sentence level annotation for the whole signing sequence. This requires the model to learn not only frame level and gloss level features to distinguish different glosses, but also sentence level features to infer alignment and construct the sentence.

In recent years, deep learning [22] has achieved outstanding performance in many vision tasks. Successful work exists on applying deep learning techniques to continuous SLR [20, 19, 14, 21, 28]. However, the task remains challenging even with deep learning. On the one hand, models used in these methods learn features and alignments from the frame level. This may limit the representativeness of the features, as single frames are far from the completion of a gloss. Also, the number of frames that a gloss lasts may vary dramatically, which could introduce uncertainty in alignment learning. On the other hand, some methods still need the help of additional models such as Hidden Markov Models (HMMs) or language models to construct the final sentence. This may limit the adaptability of the method, as it requires careful tweaking of the whole system for a specific dataset.

In order to address these challenges, in this paper, we propose Structured Feature Network (SF-Net). The proposed SF-Net learns features in a structured manner to gradually encode information at the frame level, the gloss level and the sentence level into the feature representation. The translated sentence can be obtained by doing greedy decoding using the final features. As a result, the alignment can be inferred from the gloss level rather than from the frame level. While different network designs are used for different levels of feature learning, the whole network can be trained end-to-end without the help of other models and pre-training.

We tested the proposed SF-Net on two large scale public SLR datasets, the Chinese Sign Language (CSL) dataset [14] and the RWTH-PHOENIX-Weather-2014 dataset [18], which represent continuous SLR in a laboratory environment and in the real world respectively. The final results show that the proposed SF-Net outperforms previous sequence level supervision based methods on both datasets. We also show step by step the effectiveness of several network designs in SF-Net.

2 Related Work

The work in this paper falls into the topic of sign language recognition (SLR), which is also related to sequence to sequence learning and human action and gesture recognition. We hereby provide a literature review on these topics.

Sign Language Recognition. SLR can be divided into isolated SLR and continuous SLR. Isolated SLR discusses scenarios where signs are segmented so that each sample of data contains only one running gloss as the recognition target. Much work exists on successfully recognizing isolated signs [23, 11, 12]. Differently, continuous SLR is a more challenging scenario where several running glosses are signed in a continuous sequence in a single sample of data. In this case, the task is weakly supervised as signs are not segmented, and only the overall transcripts are given, without temporal alignment information. Most existing continuous SLR methods divide the task into three stages, including temporal segmentation or alignment learning, isolated SLR, and sentence construction with language models [38, 21]. While these methods have achieved convincing performance, they may have to be trained with additional supervisions or the help of other pre-trained models, which requires careful tweaking of the whole system for a specific dataset. End-to-end continuous SLR methods also exist [3, 7, 14]. However, these methods learn features and alignments on the frame level, which may fail to fully investigate the semantic information in the data. Differently, the proposed SF-Net can capture different levels of information in the data by extracting features in a structured manner. Another related topic in the scope would be sign language translation (SLT) [6], which takes SLR as the first step and adds one additional step to translate recognized glosses into common sentences. In this paper, we focus on the SLR part only.

Sequence to Sequence Learning. It is natural to think of continuous SLR as a sequence to sequence task, as it translates a sequence of running glosses into a sequence of words. Most methods in this topic are based on either the Encoder-Decoder framework [30] or Connectionist Temporal Classification (CTC). The Encoder-Decoder framework nowadays generally incorporates the attention mechanism [2] to learn long term dependencies and the alignment between the source sequence and the target sequence. Alternatively, CTC aims to learn a comprehensive scoring function for the whole sequence instead of classification scores for each of the single frames. These two methods have been successfully applied to speech recognition [1, 5], text recognition [34, 8], video captioning [35, 4] and neural machine translation [24]. However, the unit in these tasks can be easily defined and processed (for example, source words in language translation). This is different from continuous SLR, where the unit, which is supposed to be the gloss, can hardly be pre-defined because glosses vary a lot in length. Moreover, Encoder-Decoder framework based methods generally need a lot of data to learn the mapping between the source and the target sequence. Such data is not available for continuous SLR.

Action and Gesture Recognition. Continuous SLR shares some similarities with action and gesture recognition as they all deal with body language. However, they are mostly based on different foundations. Gesture recognition generally deals with stationary hand shapes or body postures. Most efforts go into detecting key parts of the body (such as hands), which may have a significant impact on the subsequent classification [29, 25]. Action recognition is closer to SLR as they both learn body motions in time series. Recent advances in network architecture have considerably improved the performance of action recognition on benchmark datasets [9, 31, 39]. However, actions in each of the samples are complete and well-defined, making them suitable for classification. On the contrary, several different glosses may appear in a single sequence in continuous SLR. Nonetheless, network architectures that can extract action features effectively have given us insights in designing the proposed SF-Net.

3 Structured Feature Network

Figure 2: Overview of the proposed SF-Net. Squares in the figure are feature maps while strip-shape rectangles are one-dimension feature vectors. Their copies represent the expansion in the temporal dimension. The three levels of feature extraction are distinguished using different colors.

Continuous SLR takes a sequence of signing frames as input and learns to directly output the target sequence of glosses in the right order. In this task, there are implicitly three levels of information in the data that need to be considered. First, the frame level. The signing gesture and facial emotion are important information for distinguishing different glosses. They are the bottommost level of information in the task and can be captured by processing and extracting features in the frames. Second, the gloss level. Signing glosses are made up of several gestures, emotions and motions (in fact, we can also consider holding a gesture as one kind of body motion). As a result, independent frames are far from the completion of a gloss. Therefore, information from several frames has to be combined to form features at this level. Finally, the sentence level. Different glosses are performed in a continuous sequence without explicit segmentation. In order to align and translate the signing sequence to the target sentence, gloss level features need to be re-organized at this level so that context information in the sequence can be encoded.

We propose the Structured Feature Network (SF-Net). Unlike previous methods that may not have fully investigated the information discussed above, the proposed SF-Net uses different network designs to learn features in three levels which can be paired to the levels of information in the task. By effectively learning features in this structured manner, information at the three levels can be gradually encoded into the final feature representation and the task can be made end-to-end trainable without the help of other methods or pre-training. An overview of the proposed SF-Net is shown in Figure 2.

3.1 Frame Level Feature

Feature learning at this level focuses on the gestures and facial emotions in each frame. As in many other applications, this can be done effectively by stacking several 2D convolutional layers. As each sample in continuous SLR is a sequence of signing frames, a mini-batch of samples can be represented as a 5D matrix $I \in \mathbb{R}^{B \times T \times C \times H \times W}$, where $B$, $T$, $C$, $H$ and $W$ denote the batch size, the length of the sequence, the number of channels in each frame, and the height and width of each frame respectively. The 2D convolution can then be done per sample per frame as:

$$Y^{2D}_{i,t,k,y,z} = \sum_{c=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} I_{i,t,c,h+y,w+z} \, K^{2D}_{k,c} \tag{1}$$

where $i, t, k, y, z$ are indexes of the output, $Y$ is the output and $K^{2D}$ is the 2D kernel.

This operation treats each frame independently, which may have some shortcomings in extracting features for sign language. This is because there are many fast and small motions (such as quick finger movements) in sign language. These motions last only for a few frames and the difference between these frames may be too small to observe without comparing them directly. Therefore, in order to capture these fast and small motions, we propose to incorporate 3D convolutional layers [16] that take adjacent frames into account during feature extraction at the frame level. The 3D convolution is done per sample as:

$$Y^{3D}_{i,x,k,y,z} = \sum_{c=0}^{C-1} \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} I_{i,t+x,c,h+y,w+z} \, K^{3D}_{k,c} \tag{2}$$

where $K^{3D}$ is the 3D kernel. We do not reduce the temporal dimension during 3D convolution, so in SF-Net $X = T$.

Inspired by the MiCT Network [39], after each 2D and 3D convolution, we merge the features of the two branches with a cross-domain element-wise summation. As a result, the final output is:

$$Y_{i,t,k,y,z} = Y^{3D}_{i,t,k,y,z} + Y^{2D}_{i,t,k,y,z} \tag{3}$$

This operation can speed up learning and allow training of deeper architectures. At the same time, it lets the 3D convolution branch learn only residual temporal features, which in our case are the fast and small motions in sign language, to complement the features learned by 2D convolution. These 3D convolutions effectively add another sub-level of feature learning at the frame level. As a result, instead of stacking 2D convolution layers, we use several 2D/3D convolution blocks for frame level feature extraction (as shown in Figure 2). After the last convolution block, we apply global average pooling to reduce the dimension. The final feature is of dimension $Y \in \mathbb{R}^{B \times T \times K}$, where $K$ is the number of channels in the last block.

3.2 Gloss Level Feature

Gloss is the unit of sign language. However, in continuous SLR, the segmentation of these units is not available. This requires the network to align certain frames to a corresponding gloss in the target sentence. This alignment is hard to learn as isolated frames are far from the completion of a gloss. Although features at the frame level have encoded some fast motion information, the number of frames considered is still much smaller than the number of frames a gloss can last. It is therefore necessary to add a new level of feature learning to better encode gloss level information. We show the network design at this level in Figure 3.

Figure 3: Network design of the gloss level part of SF-Net. A framing operation is added in this level to capture gloss level motions and a regularizer is introduced to prevent overfitting. Framing settings in the figure have a window size of 3 and stride of 1. These settings are for illustration purpose only.

Inspired by the framing step in automatic speech recognition (ASR), we also add a framing step after the frame level feature extraction. Similar to the framing in ASR, given an input of length T, the window size L and the stride S, the number of meta frames generated is:

$$F = \frac{T - L}{S} + 1 \tag{4}$$

and each meta frame contains $L$ frames and is of dimension $[L \times K]$. After framing, the output of the frame level feature will be transformed into a 4D matrix of dimension $Y \in \mathbb{R}^{B \times F \times L \times K}$.
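The framing step of Eq. 4 can be illustrated with a minimal sketch in plain Python. The function names `num_meta_frames` and `make_meta_frames` are hypothetical, and the sketch assumes floor division, so trailing frames that do not fill a complete window are dropped; the paper does not state how such remainders are handled.

```python
from typing import List, Sequence, TypeVar

Item = TypeVar("Item")


def num_meta_frames(seq_len: int, window: int, stride: int) -> int:
    """Number of meta frames F for a sequence of length T (Eq. 4).

    Assumption: floor division is used, and a sequence shorter than
    the window yields no meta frame at all.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if seq_len < window:
        return 0
    return (seq_len - window) // stride + 1


def make_meta_frames(
    frames: Sequence[Item], window: int, stride: int
) -> List[List[Item]]:
    """Group per-frame features into overlapping meta frames of `window` frames."""
    count = num_meta_frames(len(frames), window, stride)
    return [list(frames[i * stride : i * stride + window]) for i in range(count)]


# Stand-in for T = 60 per-frame feature vectors (integers used for readability).
frame_features = list(range(60))
meta = make_meta_frames(frame_features, window=12, stride=3)
print(len(meta), meta[0][0], meta[-1][-1])
```

With a window of 12 and a stride of 3, a 60-frame sequence gives (60 - 12) / 3 + 1 = 17 meta frames, each of which overlaps its neighbour by 9 frames.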

In order to reduce the dimension and form more compact gloss level features, we add a long short term memory (LSTM) layer to learn the temporal dependencies between frames within each meta frame. The LSTM layer can encode temporal information into the feature while also preserving the ordering of frames during encoding. This is an important reason why we choose LSTM over other options, as the signing order is also key to distinguishing glosses. By taking the hidden state of the last frame as the feature of each meta frame, the output dimension of features at this level becomes $M \in \mathbb{R}^{B \times F \times H}$, where $H$ is the number of hidden nodes in the LSTM layer.

Note that the combination of the LSTM and the 3D convolution at the frame level creates an effective temporal modeling architecture, where the 3D convolution takes care of the short term fast motions and the LSTM learns slower motions that have longer temporal dependencies. This fits the pattern of sign language, as both slow and fast motions can appear in signing a gloss.

Furthermore, as the amount of data available for training continuous SLR is limited, to prevent overfitting and fully develop the network capacity in the first two levels, we added a regularizer at the gloss level to enhance the generalization of the features. We first used a fully-connected layer followed by a Softmax activation to transform the meta frame features $M \in \mathbb{R}^{B \times F \times H}$ into probability distributions, where each entry in a distribution represents the likelihood of the meta frame being the corresponding gloss in the vocabulary. Then, the regularizer is realized without additional supervision by forcing these distributions to be close to the ones obtained at the sentence level. Specifically, let $P^{gl}$ be the probability distribution obtained at the gloss level and $P^{sl}$ the one obtained at the sentence level (which is also the one used for emitting the final output). We use the Kullback-Leibler divergence loss:

$$L_g = -\sum_{n=1}^{N} P^{sl}_n \log\left(\frac{P^{gl}_n}{P^{sl}_n}\right) \tag{5}$$

where $N$ is the vocabulary size.

This regularizer is introduced after the first few epochs to ensure stable training.
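As an illustration of Eq. 5, the following plain-Python sketch computes the regularizer for a single meta frame. The function name `gloss_regularizer` and the `eps` clamp are assumptions added for numerical safety; how the per-meta-frame values are aggregated over a batch (sum or mean) is not stated in the text and is left out here.

```python
import math
from typing import Sequence


def gloss_regularizer(
    p_sl: Sequence[float], p_gl: Sequence[float], eps: float = 1e-12
) -> float:
    """KL divergence between the sentence level distribution p_sl and the
    gloss level distribution p_gl for one meta frame (Eq. 5).

    Assumption: p_gl is clamped from below by `eps` to avoid log(0), and
    entries with p_sl == 0 contribute nothing (the usual 0 * log 0 = 0).
    """
    if len(p_sl) != len(p_gl):
        raise ValueError("distributions must cover the same vocabulary")
    loss = 0.0
    for sl, gl in zip(p_sl, p_gl):
        if sl > 0.0:
            loss -= sl * math.log(max(gl, eps) / sl)
    return loss


# Toy vocabulary of two glosses: the loss vanishes when both levels agree.
print(gloss_regularizer([0.5, 0.5], [0.5, 0.5]))
print(gloss_regularizer([0.9, 0.1], [0.5, 0.5]))
```

The loss is zero when the two distributions coincide and grows as the gloss level prediction drifts away from the sentence level one.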

3.3 Sentence Level Feature

Context information is important for continuous SLR and other sequence to sequence tasks to learn the alignment between the source and the target sequence. In this last level of sentence feature learning, we follow a standard setup used in many other sequence to sequence tasks. We add a Bi-Directional LSTM (BiLSTM) which takes as input the gloss level feature $M \in \mathbb{R}^{B \times F \times H}$ and re-organizes these features to encode context information in both directions into the feature representation. The final sentence level feature is of dimension $O \in \mathbb{R}^{B \times F \times 2H}$, as the features of the two directions are concatenated.

These features are then fed into a fully connected layer that casts them into the prediction space. We choose the Connectionist Temporal Classification (CTC) [10] as the loss function over the Encoder-Decoder framework, as the latter tends to overfit to seen target sequence patterns. As a result, the loss function is:

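$$L_s = L_{CTC} = -\log\left(P(\boldsymbol{y} \mid \boldsymbol{x})\right) \tag{6}$$

where $\boldsymbol{y}$ is the target sequence of glosses and $P(\boldsymbol{y} \mid \boldsymbol{x})$ is the sum of probabilities of all decoding paths that will result in $\boldsymbol{y}$ after collapsing repetitions and removing blanks.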

When combined with the regularizer in the gloss level, the loss function becomes:

$$L = L_s + [E > E_{start}] \, L_g \tag{7}$$

where $E$ is the current training epoch index, $E_{start}$ is the epoch index at which the regularizer is introduced, and the bracket $[E > E_{start}]$ equals 1 when the condition holds and 0 otherwise. During testing, the final output can be obtained by simply doing greedy decoding on the probability $P^{sl}$.
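Greedy decoding and the epoch-gated loss of Eq. 7 can be sketched in plain Python as follows. The function names are hypothetical, and placing the CTC blank at index 0 is an assumption, since the text does not say which index the blank occupies.

```python
from typing import List, Sequence


def greedy_ctc_decode(probs: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Pick the most likely label per meta frame, then collapse repeated
    labels and remove blanks (the standard CTC collapsing rule).

    Assumption: `blank` is the index of the CTC blank label.
    """
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]
    decoded: List[int] = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def total_loss(ctc_loss: float, gloss_loss: float, epoch: int, start_epoch: int) -> float:
    """Eq. 7: the gloss level regularizer only contributes once epoch > start_epoch."""
    return ctc_loss + (gloss_loss if epoch > start_epoch else 0.0)


# Seven meta frames over a toy vocabulary {blank, gloss 1, gloss 2}.
# The best path is 1 1 _ 1 2 2 _, which collapses to [1, 1, 2].
toy_probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
]
print(greedy_ctc_decode(toy_probs))
```

The blank between the second and third meta frame is what allows the same gloss to be emitted twice in a row; without it, the two occurrences would be merged into one.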

4 Experiments

We conducted experiments on two large scale continuous SLR datasets, Chinese Sign Language (CSL) dataset [14] and RWTH-PHOENIX-Weather-2014 dataset [18]. We show evidence on the effectiveness of several design choices of the proposed SF-Net and also compare it to other methods. Qualitative results of SF-Net on full videos are provided in the Appendix.

4.1 Datasets

The Chinese Sign Language (CSL) dataset [14] is a dataset collected in a laboratory environment. There are 50 signers and 100 unique sentences in the dataset. Each signer performed each of the 100 sentences 5 times, giving 25,000 samples in total and more than 100 hours of footage. Videos are collected with a Microsoft Kinect camera and post-processed to a unified resolution of 720 × 1280 and a frame rate of 30 frames per second (FPS). The dataset also has a word-level version, where the same 50 signers have each performed 500 unique words once. As no official split is provided, we did the split ourselves and assigned 20,000 and 5,000 samples to the training set and testing set respectively. When splitting the dataset, we ensured that no signer appears in both sets.
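A signer-independent split of this size can be sketched as below. Since every signer contributes 500 samples, 20,000 training and 5,000 testing samples correspond to 40 and 10 signers. Which signers were held out is not stated, so the choice of the last 10 signer indices here is purely hypothetical, as are the function name and the tuple layout.

```python
from typing import List, Set, Tuple

# Hypothetical sample record: (signer id, sentence id, repetition index).
Sample = Tuple[int, int, int]


def split_by_signer(
    samples: List[Sample], test_signers: Set[int]
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that no signer appears in both the training and testing set."""
    train: List[Sample] = []
    test: List[Sample] = []
    for sample in samples:
        (test if sample[0] in test_signers else train).append(sample)
    return train, test


# 50 signers x 100 sentences x 5 repetitions = 25,000 samples.
all_samples = [
    (signer, sentence, repetition)
    for signer in range(50)
    for sentence in range(100)
    for repetition in range(5)
]
held_out = set(range(40, 50))  # hypothetical choice of 10 test signers
train_set, test_set = split_by_signer(all_samples, held_out)
print(len(train_set), len(test_set))
```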

The RWTH-PHOENIX-Weather-2014 dataset [18] is a real world SLR dataset which represents a more challenging scenario. It is recorded from a public television broadcast in Germany. It contains 6841 unique sentences performed by 9 signers. Signers all wear dark clothes and sentences are performed in front of an artificial grey background. There are about 80,000 running glosses in the dataset, giving more than 10 hours of footage in total. With a vocabulary of size 1231, it is much richer in vocabulary than the CSL dataset. Videos have been post-processed to a unified resolution of 210 × 260 and an FPS of 25. We follow the official split of the dataset, which assigns 5672, 540 and 629 samples to training, validation and testing respectively.

4.2 Settings

Our settings for the frame level part are shown in Figure 4. In the rest of the network, we use 1 LSTM layer with 512 hidden nodes for the gloss level part and 1 BiLSTM layer with 256 hidden nodes in each direction for the sentence level part. The window size L we choose for the gloss level framing is 12, which is approximately 0.5 seconds for both datasets. The framing stride S is set to 3. Batch normalization [15] is used after every 2D convolution and 2D/3D block. Moreover, sequence-wise batch normalization [1] is used for the LSTM and BiLSTM layers.

Figure 4: Network settings for the frame level part.

For the CSL dataset, we center-cropped all video frames to reduce the blank area in the frames. We then resized the frames to 224 × 224 as a final step of pre-processing. For the RWTH-PHOENIX-Weather-2014 dataset, we simply resized all frames to 256 × 256 and randomly cropped a 224 × 224 area as data augmentation during training. We used the Adam optimizer [17] for training the networks with an initial learning rate of 1e-4 and a weight decay of 1e-5. The learning rate was decreased by a factor of 0.5 halfway through training. We trained the network for 40 and 60 epochs respectively for the two datasets.

We use the word error rate (WER) as the evaluation metric for the purpose of comparison with results reported in other work. It is defined as:

$$\mathrm{WER} = \frac{\#\text{substitution} + \#\text{deletion} + \#\text{insertion}}{\#\text{words in the target}} \tag{8}$$

Note that, when the output is Chinese, we consider each Chinese character as a unique word for better comparison with results of previous methods.
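The metric of Eq. 8 can be computed with a standard edit-distance recursion. The sketch below is a minimal plain-Python version, not the evaluation script used for the reported numbers; the function name is hypothetical, and the result is a fraction, whereas the tables report percentages.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER as in Eq. 8: the minimum number of substitutions, deletions and
    insertions needed to turn the hypothesis into the reference, divided
    by the number of words in the reference (target)."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


# Gloss-level example: one substitution out of three target glosses.
print(word_error_rate(["a", "b", "c"], ["a", "x", "c"]))
# Chinese output: each character is treated as one word, so the strings
# are split into characters first. One deletion out of four characters.
print(word_error_rate(list("今天下雨"), list("今天雨")))
```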

4.3 Network Design Analyses

Figure 5: Comparison of feature maps after the first convolution (or block). We show 4 frames of 2 samples downsampled from 16 frames in the original sequence. For each sample, the first row is the original frame, the second row is the feature map learned without 3D convolution and the last row is the feature map learned with 3D convolution. In each of the feature maps, areas that have been given more attention are colored brighter.

2D/3D Convolution Block. We first tested the effectiveness of adding additional branches of 3D convolutions in the frame level feature extraction. We conducted experiments on both the word-level CSL dataset and sentence-level CSL dataset and compared the performance of the SF-Net when training with and without 3D convolutions. When doing experiments on the word-level dataset, the sentence level part of SF-Net has been removed. Also, a fixed length of 2s (60 frames) of video is cut out from a random position in the original word-level videos and downsampled to contain only 12 frames. By doing this, the framing stage in the gloss level part of SF-Net would only generate one meta frame. We use the feature of this meta frame for classifying the video. Results on the testing set are shown in Table 1.

We can see that the 3D convolution branch has brought nearly 3.5% of accuracy gain over the model without 3D convolution in word-level classification. Similarly, at the sentence level, the WER has been reduced by more than 2%. This indicates that the fast and small motions that exist in sign language are indeed important information for distinguishing glosses, and that this information can be successfully captured by 3D convolutions. We give a feature map comparison in Figure 5 for further analysis.

| | Word (error rate, %) | Sentence (WER, %) |
|---|---|---|
| Without 3D | 17.3 | 7.1 |
| With 3D | 13.0 | 4.7 |

Table 1: Comparison of performance when training with and without 3D convolutions. Results are classification error rate and WER for the word-level and sentence-level respectively.

It can be observed that feature maps learned by 2D convolutions simply have highlights at arm, head and leg positions in the current frame. On the contrary, after 3D convolution is introduced, the feature maps either have additional highlights at arm or hand positions in adjacent frames, or only have highlights for the portion of the body that moved. Both can be a way of encoding fast motions. Moreover, we can see that feature maps learned by the 2D/3D convolution block show fewer highlights in irrelevant areas, such as the leg areas. This may be owing to the branch merging strategy, which helps achieve better gradient propagation. Both of these properties of the 2D/3D convolution block can help stabilize learning and improve the final performance.

Figure 6: Comparison of alignment. We show 12 frames of 3 samples downsampled from 24 frames in the original sequence. For each sample, we first show outputs given by SF-Net trained without framing, and then those of SF-Net trained with framing and LSTM. Different bracket colors indicate different meta frames. Blank outputs are colored in blue and will be removed in decoding. Outputs that will cause errors in decoding are colored in red.

Gloss Level Feature. We then tested the effectiveness of adding the gloss level feature extraction. We conducted experiments on both the CSL dataset (both word-level and sentence-level) and the RWTH-PHOENIX-Weather-2014 dataset and compared the performance of the SF-Net when training with and without the gloss level part. For removing the gloss level part, we tried two approaches: keep the framing but simply concatenate the features in each meta frame without going through the LSTM layer, or remove both framing and LSTM so that output features from the frame level are fed into the sentence level directly. For the word-level CSL dataset, only the former approach is used. The gloss level regularizer was not used in this set of experiments. Results on the testing set are shown in Table 2.

| | CSL Word | CSL Sentence | RWTH |
|---|---|---|---|
| Without framing | - | 11.9 | 46.7 |
| With framing | 19.1 | 8.8 | 45.0 |
| With LSTM | 13.0 | 4.7 | 40.8 |

Table 2: Comparison of performance when training with and without gloss level feature extraction. Results are classification error rate for word-level CSL dataset and WER for sentence-level CSL and RWTH datasets.

We can see that there is a dramatic drop in performance on both datasets when the framing and LSTM in the gloss level part of the SF-Net are removed. This may mainly be because inferring the alignment between the input and the output directly from the frame level is much harder, as supervision is only given as glosses for the whole sentence and not as per-frame states. Without framing, the search space in decoding increases greatly, and this may require more powerful context information encoding to learn.

We can also see that when framing is used without the LSTM layer, the results get better but are still far below the performance when the LSTM is used. This indicates that modeling the temporal information in each meta frame is also important. Otherwise, there may be too much redundant information to achieve effective sentence level learning. Finally, by fully implementing the gloss level design of the SF-Net, we achieved the best results in this set of experiments. We show three alignment samples in Figure 6 to better reveal how framing has improved performance.

The errors made by the model trained without framing are typical of the errors that we observed in frame level alignment prediction, where one error can distort the whole output sequence. This is alleviated once framing is introduced. Framing makes the output predictions much sparser (24 predictions without framing, of which only 12 are shown due to the page limit, compared to 5 predictions with framing in Figure 6), which reduces the probability of introducing these errors. Furthermore, the prediction becomes more accurate as the LSTM has encoded the temporal dependencies between frames in each meta frame into the feature representation.

Gloss Level Regularizer. Finally, we tested the effectiveness of having an additional regularizer at the gloss level. We conducted comparison experiments on the RWTH-PHOENIX-Weather-2014 dataset. We also tuned the value of $E_{start}$ to see the impact of adding the regularizer at different stages of training. Results are shown in Table 3.

| | No reg | Epoch 1 | Epoch 5 | Epoch 15 | Epoch 25 |
|---|---|---|---|---|---|
| WER | 40.8 | 42.7 | 40.2 | 38.4 | 38.1 |

Table 3: Comparison of performance on the RWTH-PHOENIX-Weather-2014 dataset when adding gloss level regularizer in different stages of training.

It can be observed that when the regularizer is introduced in the early stage of training, the performance drops by nearly 2%. This may be because the network can have very unstable output probability distributions in the early stage of training, which makes learning difficult and results in worse convergence. In contrast, when we add the regularizer in the middle stage of training, it improves the final performance by more than 2.5%. This demonstrates its effectiveness.

However, we did not make similar observations when training on the CSL dataset, where the regularizer seems to have little impact on the result. We believe this is because the vocabulary size, the sentence length and the number of possible gloss combinations are smaller in the CSL dataset. On the contrary, the RWTH-PHOENIX-Weather-2014 dataset has a richer vocabulary and contains non-repeated, longer sentences, and some glosses appear only a few times. When learning on it, the regularizer can help to prevent overfitting on seen sentence patterns and better develop the capacity of the first two levels of the network.

4.4 Overall Performance

We did a thorough comparison between the performance of the proposed SF-Net and previous methods. To fully investigate the performance of the proposed SF-Net, we added a set of experiments where we initialized the frame level and the gloss level parts of the network with parameters learned in training on the word-level CSL dataset. This can help accelerate learning, though we observed that similar results can be obtained by training from scratch after increasing the number of training epochs. Moreover, to fully investigate the capacity of the algorithm, we also conducted a set of experiments where we used ResNet-18 [13] as our backbone architecture with all non-bottleneck layers changed to 2D/3D convolution blocks. For fair comparison, we only considered previous methods that are based on sentence level supervision. Methods that use other kinds of supervision, such as the frame-level state labels used in [21], are not included in this section. We report the results in Table 4 and Table 5 for the two datasets respectively. Most results for other methods are collected from their original papers or dataset release papers. We only re-trained SubUNet [3] for the CSL dataset.

Methods WER
DTW-HMM [38] 28.4
LSTM [33] 26.4
S2VT [32] 25.5
LSTM-A [37] 24.3
LSTM-E [27] 23.2
HAN [36] 20.7
LS-HAN [14] 17.3
SubUNet [3] 11.0
SF-Net (scratch) 4.8
SF-Net 3.8
Table 4: Comparison of performance of different methods on the CSL dataset.
Methods WER (Dev) WER (Test)
[18] 57.3 55.6
Deep Hand [19] 47.1 45.1
Deep Sign [20] 38.3 38.8
SubUNet [3] 40.8 40.7
[7] 39.4 38.7
LS-HAN [14] - 38.3
Align-iOpt [28] 37.1 36.7
SF-Net (scratch) 38.0 38.1
SF-Net 36.5 36.1
SF-Net(ResNet-18) 35.6 34.9
Table 5: Comparison of performance of different methods on the RWTH-PHOENIX-Weather-2014 dataset.
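For readers less familiar with the metric, the word error rate (WER) reported in these tables is the number of substitutions, deletions and insertions needed to turn the recognized gloss sequence into the reference, divided by the reference length. The sketch below is a minimal edit-distance implementation for illustration; it is not the official evaluation script of either dataset, and the gloss sequences in the example are made up.

```python
from typing import List, NamedTuple


class WerResult(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    wer: float  # word error rate in percent


def word_error_rate(reference: List[str], hypothesis: List[str]) -> WerResult:
    """Compute WER = 100 * (S + D + I) / len(reference) by edit-distance alignment."""
    if not reference:
        raise ValueError("reference must contain at least one gloss")
    n, m = len(reference), len(hypothesis)
    # table[i][j] holds (errors, S, D, I) for reference[:i] versus hypothesis[:j].
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # every reference gloss deleted
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # every hypothesis gloss inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (e, s, d, ins)  # match, no cost
            else:
                diagonal = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Keep the cheapest option; ties go to the first candidate listed.
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    errors, s, d, ins = table[n][m]
    return WerResult(s, d, ins, 100.0 * errors / n)


# Made-up gloss sequences, for illustration only.
reference = ["morgen", "nordwest", "regen", "kaum"]
hypothesis = ["morgen", "west", "regen"]
result = word_error_rate(reference, hypothesis)
```

In this example one gloss is substituted and one is deleted out of four reference glosses, so the WER is 50%. Because insertions are counted, WER can exceed 100% for very poor hypotheses.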

We can see that the proposed SF-Net achieves the best performance among these methods on both datasets. On the CSL dataset this holds even when training from scratch, while on the RWTH-PHOENIX-Weather-2014 dataset the model trained from scratch outperforms all previous methods except Align-iOpt [28]. When initializing from parameters pre-learned on the word-level CSL dataset, we observed further improvements in accuracy. This demonstrates the effectiveness and adaptability of SF-Net in learning in different scenarios.
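The partial initialization used for the non-scratch results can be sketched as follows. This is a framework-agnostic illustration that treats parameters as a name-to-value dictionary; the function name, the parameter names and the prefixes are hypothetical and do not reflect our actual implementation.

```python
from typing import Dict, List, Sequence


def init_from_pretrained(model_params: Dict[str, List[float]],
                         pretrained_params: Dict[str, List[float]],
                         prefixes: Sequence[str]) -> Dict[str, List[float]]:
    """Copy pretrained values for parameters whose names start with one of `prefixes`.

    All other parameters keep their current (for example randomly initialized) values.
    """
    merged = dict(model_params)
    for name, value in pretrained_params.items():
        if name in merged and name.startswith(tuple(prefixes)):
            merged[name] = list(value)
    return merged


# Hypothetical parameter names; real names depend on the implementation.
sentence_model = {
    "frame_level.conv1.weight": [0.0, 0.0],
    "gloss_level.lstm.weight": [0.0, 0.0],
    "sentence_level.bilstm.weight": [0.0, 0.0],
}
word_level_model = {
    "frame_level.conv1.weight": [0.3, -0.1],
    "gloss_level.lstm.weight": [0.7, 0.2],
    "classifier.weight": [1.0, 1.0],  # word-level head, not reused
}
initialized = init_from_pretrained(
    sentence_model, word_level_model, prefixes=["frame_level.", "gloss_level."]
)
```

Only the frame level and gloss level entries are overwritten; the sentence level parameters are left untouched and are learned on the continuous SLR data.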

However, we should still note that the performance on the RWTH-PHOENIX-Weather-2014 dataset is far from satisfactory for real-world applications. The high diversity, large vocabulary, limited number of training samples and the weakly supervised nature of the task all make this dataset challenging. Adding more regularization or more data are possible directions for future work to improve performance. Moreover, sign language is highly regional due to limited dissemination, education and standardization, which has led to the co-existence of many different variations of the language around the world. This has held back the development of algorithms and larger datasets in SLR. More work has to be done to bridge this gap in the future.

5 Conclusions

In this paper, we propose Structured Feature Network (SF-Net) to extract features from three levels of information that co-exist in continuous SLR. At the frame level, the proposed SF-Net combines 2D and 3D convolution to capture gesture, emotion, and fast, small motion information. Then, at the gloss level, a framing step generates meta frames, which are processed by an LSTM to form gloss level features. These features are further re-organized by the BiLSTM at the sentence level to encode context information.

We tested the proposed SF-Net on the CSL and the RWTH-PHOENIX-Weather-2014 datasets. The results demonstrate the effectiveness of several designs in the network and show that the proposed SF-Net outperforms previous methods based on sentence level supervision in terms of both accuracy and adaptability.

References

  • [1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In Proceedings of International Conference on Machine Learning, pp. 173–182. Cited by: §2, §4.2.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.
  • [3] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden (2017) Subunets: end-to-end hand shape and continuous sign language recognition. In Proceedings of IEEE International Conference on Computer Vision, pp. 3075–3084. Cited by: §2, §4.4, Table 4, Table 5.
  • [4] Y. Chen, S. Wang, W. Zhang, and Q. Huang (2018) Less is more: picking informative frames for video captioning. In Proceedings of European Conference on Computer Vision, pp. 358–373. Cited by: §2.
  • [5] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4774–4778. Cited by: §2.
  • [6] N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden (2018) Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7784–7793. Cited by: §1, §2.
  • [7] R. Cui, H. Liu, and C. Zhang (2017) Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 7361–7369. Cited by: §2, Table 5.
  • [8] P. Doetsch, A. Zeyer, and H. Ney (2016) Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition. In Proceedings of IEEE International Conference on Frontiers in Handwriting Recognition, pp. 361–366. Cited by: §2.
  • [9] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2018) SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982. Cited by: §2.
  • [10] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of International Conference on Machine learning, pp. 369–376. Cited by: §3.3.
  • [11] D. Guo, W. Zhou, H. Li, and M. Wang (2018) Online early-late fusion based on adaptive hmm for sign language recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 14 (1), pp. 8. Cited by: §2.
  • [12] D. Guo, W. Zhou, M. Wang, and H. Li (2016) Sign language recognition based on adaptive hmms with data augmentation. In Proceedings of IEEE International Conference on Image Processing, pp. 2876–2880. Cited by: §2.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.4.
  • [14] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li (2018) Video-based sign language recognition without temporal segmentation. In AAAI Conference on Artificial Intelligence, Cited by: §A.2, Figure 1, §1, §1, §2, §4.1, Table 4, Table 5, §4.
  • [15] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.2.
  • [16] S. Ji, W. Xu, M. Yang, and K. Yu (2013) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231. Cited by: §3.1.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.2.
  • [18] O. Koller, J. Forster, and H. Ney (2015) Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding 141, pp. 108–125. Cited by: §A.1, §A.2, §1, §4.1, Table 5, §4.
  • [19] O. Koller, H. Ney, and R. Bowden (2016) Deep hand: how to train a cnn on 1 million hand images when your data is continuous and weakly labelled. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3793–3802. Cited by: §1, Table 5.
  • [20] O. Koller, S. Zargaran, H. Ney, and R. Bowden (2016) Deep sign: hybrid cnn-hmm for continuous sign language recognition. In Proceedings of British Machine Vision Conference, Cited by: §1, Table 5.
  • [21] O. Koller, S. Zargaran, and H. Ney (2017) Re-sign: re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4297–4305. Cited by: §1, §2, §4.4.
  • [22] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • [23] T. Liu, W. Zhou, and H. Li (2016) Sign language recognition with long short-term memory. In Proceedings of IEEE International Conference on Image Processing, pp. 2871–2875. Cited by: §2.
  • [24] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §2.
  • [25] P. Molchanov, S. Gupta, K. Kim, and J. Kautz (2015) Hand gesture recognition with 3d convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–7. Cited by: §2.
  • [26] S. C. Ong and S. Ranganath (2005) Automatic sign language analysis: a survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), pp. 873–891. Cited by: §1.
  • [27] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui (2016) Jointly modeling embedding and translation to bridge video and language. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4594–4602. Cited by: Table 4.
  • [28] J. Pu, W. Zhou, and H. Li (2019) Iterative alignment network for continuous sign language recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4165–4174. Cited by: §1, Table 5.
  • [29] S. S. Rautaray and A. Agrawal (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review 43 (1), pp. 1–54. Cited by: §2.
  • [30] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112. Cited by: §2.
  • [31] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §2.
  • [32] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence-video to text. In Proceedings of IEEE International Conference on Computer Vision, pp. 4534–4542. Cited by: Table 4.
  • [33] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko (2014) Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. Cited by: Table 4.
  • [34] P. Voigtlaender, P. Doetsch, and H. Ney (2016) Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In Proceedings of IEEE International Conference on Frontiers in Handwriting Recognition, pp. 228–233. Cited by: §2.
  • [35] B. Wang, L. Ma, W. Zhang, and W. Liu (2018) Reconstruction network for video captioning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 7622–7631. Cited by: §2.
  • [36] W. Yang, J. Tao, and Z. Ye (2016) Continuous sign language recognition using level building based on fast hidden markov model. Pattern Recognition Letters 78, pp. 28–35. Cited by: Table 4.
  • [37] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville (2015) Describing videos by exploiting temporal structure. In Proceedings of IEEE International Conference on Computer Vision, pp. 4507–4515. Cited by: Table 4.
  • [38] J. Zhang, W. Zhou, and H. Li (2014) A threshold-based hmm-dtw approach for continuous sign language recognition. In Proceedings of International Conference on Internet Multimedia Computing and Service, pp. 237. Cited by: §2, Table 4.
  • [39] Y. Zhou, X. Sun, Z. Zha, and W. Zeng (2018) MiCT: mixed 3d/2d convolutional tube for human action recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 449–458. Cited by: §2, §3.1.

Appendix A

A.1 Framing Window Size

We conducted a set of experiments on the RWTH-PHOENIX-Weather-2014 dataset [18] to investigate the impact of the framing window size on the final performance. We fully implemented the proposed SF-Net without the frame level regularizer and only tuned the window size. Results are in Table S1.

Window Size  No framing  3     6     9     12    15    18
WER          46.7        45.9  43.1  40.7  40.8  41.0  41.5
Table S1: Comparison of performance on the RWTH-PHOENIX-Weather-2014 dataset when using different framing window sizes.

We can see that the performance drops when the window size is very small (3 or 6 frames). This may be because the number of frames is too small to really learn gloss level temporal dependencies, as we observe that most glosses take around 500 ms to perform. The performance then stays relatively stable for window sizes from 9 to 18, although we can observe a tendency for performance to decline as the size continues to grow. However, we were not able to increase it further, as the output sequence must remain longer than the target sequence. We set the window size to 12 to maximally reduce the number of meta frames without hurting performance.
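The length constraint mentioned above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a CTC-style loss [10], ignores any temporal downsampling in the frame level, and uses a hypothetical stride of half the window size, which is not a value stated in this appendix.

```python
from typing import List


def num_meta_frames(num_frames: int, window: int, stride: int) -> int:
    """Count the full windows of `window` frames taken every `stride` frames."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


def min_ctc_input_length(target: List[str]) -> int:
    """Shortest input length that CTC can align to `target`.

    Each label needs one step, and two identical neighbouring labels need a
    blank step between them.
    """
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


# Hypothetical example: a 200-frame video and a 15-gloss target sentence.
target = [f"gloss_{i}" for i in range(15)]
feasible = {}
for window in (12, 18):
    stride = window // 2  # assumed stride, for illustration only
    steps = num_meta_frames(200, window, stride)
    feasible[window] = (steps, steps >= min_ctc_input_length(target))
```

Larger windows produce fewer meta frames, so for short videos with long target sentences the number of meta frames can fall below the minimum length that the loss can align, which is why the window size cannot grow indefinitely. As a rough reference, if the video were sampled at 25 frames per second (an assumption here), a 500 ms gloss would span about 12 frames.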

A.2 Qualitative Results

We show qualitative results of Structured Feature Network (SF-Net) on full videos for the RWTH-PHOENIX-Weather-2014 dataset [18] and the Chinese Sign Language (CSL) dataset [14] in Figure S1 and Figure S2 respectively.

The RWTH-PHOENIX-Weather-2014 dataset is richer in expression (vocabulary and sentence length) but less diverse in performance (number of signers and signers’ dressings). Sentences in the dataset are unique, so none of the sentences in the validation and testing sets have been seen by the network during training. We can see that, although the training set is relatively small (compared to other sequence to sequence tasks), the proposed SF-Net is able to recognize running glosses in very long sequences (more than 200 frames). Also, although false recognitions exist, they do not appear to affect other recognitions in the sentence. This demonstrates the robustness of SF-Net. Moreover, many of the false recognitions made by SF-Net are close to the ground truth (e.g., northwest (northwest) to west (west), abwechseln (alternate) to wechselhaft (changeable)). They have little impact on understanding the whole sentence. However, bad cases (last 3 samples) also exist. Many of these cases are caused by infrequent glosses (e.g., kaum (barely), which appears only 41 times, and druck (pressure), which appears only 95 times among all 65,227 glosses in the training set) and out-of-vocabulary glosses (e.g., noch-nord (to-north) and von-unten (from-underneath)). Note that when there are out-of-vocabulary glosses, adjacent recognitions may be affected. This is because unseen signing patterns can introduce uncertainty in alignment inference.

The CSL dataset contains fewer sentences but is much more diverse in performance, as it contains more signers and does not unify their dressings. We can see that the proposed SF-Net is very capable of recognizing seen glosses even when they are signed by unseen signers who wear different clothes (we have ensured that signers in the training and testing sets do not overlap). Most of the false recognitions are related to prepositions (e.g., gloss ‘of’) that have no influence on the meaning of the whole sentence and can optionally be removed in practice. We did not find many bad cases, given the low word error rate (WER) of 3.8%. Some (the last sample) may be caused by irregular signing of glosses by the signer.

Figure S1: Recognition results on full videos of the RWTH-PHOENIX-Weather-2014 dataset. Deletion, insertion and substitution errors are colored in green, blue and red respectively. Sentences below each sample are ground truth and then results of SF-Net. Translations to English are single word based. __on__ and __off__ are starting and ending flags while * represents absence of glosses. Samples are chosen from the validation and testing sets.
Figure S2: Recognition results on full videos of the CSL dataset. Deletion, insertion and substitution errors are colored in green, blue and red respectively. Sentences below each sample are ground truth and then results of SF-Net. Translations to English are single word based. * in sentences represents absence of glosses. Samples are chosen from the testing set.