Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation

Abstract

Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss-level tokenization in order to work. We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
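
To illustrate why a CTC loss needs no ground-truth timing information, the following is a minimal pure-Python sketch, not the implementation used in the paper. It assumes the recognition branch emits a per-frame distribution over glosses plus a blank symbol, and that the joint objective is a weighted sum of the CTC recognition loss and a word-level cross-entropy translation loss. The function names and the weights `w_rec` and `w_trans` are hypothetical and chosen only for illustration.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """Toy CTC loss: -log p(target | probs), summed over all alignments.

    probs  : list of T per-frame distributions over the gloss vocabulary
    target : list of gloss ids without blanks (no frame-level timing needed)
    """
    # Interleave blanks around the glosses: [blank, g1, blank, g2, ..., blank]
    ext = [blank]
    for g in target:
        ext += [g, blank]
    size = len(ext)

    # alpha[s] = probability of all frame paths that end in ext[s] at frame t
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one symbol
            # Skip the blank only between two different glosses
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last gloss or on the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


def cross_entropy(word_probs, target_words):
    """Translation loss: summed negative log-probability of the target words."""
    return -sum(math.log(p[w]) for p, w in zip(word_probs, target_words))


def joint_loss(gloss_probs, glosses, word_probs, words, w_rec=1.0, w_trans=1.0):
    """Assumed joint objective: weighted sum of recognition and translation losses."""
    return (w_rec * ctc_neg_log_likelihood(gloss_probs, glosses)
            + w_trans * cross_entropy(word_probs, words))


# Two video frames, vocabulary {0: blank, 1: one gloss}, uniform predictions.
frames = [[0.5, 0.5], [0.5, 0.5]]
recognition = ctc_neg_log_likelihood(frames, [1])
total = joint_loss(frames, [1], [[0.5, 0.5]], [1])
```

In the two-frame example, three alignments collapse to the single gloss (gloss-gloss, gloss-blank, blank-gloss), so the CTC term marginalises over all of them instead of requiring a labelled frame boundary. This sketch works in probability space and would underflow on realistic sequence lengths; a practical version would use log-space arithmetic or a library implementation.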
