Cross-modality Data Augmentation for End-to-End Sign Language Translation

Abstract

End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly, without intermediate representations. It is a challenging task because of the modality gap between sign videos and texts and the scarcity of labeled data. Because of these challenges, end-to-end sign language translation (i.e., video-to-text) is less effective than the gloss-to-text approach (i.e., text-to-text). To tackle these challenges, we propose a novel Cross-modality Data Augmentation (XmDA) framework that transfers the powerful gloss-to-text translation capabilities to end-to-end sign language translation (i.e., video-to-text) by exploiting pseudo gloss-text pairs from the sign gloss translation model. Specifically, XmDA consists of two key components: cross-modality mix-up and cross-modality knowledge distillation. The former explicitly encourages the alignment between sign video features and gloss embeddings to bridge the modality gap. The latter uses the generation knowledge of gloss-to-text teacher models to guide spoken language text generation. Experimental results on two widely used SLT datasets, PHOENIX-2014T and CSL-Daily, demonstrate that the proposed XmDA framework significantly and consistently outperforms the baseline models. Extensive analyses confirm our claim that XmDA enhances spoken language text generation by reducing the representation distance between videos and texts, as well as by improving the processing of low-frequency words and long sentences.
