Abstract
Sign language translation (SLT) addresses the problem of translating information from a sign language in video to a spoken language in text. Existing studies, while showing progress, are often limited to narrow domains and/or few sign languages and struggle with open-domain tasks. In this paper, we push forward the frontier of SLT by scaling pretraining data, model size, and number of translation directions. We perform large-scale SLT pretraining on different data including 1) noisy multilingual YouTube SLT data, 2) parallel text corpora, and 3) SLT data augmented by translating video captions to other languages with off-the-shelf machine translation models. We unify different pretraining tasks with task-specific prompts under the encoder-decoder architecture, and initialize the SLT model with pretrained (m/By)T5 models across model sizes. SLT pretraining results on How2Sign and FLEURS-ASL#0 (ASL to 42 spoken languages) demonstrate the significance of data/model scaling and cross-lingual cross-modal transfer, as well as the feasibility of zero-shot SLT. We finetune the pretrained SLT models on 5 downstream open-domain SLT benchmarks covering 5 sign languages. Experiments show substantial quality improvements over the vanilla baselines, surpassing the previous state-of-the-art (SOTA) by wide margins.