Abstract
Sign language recognition and translation first uses a recognition module to generate glosses from sign language videos and then employs a translation module to translate the glosses into spoken-language sentences. Most existing works focus on the recognition step and pay less attention to sign language translation. In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation, which introduces an instruction module and a learning-based feature fusion strategy into a Transformer network. In this way, the language ability of the pre-trained model can be explored and utilized to further boost translation performance. Moreover, by exploring the representation space of sign language glosses and the target spoken language, we propose a multi-level data augmentation scheme to adjust the data distribution of the training set. We conduct extensive experiments on two challenging benchmark datasets, PHOENIX-2014-T and ASLG-PC12, on which our method outperforms the previous best solutions by 1.65 and 1.42 BLEU-4, respectively. Our code is published at https://github.com/yongcaoplus/TIN-SLT.