Abstract
Automatic Sign Language Translation requires the integration of both computer vision and natural language processing to effectively bridge the communication gap between sign and spoken languages. However, the lack of large-scale training data for sign language translation means we need to leverage resources from spoken language. We introduce Sign2GPT, a novel framework that utilizes large-scale pretrained vision and language models via lightweight adapters for gloss-free sign language translation. The lightweight adapters are crucial for sign language translation, due to the constraints imposed by limited dataset sizes and the computational requirements when training with long sign videos. We also propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses without requiring gloss order information or annotations. We evaluate our approach on two public benchmark sign language translation datasets, namely RWTH-PHOENIX-Weather 2014T and CSL-Daily, and improve on state-of-the-art gloss-free translation performance by a significant margin.