Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

Abstract

Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering in the pseudo glosses for better alignment via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives and allows for efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline that progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.
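
As a rough illustration of the in-context learning step described in the abstract, the sketch below assembles a few-shot prompt from text-gloss pairs. It is only a minimal sketch under stated assumptions: the function name `build_gloss_prompt`, the instruction wording, the `Text:`/`Gloss:` layout, and the example pairs are all hypothetical and are not taken from the paper, and the call to an actual LLM is left out.

```python
from typing import List, Tuple


def build_gloss_prompt(examples: List[Tuple[str, str]], sentence: str) -> str:
    """Assemble a few-shot prompt asking an LLM to draft glosses for `sentence`.

    The prompt layout is an assumption made for illustration; the paper's
    actual prompt is not given in the abstract.
    """
    lines = [
        "Convert each spoken-language sentence into a sequence of sign glosses.",
        "",
    ]
    # Each demonstration pair shows the LLM the expected input/output format.
    for text, gloss in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Gloss: {gloss}")
        lines.append("")
    # The query sentence ends with an open "Gloss:" slot for the LLM to fill.
    lines.append(f"Text: {sentence}")
    lines.append("Gloss:")
    return "\n".join(lines)


# Hypothetical text-gloss pairs, invented for this example only.
EXAMPLES = [
    ("Tomorrow it will rain in the north.", "TOMORROW NORTH RAIN"),
    ("The weekend stays sunny.", "WEEKEND SUN STAY"),
]

prompt = build_gloss_prompt(EXAMPLES, "Tonight it gets cold.")
print(prompt)
```

The draft glosses returned for such a prompt would follow spoken-language word order, which is why the abstract describes a subsequent reordering step before they are used as CTC targets.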
