Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

Abstract

Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL, the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by also applying it to How2Sign, an American Sign Language dataset, and achieve competitive results.
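As a rough illustration of how the three textual cues listed above might be combined before reaching the LLM, the following is a minimal sketch in plain Python. The container `ContextualCues`, the function `build_text_prompt`, the `max_previous` parameter, and the field labels ("Background:", "Previous:", "Glosses:", "Translation:") are all hypothetical names chosen for this example; the abstract does not specify a prompt format. The visual sign recognition features are deliberately left out, since they would be supplied to the model separately from this text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextualCues:
    """Textual cues for one signing segment (hypothetical container)."""
    background_caption: str = ""                                   # cue (i)
    previous_translations: List[str] = field(default_factory=list)  # cue (ii)
    pseudo_glosses: List[str] = field(default_factory=list)         # cue (iii)


def build_text_prompt(cues: ContextualCues, max_previous: int = 2) -> str:
    """Join the available cues into one prompt string.

    Empty cues are skipped, so the same function also covers ablations
    in which one or more cues are withheld. The layout is an assumption
    made for illustration, not the format used in the paper.
    """
    parts = []
    if cues.background_caption:
        parts.append(f"Background: {cues.background_caption}")
    # Keep only the most recent sentences; guard against max_previous == 0,
    # because list[-0:] would return the whole list.
    recent = cues.previous_translations[-max_previous:] if max_previous > 0 else []
    if recent:
        parts.append("Previous: " + " ".join(recent))
    if cues.pseudo_glosses:
        parts.append("Glosses: " + " ".join(cues.pseudo_glosses))
    parts.append("Translation:")
    return "\n".join(parts)


if __name__ == "__main__":
    example = ContextualCues(
        background_caption="A presenter walks along a beach.",
        previous_translations=["The tide is going out."],
        pseudo_glosses=["SEA", "WALK"],
    )
    print(build_text_prompt(example))
```

Skipping empty cues means any subset of them can be dropped without changing the code path, which is one simple way to set up the kind of per-cue ablation the abstract mentions.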
