SCOPE: Sign Language Contextual Processing with Embedding from LLMs

Abstract

Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language Contextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.
