Abstract
Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.