SignDiff: Learning Diffusion Models for American Sign Language Production

Abstract

Over the past decade, the field of Sign Language Production (SLP) has lacked a large-scale, pre-trained deep learning model for continuous American Sign Language (ASL) production. This limitation hampers communication for all individuals with disabilities who rely on ASL. To address this issue, we undertook the secondary development and utilization of How2Sign, one of the largest publicly available ASL datasets. Despite its significance, prior researchers in the field of sign language have not effectively employed this corpus due to the intricacies involved in American Sign Language Production (ASLP). To conduct large-scale ASLP, we propose SignDiff, a dual-condition diffusion pre-training model, based on the latest work in related fields, that can generate human sign language speakers from a skeleton pose. SignDiff has a novel Frame Reinforcement Network called FR-Net, similar to dense human pose estimation work, which enhances the correspondence between text lexical symbols and sign language dense pose frames and reduces the occurrence of multiple fingers in the diffusion model. In addition, our ASLP method proposes two new improved modules and a new loss function to improve the accuracy and quality of sign language skeletal posture and to enhance the ability of the model to train on large-scale data. We propose the first baseline for ASL production and report BLEU-4 scores of 17.19 and 12.85 on the How2Sign dev and test sets, respectively. We also evaluated our model on the previous mainstream dataset, PHOENIX14T, where the main experiments achieved state-of-the-art (SOTA) results. In addition, our image quality exceeds all previous results by 10 percentage points on the SSIM metric. Finally, we conducted ablation studies and qualitative evaluations for discussion.
