Abstract
Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.