Abstract
To be truly understandable and accepted by Deaf communities, an automatic Sign Language Production (SLP) system must generate a photo-realistic signer. Prior approaches based on graphical avatars have proven unpopular, whereas recent neural SLP works that produce skeleton pose sequences have been shown to be not understandable to Deaf viewers. In this paper, we propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language. We employ a transformer architecture with a Mixture Density Network (MDN) formulation to handle the translation from spoken language to skeletal pose. A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence. This allows the photo-realistic production of sign videos directly translated from written text. We further propose a novel keypoint-based loss function, which significantly improves the quality of synthesized hand images, operating in the keypoint space to avoid issues caused by motion blur. In addition, we introduce a method for controllable video generation, enabling training on large, diverse sign language datasets and providing the ability to control the signer appearance at inference. Using a dataset of eight different sign language interpreters extracted from broadcast footage, we show that SignGAN significantly outperforms all baseline methods for quantitative metrics and human perceptual studies.