Abstract
Sign language generation aims to produce diverse sign representations based on spoken language. However, achieving realistic and natural generation remains a significant challenge due to the complexity of sign language, which encompasses intricate hand gestures, facial expressions, and body movements. In this work, we introduce PHOENIX14T+, an extended version of the widely used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, Hamer and Smplerx. We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven co-generation of pose modalities, online collaborative correction across modalities, and realistic sign video synthesis. First, by incorporating text semantics, we design a joint sign language generator that simultaneously produces posture coordinates, gesture actions, and body movements. The text encoder, based on a Transformer architecture, extracts semantic features, while a cross-modal attention mechanism integrates these features to generate diverse sign language representations, ensuring accurate mapping and controlling the diversity of modal features. Next, online collaborative correction is introduced to refine the generated pose modalities using a dynamic loss weighting strategy and cross-modal attention, which lets the modalities complement one another, eliminates spatiotemporal conflicts, and ensures semantic coherence and action consistency. Finally, the corrected pose modalities are fed into a pre-trained video generation network to produce high-fidelity sign language videos. Extensive experiments demonstrate that SignAligner significantly improves both the accuracy and expressiveness of the generated sign videos.