Abstract
Sign language transition generation seeks to convert discrete sign language segments into continuous sign videos by synthesizing smooth transitions. However, most existing methods merely concatenate isolated signs, resulting in poor visual coherence and semantic accuracy in the generated videos. Unlike textual languages, sign language is inherently rich in spatial-temporal cues, making it more complex to model. To address this, we propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs by capturing the unique spatial-temporal dependencies of sign language. Specifically, we first train an encoder-decoder architecture to learn a structure-aware representation of spatial-temporal skeleton sequences. Next, we optimize a diffusion denoiser conditioned on the representations learned by the pre-trained encoder, which is tasked with predicting transition frames from noise. Additionally, we design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features. Extensive experiments conducted on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the superior performance of our method.
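
To make the notion of spatial-temporal modeling on skeleton sequences concrete, the following minimal sketch shows a generic graph-based block: each joint's features are first averaged with those of its skeleton neighbours (spatial step), and the result is then averaged over a short window of frames (temporal step). It is an illustration only and is not the Sign-GCN module, whose design is not specified in this abstract. The function names, the row-normalised adjacency, the three-joint toy skeleton, and the window size are all assumptions, and the learned weights a real layer would carry are omitted.

```python
# Illustrative sketch only: a generic spatial-temporal graph block on skeleton
# sequences. All names and choices here are assumptions, not the paper's Sign-GCN.

def normalized_adjacency(num_joints, edges):
    """Row-normalised adjacency with self-loops: D^-1 (A + I)."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:  # undirected skeleton bones
        a[i][j] = 1.0
        a[j][i] = 1.0
    return [[v / sum(row) for v in row] for row in a]


def spatial_graph_conv(frame, adj):
    """Average each joint's features with those of its skeleton neighbours.

    frame: list of joints, each a list of C feature values (e.g. x, y).
    """
    num_joints, channels = len(frame), len(frame[0])
    return [
        [sum(adj[i][j] * frame[j][c] for j in range(num_joints))
         for c in range(channels)]
        for i in range(num_joints)
    ]


def temporal_smooth(sequence, window=3):
    """Average each joint's features over a centred window of frames.

    The window is clamped at the start and end of the sequence.
    """
    half = window // 2
    out = []
    for t in range(len(sequence)):
        span = sequence[max(0, t - half):min(len(sequence), t + half + 1)]
        out.append([
            [sum(f[i][c] for f in span) / len(span)
             for c in range(len(sequence[t][i]))]
            for i in range(len(sequence[t]))
        ])
    return out


def spatial_temporal_block(sequence, edges):
    """Apply the spatial step to every frame, then the temporal step."""
    adj = normalized_adjacency(len(sequence[0]), edges)
    return temporal_smooth([spatial_graph_conv(f, adj) for f in sequence])


# Toy usage: a 3-joint chain (0-1-2) observed over 4 frames, 2 channels per joint.
toy_edges = [(0, 1), (1, 2)]
toy_sequence = [[[float(t), 0.0] for _ in range(3)] for t in range(4)]
toy_output = spatial_temporal_block(toy_sequence, toy_edges)
```

In a trained model, each step would additionally multiply the aggregated features by learned weight matrices and apply a nonlinearity; the sketch keeps only the aggregation pattern over the skeleton graph and over time.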