Abstract
Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability. Additionally, SyncTalk++ enhances robustness to out-of-distribution (OOD) audio by incorporating an Expression Generator and a Torso Restorer, which generate speech-matched facial expressions and seamless torso regions. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Extensive experiments and user studies demonstrate that SyncTalk++ outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk++.