Abstract
Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long-sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.