Abstract
Recent advances in diffusion-based lip-syncing generative models have demonstrated their ability to produce highly synchronized talking face videos for visual dubbing. Although these models excel at lip synchronization, they often struggle to maintain fine-grained control over facial details in generated images. In this work, we identify a "lip averaging" phenomenon, in which the model fails to preserve subtle facial details when dubbing unseen in-the-wild videos. This issue arises because the commonly used UNet backbone primarily integrates audio features into visual representations in the latent space via cross-attention mechanisms and multi-scale fusion, but it struggles to retain fine-grained lip details in the generated faces. To address this issue, we propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences while maintaining accurate lip synchronization. Specifically, our method comprises two primary components: (1) an Identity Perceiver module that encodes facial embeddings to align with conditioned audio features; and (2) an ID-CrossAttn module that injects facial embeddings into the generation process, enhancing the model's capability for identity retention. Extensive experiments demonstrate that, at a modest training and inference cost, UnAvgLip effectively mitigates the "averaging" phenomenon in lip inpainting, preserving unique facial characteristics while maintaining precise lip synchronization. Compared with the original approach, our method demonstrates significant improvements of 5% on the identity consistency metric and 2% on the SSIM metric across two benchmark datasets (HDTF and LRW).
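
To make the two components concrete, the following is a minimal NumPy sketch of one way identity conditioning could be injected alongside audio conditioning. It is not the implementation described in this work. It assumes a decoupled cross-attention design in which a separate identity branch is added to the audio branch, a single attention head, and random weights. The names `identity_perceiver`, `id_cross_attn_block`, `id_scale`, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product cross-attention.
    q = queries @ w_q                      # (n_queries, d)
    k = context @ w_k                      # (n_context, d)
    v = context @ w_v                      # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v    # (n_queries, d)

def identity_perceiver(face_emb, w_proj, n_tokens, d):
    # Hypothetical stand-in: project one face embedding into a few
    # identity tokens living in the same space as the audio tokens.
    return (face_emb @ w_proj).reshape(n_tokens, d)

def id_cross_attn_block(latent, audio_tokens, id_tokens, params, id_scale=1.0):
    # Assumed design: audio and identity branches attend separately and
    # their outputs are summed onto the latent as a residual update.
    audio_out = cross_attention(latent, audio_tokens, *params["audio"])
    id_out = cross_attention(latent, id_tokens, *params["id"])
    return latent + audio_out + id_scale * id_out

# Toy sizes, all assumed: 16 latent positions, 5 audio tokens, 4 identity tokens.
rng = np.random.default_rng(0)
d, n_tokens = 8, 4
latent = rng.standard_normal((16, d))
audio_tokens = rng.standard_normal((5, d))
face_emb = rng.standard_normal(32)
w_proj = rng.standard_normal((32, n_tokens * d))
params = {
    "audio": tuple(rng.standard_normal((d, d)) for _ in range(3)),
    "id": tuple(rng.standard_normal((d, d)) for _ in range(3)),
}

id_tokens = identity_perceiver(face_emb, w_proj, n_tokens, d)
out = id_cross_attn_block(latent, audio_tokens, id_tokens, params)
audio_only = id_cross_attn_block(latent, audio_tokens, id_tokens, params, id_scale=0.0)
```

In this sketch, setting `id_scale` to zero recovers the audio-only update, so the identity branch acts as an additive correction on top of the existing audio conditioning rather than a replacement for it.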