UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Abstract

Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, restricting previous models to training on specific annotations and thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely, PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated datasets. These datasets contain a wide range of audio domains, covering multilingual speech voices and songs, thereby scaling the training data from commonly employed datasets, typically less than 1 hour, to 18.5 hours. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% for BIWI dataset and 13.7% for Vocaset. Additionally, the pre-trained UniTalker exhibits promise as the foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page https://github.com/X-niper/UniTalker.
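
As a rough illustration of the multi-head idea mentioned in the abstract, the sketch below shares one encoder across datasets and attaches a separate output head per annotation format. It is not the authors' implementation: the class name `MultiHeadSketch`, the dataset keys, the dimensions, and the use of plain random linear maps in place of a real audio encoder and decoder are all assumptions made for illustration.

```python
import random


def linear(in_dim, out_dim, rng):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, x):
    """Multiply a weight matrix by a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class MultiHeadSketch:
    """Shared encoder followed by one output head per annotation format."""

    def __init__(self, audio_dim, hidden_dim, head_dims, seed=0):
        rng = random.Random(seed)
        # The encoder is shared, so every dataset contributes to training it.
        self.encoder = linear(audio_dim, hidden_dim, rng)
        # One head per dataset; each maps to that dataset's own output size.
        self.heads = {
            name: linear(hidden_dim, dim, rng) for name, dim in head_dims.items()
        }

    def forward(self, audio_frame, dataset):
        hidden = apply(self.encoder, audio_frame)
        return apply(self.heads[dataset], hidden)


# Hypothetical dataset names and output sizes, chosen only for illustration.
model = MultiHeadSketch(
    audio_dim=8,
    hidden_dim=4,
    head_dims={"vertices_a": 15, "blendshapes_b": 5},
)
frame = [0.5] * 8
out_a = model.forward(frame, "vertices_a")
out_b = model.forward(frame, "blendshapes_b")
print(len(out_a), len(out_b))
```

The point of the design is that the same audio features can be decoded into differently shaped targets, so datasets with incompatible annotations can be trained together through the shared encoder.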
