Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis

Abstract

This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI's Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in interviewer training.
