Abstract
Emotional Mimicry Intensity (EMI) estimation is a key technology for understanding human social behavior and improving human-computer interaction, where the core challenge lies in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods, namely insufficient exploitation of cross-modal synergy, sensitivity to noise, and limited fine-grained alignment capability, this paper proposes a dual-stage cross-modal alignment framework. First, we construct vision-text and audio-text contrastive learning networks based on an improved CLIP architecture, achieving preliminary alignment in the feature space through modality-decoupled pre-training. Subsequently, we design a temporal-aware dynamic fusion module that combines Temporal Convolutional Networks (TCN) and a gated bidirectional LSTM to capture, respectively, the macro-evolution patterns of facial expressions and the local dynamics of acoustic features. We further introduce a quality-guided modality fusion strategy that enables modality compensation under occlusion and noise through differentiable weight allocation. Experimental results on the Hume-Vidmimic2 dataset show that our method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40%. Ablation studies further validate the effectiveness of the dual-stage training strategy and the dynamic fusion mechanism, providing a new technical pathway for fine-grained emotion analysis in open environments.
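As an illustration only, the quality-guided fusion and the reported metric can be sketched in a few lines of plain Python. The sketch assumes that each modality comes with a scalar quality score and that the differentiable weight allocation is a softmax over those scores; the abstract does not specify either detail. The function names, the `temperature` parameter, and the zero-variance convention in `pearson` are hypothetical choices made for this example, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def quality_guided_fusion(features, quality_scores, temperature=1.0):
    """Fuse per-modality feature vectors with softmax weights.

    features: dict mapping modality name -> list of floats (equal lengths).
    quality_scores: dict mapping modality name -> scalar quality estimate
        (assumed here; a low score models occlusion or noise).
    Returns the fused vector and the weight assigned to each modality.
    """
    names = sorted(features)
    weights = softmax([quality_scores[n] / temperature for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += w * value
    return fused, dict(zip(names, weights))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # assumed convention for constant inputs
    return cov / math.sqrt(vx * vy)


def mean_pearson(preds, targets):
    """Average Pearson correlation over emotion dimensions.

    preds and targets are lists of per-dimension sequences, for example
    six lists when there are six emotion dimensions.
    """
    scores = [pearson(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

With equal quality scores the modalities are averaged; when one modality's score drops sharply, its weight approaches zero and the fused vector falls back on the remaining modality, which is the compensation behavior the abstract describes.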