Abstract
Facial affective behavior analysis (FABA) is crucial for understanding human mental states from images. However, traditional approaches primarily deploy models to discriminate among discrete emotion categories, and lack the fine granularity and reasoning capability needed for complex facial behaviors. Multi-modal Large Language Models (MLLMs) have proven successful in general visual understanding tasks. However, directly harnessing MLLMs for FABA is challenging due to the scarcity of datasets and benchmarks, the neglect of facial prior knowledge, and low training efficiency. To address these challenges, we introduce (i) an instruction-following dataset for two FABA tasks, namely emotion recognition and action unit recognition, (ii) a benchmark, FABA-Bench, with a new metric that considers both recognition and generation ability, and (iii) a new MLLM, "EmoLA", as a strong baseline for the community. Our initiative on the dataset and benchmarks reveals the nature and rationale of facial affective behaviors, i.e., fine-grained facial movement, interpretability, and reasoning. Moreover, to build an effective and efficient FABA MLLM, we introduce a facial prior expert module with face structure knowledge and a low-rank adaptation module into a pre-trained MLLM. We conduct extensive experiments on FABA-Bench and four commonly used FABA datasets. The results demonstrate that the proposed facial prior expert boosts performance and that EmoLA achieves the best results on our FABA-Bench. On commonly used FABA datasets, EmoLA is competitive, rivaling task-specific state-of-the-art models.