Abstract
Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.