Abstract
Fine-grained emotion recognition (FER) plays a vital role in various fields, such as disease diagnosis, personalized recommendations, and multimedia mining. However, existing FER methods face three key challenges in real-world applications: (i) because emotions are complex and ambiguous in reality, they rely on large amounts of continuously annotated data to ensure accuracy, and collecting such data is costly and time-consuming; (ii) they cannot capture the temporal heterogeneity caused by changing emotion patterns, because they usually assume that the temporal correlation within sampling periods is the same; (iii) they do not consider the spatial heterogeneity of different FER scenarios, that is, the distribution of emotion information in different data may have bias or interference. To address these challenges, we propose a Spatio-Temporal Fuzzy-oriented Multi-modal Meta-learning framework (ST-F2M). Specifically, ST-F2M first divides the multi-modal videos into multiple views, where each view corresponds to one modality of one emotion. Multiple randomly selected views for the same emotion form a meta-training task. Next, ST-F2M uses an integrated module with spatial and temporal convolutions to encode the data of each task, reflecting the spatial and temporal heterogeneity. It then adds fuzzy semantic information to each task based on generalized fuzzy rules, which helps handle the complexity and ambiguity of emotions. Finally, ST-F2M learns emotion-related general meta-knowledge through meta-recurrent neural networks to achieve fast and robust fine-grained emotion recognition. Extensive experiments show that ST-F2M outperforms various state-of-the-art methods in terms of accuracy and model efficiency. In addition, we conduct ablation studies and further analysis to explore why ST-F2M performs well.