Abstract
Emotion recognition based on physiological signals has recently become an intensively researched field. The use of multi-modal, multi-channel physiological signals has significantly improved the performance of emotion recognition systems, owing to the complementarity of these signals. However, effectively integrating emotion-related semantic information from different modalities and capturing inter-modal dependencies remains challenging. Many existing multimodal fusion methods ignore either the token-to-token or the channel-to-channel correlations of multichannel signals from different modalities, which limits the classification capability of the models. In this paper, we propose a comprehensive perspective on multimodal fusion that integrates channel-level and token-level cross-modal interactions. Specifically, we introduce a unified cross attention module called Token-chAnnel COmpound (TACO) Cross Attention to perform multimodal fusion, which simultaneously models channel-level and token-level dependencies between modalities. Additionally, we propose a 2D position encoding method to preserve information about the spatial distribution of EEG signal channels. We also use two transformer encoders ahead of the fusion module to capture long-term temporal dependencies from the EEG signal and the peripheral physiological signal, respectively. Subject-independent experiments on the emotion datasets DEAP and Dreamer demonstrate that the proposed model achieves state-of-the-art performance.