Abstract
Multimodal sentiment analysis enhances conventional sentiment analysis, which traditionally relies solely on text, by incorporating information from different modalities such as images, text, and audio. This paper proposes a novel multimodal sentiment analysis architecture that integrates text and image data to provide a more comprehensive understanding of sentiment. For text feature extraction, we utilize BERT, a natural language processing model. For image feature extraction, we employ DINOv2, a vision-transformer-based model. The textual and visual latent features are integrated using three proposed fusion techniques: the Basic Fusion Model, the Self Attention Fusion Model, and the Dual Attention Fusion Model. Experiments on three datasets (Memotion 7k, MVSA single, and MVSA multi) demonstrate the viability and practicality of the proposed multimodal architecture.