Abstract
Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than images for stance detection and this trend persists across languages. Additionally, VLMs rely significantly more on text contained within the images than other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether they are explicitly multilingual or not, although there are outliers that are incongruous with macro F1, language support, and model size.