Abstract
Misinformation is a prevalent societal issue due to its potential high risks. Out-of-context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which are essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and innate capability for visual reasoning and explanation generation, they still lack sophistication in understanding and discovering the subtle cross-modal differences. In this paper, we introduce SNIFFER, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. SNIFFER employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities, and the second stage leverages OOC-specific instruction data generated by language-only GPT-4 to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval, SNIFFER not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that SNIFFER surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. SNIFFER also provides accurate and persuasive explanations, as validated by quantitative and human evaluations.