COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation

Abstract

Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as Llama 3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise with a dedicated IndicBERT-based comment classifier, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.
