Abstract
Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as Llama 3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments and filtering out noise with a dedicated IndicBERT-based comment classifier, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.