Mukhyansh: A Headline Generation Dataset for Indic Languages

Abstract

The task of headline generation within the realm of Natural Language Processing (NLP) holds immense significance, as it strives to distill the true essence of textual content into concise and attention-grabbing summaries. While noteworthy progress has been made in headline generation for widely spoken languages like English, there persist numerous challenges when it comes to generating headlines in low-resource languages, such as the rich and diverse Indian languages. A prominent obstacle that specifically hinders headline generation in Indian languages is the scarcity of high-quality annotated data. To address this crucial gap, we proudly present Mukhyansh, an extensive multilingual dataset, tailored for Indian language headline generation. Comprising an impressive collection of over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages, namely Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical analysis of existing works, we demonstrate that Mukhyansh outperforms all other models, achieving an impressive average ROUGE-L score of 31.43 across all 8 languages.
