Mukhyansh: A Headline Generation Dataset for Indic Languages

  • 2023-11-29 15:49:24
  • Lokesh Madasu, Gopichand Kanumolu, Nirmal Surange, Manish Shrivastava

Abstract

The task of headline generation within the realm of Natural Language Processing (NLP) holds immense significance, as it strives to distill the true essence of textual content into concise and attention-grabbing summaries. While noteworthy progress has been made in headline generation for widely spoken languages like English, there persist numerous challenges when it comes to generating headlines in low-resource languages, such as the rich and diverse Indian languages. A prominent obstacle that specifically hinders headline generation in Indian languages is the scarcity of high-quality annotated data. To address this crucial gap, we proudly present Mukhyansh, an extensive multilingual dataset, tailored for Indian language headline generation. Comprising an impressive collection of over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages, namely Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical analysis of existing works, we demonstrate that Mukhyansh outperforms all other models, achieving an impressive average ROUGE-L score of 31.43 across all 8 languages.

 
