Factorization of Fact-Checks for Low Resource Indian Languages

  • 2021-02-23 16:47:41
  • Shivangi Singhal, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Abstract

Advances in technology and the accessibility of the internet to every individual are revolutionizing real-time information. The freedom to express one's thoughts without passing through any credibility check is leading to the dissemination of fake content in the ecosystem, which can have disastrous effects on both individuals and society as a whole. The amplification of fake news is becoming rampant in India too. Debunked information often gets republished with a new description claiming that it depicts a different incident. To curb such fabricated stories, it is necessary to investigate these duplicates and the false claims made in public. The majority of studies on automatic fact-checking and fake news detection are restricted to English. But for a country like India, where only 10% of the literate population speaks English, the role of regional languages in spreading falsehoods cannot be underestimated. In this paper, we introduce FactDRIL: the first large-scale multilingual Fact-checking Dataset for Regional Indian Languages. We collect an exhaustive dataset across 7 months covering 11 low-resource languages. Our proposed dataset consists of 9,058 samples in English, 5,155 samples in Hindi, and the remaining 8,222 samples distributed across various regional languages, i.e., Bangla, Marathi, Malayalam, Telugu, Tamil, Oriya, Assamese, Punjabi, Urdu, Sinhala and Burmese. We also present a detailed characterization of the three M's (multi-lingual, multi-media, multi-domain) in FactDRIL, accompanied by the complete list of other attributes that make it a unique dataset to study. Lastly, we present some potential use cases of the dataset. We expect this dataset to be a valuable resource and to serve as a starting point for fighting the proliferation of fake news in low-resource languages.

 
