NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

  • 2022-08-01 17:55:04
  • Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, Ayu Purwarianti

Abstract

At the center of the underlying issues that halt the advancement of Indonesian natural language processing (NLP) research, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their datasets. Furthermore, the few public datasets that we have are scattered across different platforms, which makes performing reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest aggregation of datasheets with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia and encourage NLP practitioners to move towards collaboration.

 
