Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce

  • 2024-10-17 14:32:23
  • Nedjma Ousidhoum, Meriem Beloucif, Saif M. Mohammad

Abstract

Language is a symbolic capital that affects people's lives in many ways (Bourdieu, 1977, 1991). It is a powerful tool that accounts for identities, cultures, traditions, and societies in general. Hence, data in a given language should be viewed as more than a collection of tokens. Good data collection and labeling practices are key to building more human-centered and socially aware technologies. While there has been a rising interest in mid- to low-resource languages within the NLP community, work in this space has to overcome unique challenges such as data scarcity and access to suitable annotators. In this paper, we collect feedback from those directly involved in and impacted by NLP artefacts for mid- to low-resource languages. We conduct a quantitative and qualitative analysis of the responses and highlight the main issues related to (1) data quality such as linguistic and cultural data suitability; and (2) the ethics of common annotation practices such as the misuse of online community services. Based on these findings, we make several recommendations for the creation of high-quality language artefacts that reflect the cultural milieu of its speakers, while simultaneously respecting the dignity and labor of data workers.

 
