Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

  • 2022-01-25 03:05:23
  • Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien, Yacine Jernite

Abstract

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.

 
