Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

Abstract

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.
