Abstract
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving for models to accommodate languages of communities outside of the Global North, which include many languages that have been historically underrepresented in digital realms. These languages have been termed "low resource languages" or "long-tail languages", and LLM performance on these languages is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data on non-Western languages and on languages spoken by previously colonized people and indigenous people raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship and our own work to outline several relevant social, cultural, and ethical considerations and potential ways to address them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.