The Sociolinguistic Foundations of Language Modeling

  • 2024-07-12 14:12:55
  • Jack Grieve, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, Alejandro Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, Bodo Winter
In this paper, we introduce a sociolinguistic perspective on language modeling. We claim that large language models are inherently models of varieties of language, and we consider how this insight can inform the development and deployment of large language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective can help address five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. Ultimately, we argue that it is crucial to carefully define and compile training corpora that accurately represent the specific varieties of language being modeled to maximize the performance and societal value of large language models.


