Abstract
The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale. To this end, the paper focuses on removing three constraints that have previously limited work within dialectology/dialectometry. First, rather than assuming a fixed and incomplete set of variants, we use Computational Construction Grammar to provide a replicable and falsifiable set of syntactic features. Second, rather than assuming a specific area of interest, we use global language mapping based on web-crawled and social media datasets to determine the selection of national varieties. Third, rather than looking at a single language in isolation, we model seven major languages together using the same methods: Arabic, English, French, German, Portuguese, Russian, and Spanish. Results show that models for each language are able to robustly predict the region-of-origin of held-out samples better using Construction Grammars than using simpler syntactic features. These global-scale experiments are used to argue that new methods in computational sociolinguistics are able to provide more generalized models of regional variation that are essential for understanding language variation and change at scale.