Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

Abstract

This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
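
As a rough illustration of what a frequency-based corpus similarity measure can look like, the sketch below compares two corpora by the rank correlation of their word frequencies. The abstract does not say which measure the paper uses, so the choice of Spearman's rho, the whitespace tokenizer, the `n_features` cutoff, and all function names here are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean


def word_frequencies(text):
    """Count lowercased, whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(text_a, text_b, n_features=5000):
    """Compare two corpora over their jointly most frequent words."""
    fa, fb = word_frequencies(text_a), word_frequencies(text_b)
    vocab = [w for w, _ in (fa + fb).most_common(n_features)]
    return spearman([fa[w] for w in vocab], [fb[w] for w in vocab])
```

Under these assumptions, the reliability check described above would amount to computing `corpus_similarity` for pairs of samples from the same variety and for pairs from different sources (for example web versus tweets), then checking whether the resulting scores behave consistently across languages and countries.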
