Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

  • 2021-04-03 02:19:46
  • Jonathan Dunn


This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
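
To make the notion of a frequency-based corpus similarity measure concrete, the following is a minimal sketch, not the paper's implementation. It assumes one common choice of measure: the Spearman rank correlation between the frequencies of the most common character n-grams in two corpora. The function names (`char_ngrams`, `corpus_similarity`), the trigram features, and the `top_k` cutoff are all illustrative assumptions rather than details taken from the abstract.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def corpus_similarity(corpus_a, corpus_b, n=3, top_k=1000):
    """Spearman correlation over the top_k most frequent shared-pool n-grams.

    Assumption: features are chosen from the combined frequency counts of
    both corpora; other feature-selection schemes are equally plausible.
    """
    freq_a = Counter(char_ngrams(corpus_a, n))
    freq_b = Counter(char_ngrams(corpus_b, n))
    features = [gram for gram, _ in (freq_a + freq_b).most_common(top_k)]
    if len(features) < 2:
        return 0.0
    ranks_a = average_ranks([freq_a[g] for g in features])
    ranks_b = average_ranks([freq_b[g] for g in features])
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(ranks_a, ranks_b)
```

Under this sketch, the abstract's argument would correspond to computing `corpus_similarity` between a web corpus and a tweet corpus for each country, and then checking whether the resulting values are stable across languages and countries.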

