Mapping Languages: The Corpus of Global Language Use

Abstract

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language identification is essential for moving beyond majority languages. Given the focus on language mapping, the paper analyzes how well this digital language data represents actual populations by (i) systematically comparing the corpus with demographic ground-truth data and (ii) triangulating the corpus with an alternate Twitter-based dataset. In total, the corpus contains 423 billion words representing 148 languages (with over 1 million words from each language) and 158 countries (again with over 1 million words from each country), all distilled from Common Crawl web data. The main contribution of this paper, in addition to describing this publicly available corpus, is to provide a comprehensive analysis of the relationship between two sources of digital data (the web and Twitter) as well as their connection to underlying populations.
