Language statistics at different spatial, temporal, and grammatical scales

Abstract

Statistical linguistics has advanced considerably in recent decades as data has become available. This has allowed researchers to study how statistical properties of languages change over time. In this work, we use data from Twitter to explore English and Spanish, considering the rank diversity at different scales: temporal (intervals from 3 to 96 hours), spatial (radii from 3 km to 3000+ km), and grammatical (from monograms to pentagrams). We find that all three scales are relevant. However, the greatest changes come from variations in the grammatical scale. At the lowest grammatical scale (monograms), the rank diversity curves are most similar, independently of the values of the other scales, languages, and countries. As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales, as well as on the language and country. We also study the statistics of Twitter-specific tokens: emojis, hashtags, and user mentions. The rank diversity of these particular types of tokens shows sigmoid-like behaviour. Our results help to quantify which aspects of language statistics seem universal and what may lead to variations.
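The abstract does not define rank diversity, so the following is only a minimal sketch under an assumed definition that is common in the rank-diversity literature: for each rank k, count the distinct n-grams that occupy rank k across the time slices, and divide by the number of slices. The function names (`ngrams`, `ranked`, `rank_diversity`), the alphabetical tie-breaking rule, and the toy data are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ranked(tokens, n):
    """Rank the n-grams of one time slice by descending frequency.

    Ties are broken alphabetically so the ranking is deterministic
    (an assumption made for this sketch).
    """
    counts = Counter(ngrams(tokens, n))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered]


def rank_diversity(slices, n=1, max_rank=3):
    """Assumed definition: distinct n-grams seen at rank k, per slice.

    `slices` is a list of token lists, one per time interval.
    Returns a list d where d[k-1] is the diversity of rank k.
    """
    if not slices:
        raise ValueError("need at least one time slice")
    rankings = [ranked(tokens, n) for tokens in slices]
    diversity = []
    for k in range(max_rank):
        # Collect whichever n-gram sits at rank k+1 in each slice.
        seen = {r[k] for r in rankings if k < len(r)}
        diversity.append(len(seen) / len(rankings))
    return diversity


# Toy data: three hypothetical time slices of already tokenised text.
toy_slices = [
    ["a", "a", "a", "b", "b", "c"],
    ["a", "a", "a", "c", "c", "b"],
    ["a", "a", "a", "b", "b", "c"],
]
toy_diversity = rank_diversity(toy_slices, n=1, max_rank=3)
```

In this toy case the top rank is always held by the same token, so its diversity is the lowest possible, while the lower ranks swap between slices and score higher. Changing `n` moves along the grammatical scale (monograms to pentagrams), and changing how the slices are built would correspond to the temporal or spatial scale.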
