Language comparison via network topology

  • 2019-07-16 11:33:04
  • Blaž Škrlj, Senja Pollak
  • 33

Abstract

Modeling relations between languages can offer understanding of language characteristics and uncover similarities and differences between languages. Automated methods applied to large textual corpora can be seen as opportunities for novel statistical studies of language development over time, as well as for improving cross-lingual natural language processing techniques. In this work, we first propose how to represent textual data as a directed, weighted network by the text2net algorithm. We next explore how various fast, network-topological metrics, such as network community structure, can be used for cross-lingual comparisons. In our experiments, we employ eight different network topology metrics, and empirically showcase on a parallel corpus how the methods can be used for modeling the relations between nine selected languages. We demonstrate that the proposed method scales to large corpora consisting of hundreds of thousands of aligned sentences on an off-the-shelf laptop. We observe that, on the one hand, properties such as communities capture some of the known differences between the languages, while others can be seen as novel opportunities for linguistic studies.
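
To make the idea of a directed, weighted text network concrete, the following is a minimal Python sketch. It is not the text2net algorithm, whose details the abstract does not give. It assumes, purely for illustration, that nodes are lowercased whitespace tokens and that an edge (u, v) is weighted by how often v directly follows u. The function names text_to_network, topology_summary and summary_distance are hypothetical, and the three summary values stand in for the eight metrics used in the paper, which are not listed here.

```python
from collections import Counter


def text_to_network(sentences):
    """Build a directed, weighted token network (illustrative assumption).

    Edge (u, v) counts how often token v directly follows token u.
    Self-loops (a token followed by itself) are skipped.
    """
    edges = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for u, v in zip(tokens, tokens[1:]):
            if u != v:
                edges[(u, v)] += 1
    return edges


def topology_summary(edges):
    """Compute a few simple topology values for an edge-weight mapping."""
    nodes = {node for edge in edges for node in edge}
    n, m = len(nodes), len(edges)
    # Density of a directed graph without self-loops: m / (n * (n - 1)).
    density = m / (n * (n - 1)) if n > 1 else 0.0
    # Mean out-strength: total edge weight divided by the number of nodes.
    mean_out_strength = sum(edges.values()) / n if n else 0.0
    return {
        "nodes": n,
        "edges": m,
        "density": density,
        "mean_out_strength": mean_out_strength,
    }


def summary_distance(a, b, keys=("density", "mean_out_strength")):
    """L1 distance between two summaries over size-independent values."""
    return sum(abs(a[k] - b[k]) for k in keys)


if __name__ == "__main__":
    # Toy stand-ins for two sides of a parallel corpus.
    english = ["the cat sat on the mat", "the dog sat"]
    german = ["die katze sass auf der matte", "der hund sass"]
    s_en = topology_summary(text_to_network(english))
    s_de = topology_summary(text_to_network(german))
    print(s_en)
    print(s_de)
    print(summary_distance(s_en, s_de))
```

Under these assumptions, each language in a parallel corpus yields one small vector of topology values, and languages can be compared by the distance between their vectors. Community structure, which the abstract highlights, would need a dedicated graph library and is left out of this sketch.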

 
