Emerging Language Spaces Learned From Massively Multilingual Corpora

  • 2018-02-01 12:58:16
  • Jörg Tiedemann
Translations capture important information about languages that can be used as implicit supervision in learning linguistic properties and semantic representations. In an information-centric view, translated texts may be considered semantic mirrors of the original text, and the significant variation that we observe across languages can be used to disambiguate a given expression using the linguistic signal that is grounded in translation. Parallel corpora that contain massive amounts of human translations with large linguistic variation can be used to increase abstraction, and we propose the use of highly multilingual machine translation models to find language-independent meaning representations. Our initial experiments show that neural machine translation models can indeed learn in such a setup, and that the learning algorithm picks up information about the relations between languages in order to optimize transfer learning with shared parameters. The model creates a continuous language space that represents relationships in terms of geometric distances, which we can visualize to illustrate how languages cluster according to language families and groups. Does this open the door for new ideas of data-driven language typology with promising models and techniques in empirical cross-linguistic research?
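
To make the idea of a language space with geometric distances concrete, the following is a minimal sketch, not the model described above. It assumes each language is already represented by a vector and uses cosine distance as one plausible choice of metric. The vectors, the name `LANGUAGE_VECTORS`, and the helpers `cosine_distance` and `nearest_language` are hypothetical and serve only as illustration; the numbers are hand-picked toy values, not results from the paper.

```python
import math

# Toy, hand-picked 3-d vectors standing in for learned language embeddings.
# Illustrative assumptions only; these are NOT values from the paper.
LANGUAGE_VECTORS = {
    "es": [0.9, 0.1, 0.0],
    "pt": [0.8, 0.2, 0.1],
    "de": [0.1, 0.9, 0.1],
    "nl": [0.2, 0.8, 0.0],
    "fi": [0.0, 0.1, 0.9],
}


def cosine_distance(u, v):
    """Return 1 - cos(u, v); 0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_language(lang, vectors):
    """Return the language whose vector is closest to `lang` in the space."""
    return min(
        (other for other in vectors if other != lang),
        key=lambda other: cosine_distance(vectors[lang], vectors[other]),
    )


if __name__ == "__main__":
    for lang in LANGUAGE_VECTORS:
        print(lang, "->", nearest_language(lang, LANGUAGE_VECTORS))
```

With real embeddings taken from a trained multilingual model, the same pairwise distances could be passed to a clustering or 2-d projection routine to inspect whether related languages end up close together.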


