Linguistic Universals: Language-independent semantic fingerprints

Abstract

Finding out the meaning of words in context, as a central task in the semantic processing of natural languages, exhibits a data-size discrepancy: machines require a much larger amount of verbal training than average humans before they can interpret information and acquire knowledge. Using a Markov model, we assign language-independent semantic fingerprints to words in a particular document of moderate length, without consulting an external knowledge base or thesaurus. Instead of embedding words into very high-dimensional spaces, we represent each concept by a few dozen parameters, interpretable as algebraic invariants in succinct statistical operations on local environments of individual words. These semantic representations enable a robot reader both to understand short texts in a given language (automated question-answering) and to match medium-length texts across different languages (automated word translation). Our semantic fingerprints quantify the local meaning of words in 14 representative languages across 5 major language families, suggesting a universal and cost-effective mechanism by which human languages are processed at the semantic level.
