A mathematical model for universal semantics

Abstract

We characterize the meaning of words with language-independent numerical fingerprints, through a mathematical analysis of recurring patterns in texts. Approximating texts by Markov processes on a long-range time scale, we are able to extract topics, discover synonyms, and sketch semantic fields from a particular document of moderate length, without consulting an external knowledge base or thesaurus. Our Markov semantic model allows us to represent each topical concept by a low-dimensional vector, interpretable as algebraic invariants in succinct statistical operations on the document, targeting local environments of individual words. These language-independent semantic representations enable a robot reader to both understand short texts in a given language (automated question-answering) and match medium-length texts across different languages (automated word translation). Our semantic fingerprints quantify the local meaning of words in 14 representative languages across 5 major language families, suggesting a universal and cost-effective mechanism by which human languages are processed at the semantic level. Our protocols and source code are publicly available at https://github.com/yajun-zhou/linguae-naturalis-principia-mathematica
