# A mathematical model for universal semantics

###### Abstract

We characterize the meaning of words with language-independent numerical fingerprints, through a mathematical analysis of recurring patterns in texts. Approximating texts by Markov processes on a long-range time scale, we are able to extract topics, discover synonyms, and sketch semantic fields from a particular document of moderate length, without consulting an external knowledge base or thesaurus. Our Markov semantic model allows us to represent each topical concept by a low-dimensional vector, interpretable as algebraic invariants in succinct statistical operations on the document, targeting local environments of individual words. These language-independent semantic representations enable a robot reader to both understand short texts in a given language (automated question-answering) and match medium-length texts across different languages (automated word translation). Our semantic fingerprints quantify local meaning of words in 14 representative languages across 5 major language families, suggesting a universal and cost-effective mechanism by which human languages are processed at the semantic level. Our protocols and source code are publicly available at https://github.com/yajun-zhou/linguae-naturalis-principia-mathematica


## 1 Introduction

A quantitative model for the meaning of words helps us understand how we transmit information and absorb knowledge. Ideally, a universal mechanism of semantics should be based on numerical characteristics of human languages, transcending concrete written and spoken forms of verbal messages. In this work, we demonstrate, in both theory and practice, that the time structure of recurring language patterns is a good candidate for such a universal semantic mechanism. Through statistical analysis of recurrence times and hitting times, we numerically characterize connectivity and association of individual concepts, thereby devising language-independent semantic fingerprints (LISF).

Akin to the physical world, there is a hierarchy of length scales in languages. On short scales such as syllables, words, and phrases, human languages do not exhibit a universal pattern related to semantics. Except for a few onomatopoeias, the sounds of words do not affect their meaning [1]. Neither do morphological parameters [2] (say, singular/plural, present/past) or syntactic rôles [3] (say, subject/object, active/passive). In short, there are no universal semantic mechanisms at the phonological, lexical or syntactical levels [4]. Grammatical “rules and principles” [2, 3], however typologically diverse, play no definitive rôle in determining the inherent meaning of a word.

Motivated by the observations above, we will build our quantitative semantic model on long-range and language-independent textual features. Specifically, we will measure the lengths of text fragments flanked by word patterns of interest (Fig. 1). Here, a word pattern is a collection of content words that are identical up to morphological parameters and syntactic rôles. A content word signifies definitive concepts (like apple, eat, red), instead of serving purely grammatical or logical functions (like but, of, the). Fragment length statistics will tell us how tightly or loosely one concept is connected to another. This, in turn, will provide us with quantitative criteria for including or excluding different concepts within the same (computationally constructed) semantic field. Such statistical semantic mining will then pave the way for machine comprehension and machine translation.

## 2 Methodology

We quantify the time structure of an individual word pattern ${\mathsf{W}}_{i}$ through the statistics of its recurrence times ${\tau}_{ii}$. We characterize the dynamic impact of a word pattern ${\mathsf{W}}_{i}$ on another word pattern ${\mathsf{W}}_{j}$ by the statistics of their hitting times ${\tau}_{ij}$. In what follows, we will describe the statistical analyses of ${\tau}_{ii}$ and ${\tau}_{ij}$, on which we build a language-independent Markov model for semantics.

### 2.1 Recurrence times and topicality

Assuming uniform reading speed,^{1} we measure the recurrence times ${\tau}_{ii}$ for a word pattern ${\mathsf{W}}_{i}$ through ${n}_{ii}$ samples of the effective fragment lengths ${L}_{ii}$ (Figs. 1, 2a). Here, while counting as in Fig. 1, we ignore contacts between short-range neighbors, which may involve language-dependent redundancies.^{2}

^{1} On the scale of words (rather than phonemes), this assumption works fine in most languages that are written alphabetically. However, this working hypothesis does not extend to Japanese texts, which interlace Japanese syllabograms (lasting one mora per written unit) with Chinese ideograms (lasting one or more morae per written unit).

^{2} For example, a German phrase liebe Studentinnen und Studenten with short-range recurrence is the gender-inclusive equivalent of the English expression dear students. Some Austronesian languages (such as Malay and Hawaiian) use reduplication for plurality or emphasis.
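The counting step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' protocol: the function name `recurrence_gaps`, the representation of a word pattern as a set of surface forms, and the `min_gap` threshold for "short-range neighbors" are all hypothetical, and the fragment length is taken to be the plain token distance between consecutive occurrences (the effective length defined in Fig. 1 may differ).

```python
from typing import List, Set


def recurrence_gaps(tokens: List[str], pattern: Set[str], min_gap: int = 1) -> List[int]:
    """Token distances between consecutive occurrences of a word pattern.

    `pattern` is a set of lowercase surface forms treated as one word pattern
    (an assumption made for illustration). Gaps shorter than `min_gap` are
    dropped, a simple stand-in for ignoring short-range contacts.
    """
    # Positions where any form of the pattern occurs.
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in pattern]
    # Distance from each occurrence to the next one.
    gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
    return [g for g in gaps if g >= min_gap]


tokens = "the student met a student and later the students left".split()
pattern = {"student", "students"}
print(recurrence_gaps(tokens, pattern))             # occurrences at 1, 4, 8
print(recurrence_gaps(tokens, pattern, min_gap=4))  # short gap filtered out
```

Dropping short gaps outright is only one possible reading of "ignore contacts between short-range neighbors"; merging nearby occurrences into a single contact would be another.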

#### 2.1.1 Recurrence of non-topical patterns

In a memoryless (hence banal) Poisson process (Fig. 2b), recurrence times are exponentially distributed (Fig. 2d,d${}^{\prime}$). The same is also true for word recurrence in a randomly reshuffled text [5]. If we have ${n}_{ii}$ independent samples of exponentially distributed random variables ${L}_{ii}$, then the statistic ${\delta}_{i}:=\log\langle{L}_{ii}\rangle-\langle\log{L}_{ii}\rangle-{\gamma}_{0}+\frac{1}{2{n}_{ii}}$ satisfies an inequality

$$ | (1) |

with probability 95% (see Theorem 1 in Appendix A for a two-sigma rule). Here, ${\gamma}_{0}:=\lim_{n\to\infty}\left(-\log n+\sum_{m=1}^{n}\frac{1}{m}\right)$ is the Euler–Mascheroni constant.
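The centering term $\frac{1}{2{n}_{ii}}$ and the scale of a two-sigma band can be motivated by a short calculation. The following is a sketch under the stated exponential assumption, not a transcription of Theorem 1. The symbols $n$ (shorthand for ${n}_{ii}$), $S$ (sum of the samples), $\psi$ (digamma function) and $\psi'$ (trigamma function) are notation introduced here. The variance line uses the standard fact that the ratios $L_{ii}^{(m)}/S$ of independent exponential variables are independent of $S$.

```latex
\begin{aligned}
&L_{ii}^{(1)},\dots,L_{ii}^{(n)} \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(k),
\qquad S=\sum_{m=1}^{n} L_{ii}^{(m)} \sim \mathrm{Gamma}(n,k),\\
&\mathbb{E}\,\langle \log L_{ii}\rangle = -\gamma_{0}-\log k,
\qquad
\mathbb{E}\,\log\langle L_{ii}\rangle = \psi(n)-\log n-\log k,\\
&\mathbb{E}\bigl[\log\langle L_{ii}\rangle-\langle\log L_{ii}\rangle-\gamma_{0}\bigr]
 = \psi(n)-\log n = -\frac{1}{2n}+O\!\left(\frac{1}{n^{2}}\right),\\
&\operatorname{Var}\bigl[\log\langle L_{ii}\rangle-\langle\log L_{ii}\rangle\bigr]
 = \frac{\pi^{2}}{6n}-\psi'(n)
 = \frac{1}{n}\left(\frac{\pi^{2}}{6}-1-\frac{1}{2n}\right)+O\!\left(\frac{1}{n^{3}}\right).
\end{aligned}
```

Under these assumptions ${\delta}_{i}$ has mean close to zero, and two standard deviations correspond to a half-width of roughly $2\sqrt{\frac{1}{n}\left(\frac{\pi^{2}}{6}-1-\frac{1}{2n}\right)}$. The exact form of the bound proved in Appendix A may differ from this estimate.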

As a working definition, we consider a word pattern ${\mathsf{W}}_{i}$ non-topical if its ${n}_{ii}$ counts of effective fragment lengths ${L}_{ii}$ are exponentially distributed, $\mathbb{P}({L}_{ii}>t)\sim {e}^{-kt}$, within 95% margins of error [that is, satisfying (1) above].
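A minimal Python sketch of this working definition follows. The statistic ${\delta}_{i}$ is computed exactly as defined in the text. The band half-width in `two_sigma_width` is an assumption: it is the two-standard-deviation estimate $2\sqrt{(\pi^{2}/6-1-1/(2n))/n}$ for exponential samples, and may not coincide with the bound in (1). All function names are hypothetical.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant gamma_0


def delta_statistic(lengths: Sequence[float]) -> float:
    """delta_i = log<L> - <log L> - gamma_0 + 1/(2n), for positive lengths."""
    n = len(lengths)
    mean_len = sum(lengths) / n
    mean_log = sum(math.log(x) for x in lengths) / n
    return math.log(mean_len) - mean_log - EULER_GAMMA + 1.0 / (2 * n)


def two_sigma_width(n: int) -> float:
    """Assumed half-width of the band: two standard deviations of the
    statistic for n exponential samples (an estimate, not quoted from (1))."""
    return 2.0 * math.sqrt((math.pi ** 2 / 6 - 1 - 1 / (2 * n)) / n)


def looks_non_topical(lengths: Sequence[float]) -> bool:
    """True if the recurrence lengths are compatible with a Poisson process."""
    return abs(delta_statistic(lengths)) < two_sigma_width(len(lengths))


# A strictly periodic pattern is far from exponential, so it is flagged.
print(looks_non_topical([5.0] * 10))  # False
```

For a strictly periodic pattern $\log\langle L\rangle=\langle\log L\rangle$, so ${\delta}_{i}=-{\gamma}_{0}+\frac{1}{2n}$, which lies outside the assumed band already for ten samples.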

#### 2.1.2 Recurrence of topical patterns

In contrast, we consider a word pattern ${\mathsf{W}}_{i}$ topical if its diagonal statistics ${n}_{ii},{L}_{ii}$ constitute significant departure from the Poissonian line $\langle\log{L}_{ii}\rangle-\log\langle{L}_{ii}\rangle+{\gamma}_{0}=0$ (Fig. 2e, blue line), violating the bound in (1).

Notably, most data points for topics (colored dots on Fig. 2e) in Jane Austen’s Pride and Prejudice mark systematic downward departures from the Poissonian line. This suggests that the topical recurrence times $\tau ={L}_{ii}$ follow weighted mixtures of exponential distributions (Fig. 2c,c${}^{\prime}$):

$$\mathbb{P}(\tau>t)\sim\sum_{m}{c}_{m}{e}^{-{k}_{m}t},\tag{2}$$

(where ${c}_{m},{k}_{m}>0$, and $\sum_{m}{c}_{m}=1$), which impose an inequality constraint on the recurrence time $\tau ={L}_{ii}$:

$$\langle\log{L}_{ii}\rangle-\log\langle{L}_{ii}\rangle+{\gamma}_{0}=\sum_{m}{c}_{m}\log\frac{1}{{k}_{m}}-\log\sum_{m}\frac{{c}_{m}}{{k}_{m}}\le 0.\tag{3}$$
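The sign in (3) is an instance of Jensen's inequality for the concave logarithm, applied to the values $1/{k}_{m}$ with weights ${c}_{m}$; equality holds only when all rates ${k}_{m}$ coincide, that is, for a single exponential. The short Python check below evaluates the right-hand side of (3) for given weights and rates. The function name `mixture_log_gap` and the sample values are illustrative assumptions.

```python
import math
from typing import Sequence


def mixture_log_gap(weights: Sequence[float], rates: Sequence[float]) -> float:
    """Right-hand side of (3): sum_m c_m log(1/k_m) - log(sum_m c_m / k_m)."""
    if len(weights) != len(rates):
        raise ValueError("weights and rates must have the same length")
    if min(weights) <= 0 or min(rates) <= 0:
        raise ValueError("weights c_m and rates k_m must be positive")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights c_m must sum to 1")
    # Weighted mean of log(1/k_m), i.e. <log L> + gamma_0 for the mixture.
    mean_of_logs = sum(c * math.log(1.0 / k) for c, k in zip(weights, rates))
    # Log of the weighted mean of 1/k_m, i.e. log <L> for the mixture.
    log_of_mean = math.log(sum(c / k for c, k in zip(weights, rates)))
    return mean_of_logs - log_of_mean


print(mixture_log_gap([1.0], [0.02]))            # single exponential: zero gap
print(mixture_log_gap([0.5, 0.5], [1.0, 0.01]))  # two time scales: negative gap
```

The wider the spread of time scales $1/{k}_{m}$ in the mixture, the more negative the gap, which matches the downward departures from the Poissonian line described for topical patterns.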