Finding out the meaning of words in context, as a central task in thesemantic processing of natural languages, exhibits a data-size discrepancy:Machines require much larger amount of verbal training than average humans,before they can interpret information and acquire knowledge. Using a Markovmodel, we assign language-independent semantic fingerprints to words in aparticular document of moderate length, without consulting externalknowledge-base or thesaurus. Instead of embedding words into very highdimensional spaces, we represent each concept by a few dozen parameters,interpretable as algebraic invariants in succinct statistical operations onlocal environments of individual words. These semantic representations enable arobot reader to both understand short texts in a given language (automatedquestion-answering) and match medium-length texts across different languages(automated word translation). Our semantic fingerprints quantify local meaningof words in 14 representative languages across 5 major language families,suggesting a universal and cost-effective mechanism by which human languagesare processed at the semantic level.
Quick Read (beta)
Linguistic Universals: Language-independent semantic fingerprints
Weinan E${}^{1,2\ast}$ Yajun Zhou${}^{2\ast}$
\tikzset
>=stealth
\pgfplotssetscaled y ticks=false
Finding out the meaning of words in context, as a central task in the semantic processing of natural languages,
exhibits a data-size discrepancy: Machines require much larger amount of verbal training than average humans, before they can interpret information and acquire knowledge.
Using a Markov model, we assign language-independent semantic fingerprints to words in a particular document of moderate length, without consulting external knowledge-base or thesaurus. Instead of embedding words into very high dimensional spaces, we represent each concept by a few dozen parameters, interpretable as algebraic invariants in succinct statistical operations on local environments of individual words. These semantic representations enable a robot reader to both understand short texts in a given language (automated question-answering) and match medium-length texts across different languages (automated word translation). Our semantic fingerprints quantify local meaning of words in 14 representative languages across 5 major language families, suggesting a universal and cost-effective mechanism by which human languages are processed at the semantic level.
Semantic processing (?) ensures accuracy in monolingual communications and minimizes loss in cross-lingual translations. Unlike phonology (?, ?, ?), morphology (?, ?, ?, ?), syntax (?, ?, ?, ?, ?), among other aspects (?) of human languages, the mechanism of semantics is a less-charted territory. Data-hungry algorithms in machine learning achieve impressive success in some tasks of document comprehension (?, ?), through high-dimensional numerical representations of words and phrases (?, ?). To fill the data-size gap between humans and machines, we will devise language-independent semantic fingerprints (LISF) to numerically characterize connectivity and association of individual concepts, even with scant input of verbal information.^{†}^{†}${}^{1}$Department of Mathematics & Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, USA. ${}^{2}$Beijing Institute of Big Data Research, Beijing 100871, P. R. China.
${}^{\ast}$Corresponding author. E-mail: [email protected] (W.E), [email protected] (Y.Z.)
Like the physical world, there is a hierarchy of length scales in languages. On short scales such as syllables, words, and phrases, human languages do not exhibit a universal pattern related to semantics. Except for a few onomatopoeias, the sounds of words do not affect their meaning (?). Neither do morphological parameters (?) (say, singular/plural, present/past) or syntactic rôles (?) (say, subject/object, active/passive). In short, there are no universal semantic mechanisms at the phonological, lexical or syntactical levels (?). Grammatical “rules and principles” (?, ?), however typologically diverse, play no definitive rôle in determining the inherent meaning of a word.
Motivated by the observations above, we will build our quantitative semantic model on long-range and language-independent textual features. Specifically, we will measure the lengths of text fragments flanked by word patterns of interest (Fig. 1A). Here, a word pattern is a collection of content words that are identical up to morphological parameters and syntactic rôles. A content word signifies definitive concepts (like apple, eat, red), instead of serving purely grammatical or logical functions (like but, of, the). Fragment length statistics will tell us how tightly/loosely one concept is connected to another. This in turn, will provide us with quantitative criteria for inclusion/exclusion of different concepts within the same (computationally constructed) semantic field. Such statistical semantic mining will then pave the way for machine comprehension and machine translation.
Usually, one can assume that a reader processes texts at roughly uniform
speed [ (?), section 1.1]. So, up to a constant scaling factor, the recurrence times for a word pattern ${\U0001d5b6}_{i}$ are approximately distributed as ${n}_{ii}$ samples of the effective fragment lengths ${L}_{ii}$ (Fig. 1A). Here, while counting as in Fig. 1A, we
ignore
contacts between short-range neighbors, which may involve language-dependent redundancies [ (?), section 1.2].
As a working definition, we consider a word pattern ${\U0001d5b6}_{i}$non-topical if its ${n}_{ii}$ counts of effective fragment lengths ${L}_{ii}$ are exponentially distributed, within 95% margins of error [ (?), section 1.3]. This definition hearkens back to the exponentially distributed recurrence times in a randomly reshuffled text (?), or a memoryless (hence banal) Poisson process (fig. S1B). In contrast, we consider a word pattern ${\U0001d5b6}_{i}$topical if its diagonal statistics ${n}_{ii},{L}_{ii}$ constitute significant departure from the Poissonian line (Fig. 1B, blue line). Notably, in Fig. 1B, most data points for topics in Jane Austen’s Pride and Prejudice mark systematic downward departures from the Poissonian line. This suggests that the topical recurrence times follow weighted mixtures of exponential distributions [ (?), section 1.3, fig. S1C,C${}^{\prime}$].
The diagonal statistics ${n}_{ii},{L}_{ii}$ (Fig. 1A) have enabled us to extract topics automatically (Fig. 1B). The off-diagonal statistics ${n}_{ij},{L}_{ij}$ (Fig. 1A) will allow us to determine how strongly one word pattern ${\U0001d5b6}_{i}$ binds to another word pattern ${\U0001d5b6}_{j}$: In an empirical Markov matrix $\mathbf{P}=({\mathbf{p}}_{\mathrm{\mathbf{i}\mathbf{j}}})$, the long-range transition rate ${p}_{ij}$ is estimated by ${n}_{ij}$ times the geometric mean of
$1/{L}_{ij}$ [ (?), section 1.3].
Moreover, the spectrum
$\sigma (\mathbf{P})$
(collection of eigenvalues) is approximately invariant against translations of texts (Fig. 1C), which can be explained by a matrix similarity transformation [ (?), section 1.4].
Later on, specializing such spectral invariance to individual topical patterns, we will be able to generate semantic fingerprints through a list of topic-specific and language-independent eigenvalues. Here, we will be particularly interested in recurrence eigenvalues of individual topical patterns [ (?), section 1.4], which correspond to multiple decay rates in the weighted mixtures of exponential distributions.
Unlike the single exponential decays associated to non-topical recurrence patterns, the multiple exponential decay modes will enable our robot reader to easily discern one topic from another.
In general, it is numerically challenging to recover multiple exponential decay modes from a limited amount of recurrence time measurements (?). However, in text processing, we can circumvent such difficulties by off-diagonal statistics ${n}_{ij}$ and ${L}_{ij}$ that provide semantic contexts for individual topical patterns.
To quantitatively define the semantic rôle of a topical pattern ${\U0001d5b6}_{i}$, we specify a local, directed, and weighted graph, corresponding to a localized Markov transition matrix ${\mathbf{P}}^{[\mathbf{i}]}$. To localize, we need to remove edges between two vertices ${\U0001d5b6}_{i}$ and ${\U0001d5b6}_{j}$,
when ${L}_{ij}$ and ${L}_{ji}$ are “long enough” relative to what one could naïvely expect from ${n}_{ij},{n}_{ji}$ and ${L}_{ii},{L}_{jj}$.
Here, for naïve expectation, we approximate the probability $\mathbb{P}(\u27e8\mathrm{log}{L}_{ij}\u27e9>\mathrm{\ell})$ by a Gaussian model ${\alpha}_{ij}(\mathrm{\ell})$ (colored curves in Fig. 2A) whose mean and variance are deducible from ${n}_{ij}$ and ${L}_{ii}$ [ (?), section 1.4]. The parameters in the Gaussian model are justified by detailed balance on an ergodic Markov chain, and become asymptotically exact if distinct word patterns are statistically independent (such as ${\alpha}_{13}$, ${\alpha}_{24}$, ${\alpha}_{31}$, ${\alpha}_{34}$ in Fig. 2A).
Empirically, we find that higher ${\alpha}_{ij}(\mathrm{\ell})$ scores point to closer affinities between word patterns (Fig. 2A), attributable to kinship (Elizabeth, Jane), courtship (Darcy, Elizabeth), disposition (Darcy, pride) and so on. Our robot reader automatically detects such affinities, without references other than the novel itself. Therefore, we can use the ${\alpha}_{ij}(\mathrm{\ell})$ scores as guides to numerical approximations of semantic fields, hereafter referred to as semantic cliques.
We invite a topical pattern ${\U0001d5b6}_{j}$ to the semantic clique ${\mathcal{S}}_{i}$ (Figs. 2A and B, insets) surrounding
${\U0001d5b6}_{i}$, if $\mathrm{min}\{{\alpha}_{ij}(\u27e8\mathrm{log}{L}_{ij}\u27e9),{\alpha}_{ji}(\u27e8\mathrm{log}{L}_{ji}\u27e9)\}>{\alpha}_{*}$ for a standard Gaussian threshold ${\alpha}_{*}:=\frac{1}{\sqrt{2\pi}}{\int}_{-\mathrm{\infty}}^{1}{e}^{-{x}^{2}/2}\mathit{d}x\approx 0.8413$. This operation emulates the brainstorming procedure of a human reader, who associates one word with another only when they stay much closer than two randomly picked words, according to his/her impression.
AC${\U0001d5b6}_{1}=\text{Eliza(}\mathrm{\varnothing}|\text{beth}|\text{beth\u2019s)},{\U0001d5b6}_{2}=\text{Darcy(}\mathrm{\varnothing}|\text{\u2019s)},{\U0001d5b6}_{3}=\text{pr(ide}|\text{ided}|\text{oud}|\text{oudly}|\text{oudest)},{\U0001d5b6}_{4}=\text{Jane(}\mathrm{\varnothing}|\text{\u2019s)}$B{tikzpicture}[scale=.8]
{axis}[xmin=-0.5,xmax=9.5,xlabel style=yshift=.2cm,xlabel=$-\mathrm{log}|\lambda ({\mathbf{R}}^{[\mathbf{i}]})|$,ylabel=Cumul. counts,ylabel style=yshift=-.2cm,small,height=4.5cm,width=7.05cm,ymin=0,ymax=50
, minor y tick num = 4, minor x tick num = 1
]
\addplot[const plot,thin,draw=blue,densely dashed] plot coordinates (0,0)(0.32220795295265,1)(1.56877450953151,2)(2.05343577065995,3)(2.61911591176899,4)(2.78144473392766,5)(2.78144473392766,6)(3.11707459762707,7)(3.11707459762707,8)(3.25995308658605,9)(3.35237510365735,10)(3.76374692832456,11)(3.76374692832456,12)(4.00703628694193,13)(4.00703628694193,14)(4.15641192195518,15)(4.15641192195518,16)(4.24435241361338,17)(4.24435241361338,18)(4.47969541700018,19)(4.47969541700018,20)(4.53617709320394,21)(4.54884684145921,22)(4.54884684145921,23)(5.28429991388079,24)(5.45839630641384,25)(7.41424594439814,26);\addplot[const plot,thin,draw=yellow!50!orange,densely dashed] plot coordinates (0,0)(0.42468705544310,1)(1.91388684511641,2)(2.29733030100358,3)(2.66301554498268,4)(2.66301554498268,5)(3.09409423954336,6)(3.48046820854723,7)(3.80322578694118,8)(3.80322578694118,9)(3.85068167367354,10)(3.87569656670043,11)(3.87569656670043,12)(4.40793959315427,13)(4.40793959315427,14)(4.42646434780309,15)(4.42646434780309,16)(4.43152478531075,17)(4.47878576400518,18)(4.48812958313127,19)(4.48812958313127,20)(5.42834103810266,21)(5.42834103810266,22)(5.95899543326392,23)(5.95899543326392,24)(7.59402470747260,25);\addplot[const plot,thin,draw=green,densely dashed] plot coordinates (0,0)(0.26171372634093,1)(1.88893338072425,2)(1.99329057312888,3)(2.44274759970787,4)(3.04341515963974,5)(3.18415044115211,6)(3.63388682122032,7)(3.63388682122032,8)(3.78700632107653,9)(3.78700632107653,10)(4.13561444559012,11)(4.38374323961263,12)(4.38374323961263,13)(4.50094057637412,14)(4.50094057637412,15)(4.63602092956864,16)(4.63602092956864,17)(5.04104386297606,18)(5.05664349700382,19)(5.05664349700382,20)(5.08747913725148,21)(5.08747913725148,22)(5.16238491628622,23)(5.16238491628622,24)(5.17158763706523,25)(5.17158763706523,26)(5.23056389303276,27)(5.23056389303276,28)(5.66886302133743,29)(5.66886302133743,30)(6.01579697360825,31)(6.01579697360825,32);\addplot[const plot,thin,draw=red,densely dashed] plot coordinates (0,0)(0.22762850582409,1)(2.26050288321408,2)(2.26050288321408,3)(2.73886600232238,4)(3.05870527819150,5)(3.05870527819150,6)(3.27300384137315,7)(3.27300384137315,8)(3.42002404126779,9)(3.64631122138641,10)(3.81536423904743,11)(3.81536423904743,12)(3.82573476426487,13)(4.08516569232732,14)(4.08516569232732,15)(4.15021193981405,16)(4.15021193981405,17)(4.28304595006059,18)(4.28304595006059,19)(4.54701970877615,20)(4.63446882401989,21)(4.63446882401989,22)(4.92518442100042,23)(4.92518442100042,24)(4.95195210048509,25)(4.95195210048509,26)(5.34354311467649,27)(5.34354311467649,28)(5.93737642689069,29)(5.93737642689069,30)(7.18161792558578,31)(7.18161792558578,32);\addplot[const plot,thin,draw=blue] plot coordinates (0,0)(0.32220795295265,1)(1.56877450953151,2)(2.05343577065995,3)(2.61911591176899,4)(2.78144473392766,5)(2.78144473392766,6)(3.11707459762707,7)(3.11707459762707,8)(3.25995308658605,9)(3.35237510365735,10)(3.76374692832456,11)(3.76374692832456,12)(4.00703628694193,13);\addplot[const plot,thin,draw=yellow!50!orange] plot coordinates (0,0)(0.42468705544310,1)(1.91388684511641,2)(2.29733030100358,3)(2.66301554498268,4)(2.66301554498268,5)(3.09409423954336,6)(3.48046820854723,7)(3.80322578694118,8)(3.80322578694118,9)(3.85068167367354,10);\addplot[const plot,thin,draw=green] plot coordinates (0,0)(0.26171372634093,1)(1.88893338072425,2)(1.99329057312888,3)(2.44274759970787,4)(3.04341515963974,5)(3.18415044115211,6)(3.63388682122032,7)(3.63388682122032,8)(3.78700632107653,9)(3.78700632107653,10)(4.13561444559012,11)(4.38374323961263,12)(4.38374323961263,13);\addplot[const plot,thin,draw=red] plot coordinates (0,0)(0.22762850582409,1)(2.26050288321408,2)(2.26050288321408,3)(2.73886600232238,4)(3.05870527819150,5)(3.05870527819150,6)(3.27300384137315,7)(3.27300384137315,8)(3.42002404126779,9)(3.64631122138641,10)(3.81536423904743,11)(3.81536423904743,12)(3.82573476426487,13)(4.08516569232732,14)(4.08516569232732,15)(4.15021193981405,16);{tikzpicture}[scale=.8]
{axis}[yticklabels=,xmin=-.5,xmax=9.5,xlabel style=yshift=.2cm,xlabel=$-\mathrm{log}|\lambda ({\mathbf{R}}^{[\mathbf{i}]})|$,ylabel=,small,height=4.5cm,width=7.05cm,ymin=0,ymax=50
, minor y tick num = 4, minor x tick num = 1
]
\addplot[const plot,thin,draw=blue,densely dashed] plot coordinates (0,0)(0.10685330578770,1)(2.34182889171998,2)(2.52161496341122,3)(2.58077702739777,4)(2.79517328598574,5)(3.24787174744998,6)(3.24787174744998,7)(3.58141808763777,8)(3.58141808763777,9)(3.77810167778609,10)(3.94711575688057,11)(3.99354169872394,12)(3.99354169872394,13)(4.00652691746312,14)(4.00652691746312,15)(4.02441164447511,16)(4.30573731427503,17)(4.30573731427503,18)(4.37401030869072,19)(4.74497203951153,20)(4.74497203951153,21)(4.78260211345399,22)(4.78260211345399,23)(4.80154438232183,24)(4.80154438232183,25)(4.87774758339092,26)(4.87774758339092,27)(5.09899332789938,28)(5.10293615110034,29)(5.10293615110034,30)(5.14288499586647,31)(5.43837726757906,32)(5.45884314463469,33)(5.45884314463469,34)(5.70115750083496,35)(6.69090889587401,36);\addplot[const plot,thin,draw=yellow!50!orange,densely dashed] plot coordinates (0,0)(0.13611000011202,1)(2.00976392890417,2)(2.71726525568887,3)(2.80086573012721,4)(3.00015087704195,5)(3.19099977683431,6)(3.35751807709522,7)(3.38714559131331,8)(3.49248353946481,9)(3.49248353946481,10)(3.54202638843791,11)(3.79717447508252,12)(3.79717447508252,13)(3.96621389826330,14)(3.96621389826330,15)(4.06514679091855,16)(4.06514679091855,17)(4.22819668144596,18)(4.22819668144596,19)(4.39611234743818,20)(4.39611234743818,21)(4.58288536220391,22)(4.58288536220391,23)(4.63531765668741,24)(4.63531765668741,25)(4.67308668625852,26)(4.67308668625852,27)(4.77291232844163,28)(4.96232385489969,29)(4.96232385489969,30)(5.19806601729738,31)(5.19806601729738,32)(5.25279908996314,33)(5.25279908996314,34)(5.86160578978962,35)(5.86160578978962,36);\addplot[const plot,thin,draw=green,densely dashed] plot coordinates (0,0)(0.12947426313459,1)(2.12591385705612,2)(2.73503738745595,3)(2.96057704877546,4)(3.12862939551519,5)(3.21555419635342,6)(3.21555419635342,7)(3.37339463508671,8)(3.52434218331204,9)(3.52434218331204,10)(3.76285561735342,11)(3.80442663006108,12)(3.87058728494671,13)(3.87058728494671,14)(4.12311255923610,15)(4.12311255923610,16)(4.43355641111795,17)(4.43355641111795,18)(4.60185459675352,19)(4.60185459675352,20)(4.61689734885113,21)(4.73650476926929,22)(4.73650476926929,23)(4.77083049338317,24)(4.77083049338317,25)(4.93509867070674,26)(4.93509867070674,27)(4.97934802848736,28)(4.97934802848736,29)(5.20981706087254,30)(5.20981706087254,31)(5.53642088357288,32)(5.53642088357288,33)(6.18102425629127,34)(6.18102425629127,35)(6.22870960716267,36)(6.22870960716267,37);\addplot[const plot,thin,draw=red,densely dashed] plot coordinates (0,0)(0.11392989752484,1)(2.22998150860765,2)(2.90646071204164,3)(2.97583803997370,4)(3.27653225262211,5)(3.34085172231751,6)(3.48786058242352,7)(3.48786058242352,8)(3.56381854562820,9)(3.65131478042790,10)(3.65131478042790,11)(3.82538361848595,12)(3.82538361848595,13)(4.03554500407929,14)(4.03554500407929,15)(4.09427885952181,16)(4.09427885952181,17)(4.28772684934941,18)(4.35663152905546,19)(4.35663152905546,20)(4.43211501600674,21)(4.43211501600674,22)(4.50150679807307,23)(4.50150679807307,24)(4.62796984207214,25)(4.62796984207214,26)(4.80442226561453,27)(4.80442226561453,28)(4.83208367714186,29)(4.83208367714186,30)(4.98555309376678,31)(4.98555309376678,32)(5.07099461131659,33)(5.07099461131659,34)(5.10189860328778,35)(5.10189860328778,36)(5.28003612322993,37)(5.46061695511284,38)(5.46061695511284,39)(5.81964108671420,40)(5.81964108671420,41)(5.82459557318953,42)(5.82459557318953,43)(9.37119008453853,44);\addplot[const plot,thin,draw=blue] plot coordinates (0,0)(0.10685330578770,1)(2.34182889171998,2)(2.52161496341122,3)(2.58077702739777,4)(2.79517328598574,5)(3.24787174744998,6)(3.24787174744998,7)(3.58141808763777,8)(3.58141808763777,9)(3.77810167778609,10)(3.94711575688057,11)(3.99354169872394,12)(3.99354169872394,13)(4.00652691746312,14)(4.00652691746312,15)(4.02441164447511,16)(4.30573731427503,17)(4.30573731427503,18)(4.37401030869072,19)(4.74497203951153,20);\addplot[const plot,thin,draw=yellow!50!orange] plot coordinates (0,0)(0.13611000011202,1)(2.00976392890417,2)(2.71726525568887,3)(2.80086573012721,4)(3.00015087704195,5)(3.19099977683431,6)(3.35751807709522,7)(3.38714559131331,8)(3.49248353946481,9)(3.49248353946481,10)(3.54202638843791,11)(3.79717447508252,12)(3.79717447508252,13)(3.96621389826330,14)(3.96621389826330,15)(4.06514679091855,16)(4.06514679091855,17);\addplot[const plot,thin,draw=green] plot coordinates (0,0)(0.12947426313459,1)(2.12591385705612,2)(2.73503738745595,3)(2.96057704877546,4)(3.12862939551519,5)(3.21555419635342,6)(3.21555419635342,7)(3.37339463508671,8)(3.52434218331204,9)(3.52434218331204,10)(3.76285561735342,11)(3.80442663006108,12)(3.87058728494671,13)(3.87058728494671,14)(4.12311255923610,15)(4.12311255923610,16);\addplot[const plot,thin,draw=red] plot coordinates (0,0)(0.11392989752484,1)(2.22998150860765,2)(2.90646071204164,3)(2.97583803997370,4)(3.27653225262211,5)(3.34085172231751,6)(3.48786058242352,7)(3.48786058242352,8)(3.56381854562820,9)(3.65131478042790,10)(3.65131478042790,11)(3.82538361848595,12)(3.82538361848595,13)(4.03554500407929,14)(4.03554500407929,15)(4.09427885952181,16)(4.09427885952181,17)(4.28772684934941,18)(4.35663152905546,19)(4.35663152905546,20)(4.43211501600674,21);{tikzpicture}[scale=.8]
{axis}[yticklabels=,xmin=-.5,xmax=9.5,xlabel style=yshift=.2cm,xlabel=$-\mathrm{log}|\lambda ({\mathbf{R}}^{[\mathbf{i}]})|$,ylabel=,small,height=4.5cm,width=7.05cm,ymin=0,ymax=50
, minor y tick num = 4, minor x tick num = 1
]
\addplot[const plot,thin,draw=blue,densely dashed] plot coordinates (0,0)(0.03375045936627,1)(1.84339112866770,2)(2.04989352100626,3)(2.25344865520748,4)(2.73109506060704,5)(2.99663941125549,6)(3.25512778898816,7)(3.32365312299691,8)(3.32365312299691,9)(3.46282618345518,10)(3.46282618345518,11)(3.65393321900561,12)(3.65393321900561,13)(3.95174049263931,14)(4.00270682911275,15)(4.27842169773721,16)(4.27842169773721,17)(4.52962618522869,18)(4.52962618522869,19)(4.65850485399020,20)(4.66269224968636,21)(4.66269224968636,22)(4.86896088465735,23)(4.86896088465735,24);\addplot[const plot,thin,draw=yellow!50!orange,densely dashed] plot coordinates (0,0)(0.01337958226662,1)(1.23627718146471,2)(2.56078706446613,3)(2.65671196285956,4)(2.88596406752471,5)(3.05972637203348,6)(3.43206174427130,7)(3.70810818547555,8)(3.70810818547555,9)(3.72728394611638,10)(3.72728394611638,11)(4.43243048108062,12)(4.43243048108062,13)(4.46468420506806,14)(4.46468420506806,15)(4.67752220750291,16)(4.67752220750291,17)(5.11649724615875,18)(5.11649724615875,19)(5.59658036171709,20)(5.59658036171709,21);
\addplot[const plot,thin,draw=green,densely dashed] plot coordinates (0,0)(0.02241646271159,1)(1.61053743998943,2)(2.06584195318460,3)(2.41340303393332,4)(2.84154690083292,5)(2.98507542500559,6)(2.98507542500559,7)(3.47403721759109,8)(3.47403721759109,9)(3.55871255413353,10)(3.55871255413353,11)(3.70826614565336,12)(3.80643800157878,13)(3.80643800157878,14)(4.15261240523967,15)(4.20851265325287,16)(4.20851265325287,17)(4.27118141523191,18)(4.27118141523191,19)(4.62713773689971,20)(4.62713773689971,21)(4.81634994746198,22)(4.93066411430782,23)(5.45149832787469,24)(5.90147619724694,25)(6.59536610798741,26);\addplot[const plot,thin,draw=red,densely dashed] plot coordinates (0,0)(0.03331149807929,1)(1.50662538164952,2)(2.06396151127533,3)(2.40676779108242,4)(3.01358374902029,5)(3.01358374902029,6)(3.13965009917626,7)(3.13965009917626,8)(3.30291818059154,9)(3.49139131768527,10)(3.97845584868851,11)(3.97845584868851,12)(3.99454267644476,13)(3.99454267644476,14)(4.37719479245446,15)(4.45982096576618,16)(4.66172363432402,17)(4.96539893960865,18)(5.41411521574973,19)(5.41411521574973,20)(6.18257206271890,21)(6.18257206271890,22);\addplot[const plot,thin,draw=blue] plot coordinates (0,0)(0.03375045936627,1)(1.84339112866770,2)(2.04989352100626,3)(2.25344865520748,4)(2.73109506060704,5)(2.99663941125549,6)(3.25512778898816,7)(3.32365312299691,8)(3.32365312299691,9)(3.46282618345518,10)(3.46282618345518,11)(3.65393321900561,12)(3.65393321900561,13)(3.95174049263931,14);\addplot[const plot,thin,draw=yellow!50!orange] plot coordinates (0,0)(0.01337958226662,1)(1.23627718146471,2)(2.56078706446613,3)(2.65671196285956,4)(2.88596406752471,5)(3.05972637203348,6)(3.43206174427130,7);\addplot[const plot,thin,draw=green] plot coordinates (0,0)(0.02241646271159,1)(1.61053743998943,2)(2.06584195318460,3)(2.41340303393332,4)(2.84154690083292,5)(2.98507542500559,6)(2.98507542500559,7)(3.47403721759109,8)(3.47403721759109,9)(3.55871255413353,10)(3.55871255413353,11)(3.70826614565336,12)(3.80643800157878,13)(3.80643800157878,14);\addplot[const plot,thin,draw=red] plot coordinates (0,0)(0.03331149807929,1)(1.50662538164952,2)(2.06396151127533,3)(2.40676779108242,4)(3.01358374902029,5)(3.01358374902029,6)(3.13965009917626,7)(3.13965009917626,8)(3.30291818059154,9)(3.49139131768527,10)(3.97845584868851,11)(3.97845584868851,12);Indo-EuropeanDanishGermanDutchSpanishFrenchLatinPolishRussianKoreanicKoreanTurkicTurkishUralicFinnishHungarianVasconicBasque#Topics:0 20 40 60 80 100 120correctcloseincorrectDE
WikiQA-Q26: How did Anne Frank die?Reference: “Anne Frank” (Wikipedia)$\Rightarrow $$\Rightarrow $(1) AnneFrank and her sister, Margot , were eventually transferred to the Bergen-Belsenconcentrationcamp , where they died of typhus in March 1945.
(2) Annelies "Anne" Marie Frank (, ?, ; 12 June 1929early March 1945) is one of the most discussed Jewish victims of the Holocaust .(3) Otto Frank, the only survivor of the family, returned to Amsterdam after the war to find that Anne’s diary had been saved, and his efforts led to its publication in 1947.(4) As persecutions of the Jewish population increased in July 1942, the family went into hiding in the hidden rooms of Anne’s father, Otto Frank ’s, office building.(5) The Frank family moved from Germany to Amsterdam in 1933, the year the Nazis gained control over Germany.
MAP
MRR
0.6190
CNN
0.6281
0.6086
LISF${}^{{\mathrm{*}}}$
0.6263
0.5993
LCLR
0.6086
0.5897
LISF
0.6060
0.5110
PV
0.5160
0.4891
Word Count
0.4924
0.3913
Random Sort
0.3990
Fig. 2: . Semantic cliques and their applications. (A) Empirical distributions of $\u27e8\mathrm{log}{L}_{ij}\u27e9$ in Pride and Prejudice, as gray and colored dots with radii $\frac{1}{4\sqrt{{n}_{ij}}}$, compared to Gaussian model ${\alpha}_{ij}(\mathrm{\ell})$ (colored curves parametrized by [ (?), equations (1.10)–(1.11)]). The numerical samplings of ${\U0001d5b6}_{j}$’s exhaust all the textual patterns available in the novel, including topical word patterns, non-topical word patterns and function words. Only those textual patterns with over 40 occurrences are displayed as data points. Inset of each frame shows the semantic clique ${\mathcal{S}}_{i}$ surrounding topic ${\U0001d5b6}_{i}$ (painted in black), color-coded by the ${\alpha}_{ij}(\u27e8\mathrm{log}{L}_{ij}\u27e9)$ score. The areas of the bounding boxes for individual word patterns are proportional to the components of ${\bm{\pi}}^{[i]}$ (the equilibrium state of ${\mathbf{P}}^{[\mathbf{i}]}$).
(B) Distributions for the magnitudes of eigenvalues (LISF) in the recurrence matrices ${\mathbf{R}}^{[\mathbf{i}]}$, for three concepts from four versions of Pride and Prejudice. The color encoding for languages follows Fig. 1C.
The largest $\lfloor {e}^{{\eta}_{i}}\rfloor $ magnitudes of eigenvalues are displayed as solid lines, while the remaining terms are shown in dashed lines.
Inset of each frame shows the semantic clique ${\mathcal{S}}_{i}$, counterclockwise from top-left, in French, Russian and Finnish. (C) Yields from bipartite matching of LISF for topical words between the English original of Pride and Prejudice and its translations into 13 languages out of 5 language families. (D) A construction of semantic clique $\mathcal{Q}\cup {\mathcal{Q}}^{\prime}$ (based on $\mathcal{Q}=\{$Anne, Frank, die$\mathrm{\}}$) weighted by the PageRank equilibrium state $\stackrel{\mathbf{~}}{\bm{\pi}}$ and subsequent question-answering. Top 5 candidate answers, with punctuation and spacing as given by WikiQA, are shown with font sizes proportional to the entropy production score [ (?), equation (1.16)]. Here, the top-scoring sentence with highlighted background is the same as the official answer chosen by the WikiQA team. Like a human reader, our algorithm automatically detects the place “Bergen-Belsen concentration camp”, cause “typhus”, and year “1945” of Anne Frank’s death. (E) Evaluations of our model (LISF and LISF${}^{*}$) on the WikiQA data set, in comparison with established algorithms.
On a local graph with vertices ${\mathcal{S}}_{i}=\{{\U0001d5b6}_{{i}_{1}}={\U0001d5b6}_{i},$${\U0001d5b6}_{{i}_{2}},$$\mathrm{\dots},$${\U0001d5b6}_{{i}_{{N}_{i}}}\}$, we specify the connectivity of each directed edge by a localized Markov matrix ${\mathbf{P}}^{[\mathbf{i}]}={({\mathbf{p}}_{\mathrm{\mathbf{j}\mathbf{k}}}^{[\mathbf{i}]})}_{\mathrm{\U0001d7cf}\le \mathbf{j},\mathbf{k}\le {\mathbf{N}}_{\mathbf{i}}}$. This localized Markov matrix
is the row-wise normalization of an ${N}_{i}\times {N}_{i}$ subblock of $\mathbf{P}$ with the same set of vertices as ${\mathcal{S}}_{i}$. Resetting the entries ${p}_{1k}^{[i]}$ and ${p}_{j1}^{[i]}$ as zero, one arrives at the localized recurrence matrix ${\mathbf{R}}^{[\mathbf{i}]}$. We call ${\mathbf{R}}^{[\mathbf{i}]}$ a recurrence matrix, because one can use it to compute the distribution for
recurrence times to the Markov state
${\U0001d5b6}_{i}$ in ${\mathcal{S}}_{i}$.
Experimentally, we resolve the connectivity of an individual pattern
${\U0001d5b6}_{i}$ through the recurrence spectrum $\sigma ({\mathbf{R}}^{[\mathbf{i}]})$ (Fig. 2B).
The dominant eigenvalues
of ${\mathbf{R}}^{[\mathbf{i}]}$ are concept-specific while remaining nearly language-independent (a localized version of the invariance in Fig. 1C).
Such empirical evidence motivates us to define the language-independent semantic fingerprint (LISF) of a word pattern ${\U0001d5b6}_{i}$
by a descending list for the magnitudes of eigenvalues ${\mathbf{v}}_{\mathbf{i}}=(|{\lambda}_{\mathrm{\U0001d7cf}}({\mathbf{R}}^{[\mathbf{i}]})|,|{\lambda}_{\mathrm{\U0001d7d0}}({\mathbf{R}}^{[\mathbf{i}]})|,\mathrm{\dots})$, computable from its semantic clique ${\mathcal{S}}_{i}$. We zero-pad this vector from the $(\lfloor {e}^{{\eta}_{i}}\rfloor +1)$st component onwards, where
${\eta}_{i}$ is the entropy production rate (?) of the Markov matrix ${\mathbf{P}}^{[\mathbf{i}]}$, measured in nats per word. Via bipartite matching [ (?), section 1.5, fig. S6; figs. S11–S23] of word vectors ${\mathbf{v}}_{\mathbf{i}}$ across languages, our algorithm translates words from parallel texts at very high precision (Fig. 2C; tables S3–S4).
So far, our semantic cliques ${\mathcal{S}}_{i}$ (Figs. 2A and B, insets) pick up words by numerical brainstorming from ${\U0001d5b6}_{i}$. These cliques inform us about their center word ${\U0001d5b6}_{i}$, through several types of semantic relations, including, but not limited to
•
Synonyms (pride and vanity in English, orgeuil and fierté in French, etc.);
•
Temperaments (Elizabeth, a delightful girl, often laughs, corresponding to French verbs sourire and rire);
•
Co-references (e.g. Darcy as a personification of pride);
•
Causalities (such as pride based on fortune).
In the light of this, these semantic cliques ${\mathcal{S}}_{i}$ are useful in text comprehension and question answering. We can expand a set of question words $\mathcal{Q}$ into $\mathcal{Q}\cup {\mathcal{Q}}^{\prime}$, by bringing together the semantic cliques generated from a reference text by each and every question word [ (?), section 1.6].
A sample work flow is shown in Fig. 2D,
to illustrate how our rudimentary question-answering machine handles a query. To answer a question, we use a single Wikipedia page (without infoboxes and other structural data) as the only reference document and training source. Like a typical human reader of Wikipedia, our numerical associative reasoning generates a weighted set of nodes $\mathcal{Q}\cup {\mathcal{Q}}^{\prime}$ (presented graphically as a thought bubble in Fig. 2D), without the help of external stimuli or knowledge feed. Here, the relative weights [ (?), section 1.6] in the nodes of $\mathcal{Q}\cup {\mathcal{Q}}^{\prime}$ are computable from the PageRank algorithm (?).
We then test our semantic model (LISF in Fig. 2E) on all the 1242 questions in the WikiQA data set, each of which is accompanied by at least one correct
answer located in a designated Wikipedia page.
Our algorithm’s performance is roughly on par with LCLR and CNN benchmarks (?), improving upon the baseline by significant margin. This is perhaps remarkable, considering the relatively scant data at our disposal. Unlike the LCLR approach, our numerical discovery of synonyms does not draw on the WordNet database (?) or pre-existent corpora of question-answer pairs. Unlike the CNN method, we do not need pre-trained word2vec embeddings (?) as semantic input.
Moreover, our algorithm (LISF${}^{*}$ in Fig. 2E) performs slightly better on a subset of 990 questions that do not require quantitative cues (How large? How long? How many? How old? What became of? What happened to? What year? and so on). This indicates that our structural model fits associative reasoning better than rule-based reasoning (?), while imitating human behaviour in the presence of limited data.
In our current work, we define semantics through algebraic invariants that are concept-specific and language-independent. To construct such invariants, we develop a stochastic model that assigns a semantic fingerprint (list of recurrence eigenvalues) to each
concept via its long-range contexts. Consistently using a single Markov framework, we are able to extract topics (Fig. 1B; figs. S10–S23), translate topics (Figs. 1C, 2B and C; figs. S11–S23) and understand topics (Figs. 2A, D and E), through statistical mining of short and medium-length texts. In view of these three successful applications, we are probably close to a complete set of semantic invariants, after demystifying the long-range behaviour of human languages.
Thanks to the independence between semantics and syntax (?), our current model conveniently ignores the non-Markovian syntactic structures which are essential to fluent speech.
In the near future, we
hope to extend our framework further, to incorporate both Markovian and non-Markovian features across different ranges. The Mathematical Principles of Natural Languages, as we envision, must and will combine the statistical analysis of a Markov model with linguistic properties on shorter time scales that convey morphological (?, ?, ?, ?) and syntactical (?, ?, ?, ?, ?) information.
References and Notes
1.
A. D. Friederici, J. Bahlmann, S. Heim, R. I. Schubotz, A. Anwander, Proc.
Natl. Acad. Sci. USA103, 2458 (2006).
2.
M. A. Nowak, D. C. Krakauer, Proc. Natl. Acad. Sci. USA96, 8028
(1999).
3.
C. Everett, D. E. Blasí, S. G. Roberts, Proc. Natl. Acad. Sci. USA112, 1322 (2015).
4.
C. Everett, Frontiers in Psychology8, Article 1285 (2017).
5.
S. Pinker, Nature387, 547 (1997).
6.
W. D. Marslen-Wilson, L. K. Tyler, Nature387, 592 (1997).
7.
E. Lieberman, J.-B. Michel, J. Jackson, T. Tang, M. A. Nowak, Nature449, 713 (2007).
8.
M. G. Newberry, C. A. Ahern, R. Clark, J. B. Plotkin, Nature551,
223 (2017).
9.
S. Pinker, Nature404, 441 (2000).
10.
M. A. Nowak, J. B. Plotkin, V. A. A. Jansen, Nature404, 495
(2000).
11.
N. Chomsky, Syntactic Structures (Mouton de Gruyter, Berlin, Germany,
2002), second edn.
12.
M. Dunn, S. J. Greenhill, S. C. Levinson, R. D. Gray, Nature473,
79 (2011).
13.
A. D. Friederici, N. Chomsky, R. C. Berwick, A. Moro, J. J. Bolhuis, Nat.
Hum. Behav.1, 713 (2017).
14.
Y. Yang, W.-t. Yih, C. Meek, Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing (EMNLP) (Association for
Computational Linguistics, Lisbon, Portugal, 2015).
15.
V. Tshitoyan, et al., Nature571, 95 (2019).
16.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Advances in
Neural Information Processing Systems 26 (NIPS, La Jolla, CA, 2013), pp.
3111–3119.
17.
S. Arora, Y. Li, Y. Liang, T. Ma, A. Risteski, Transactions of the
Association for Computational Linguistics4, 385 (2016).
18.
F. de Saussure, Cours de linguistique générale (Payot, Paris,
France, 1949), fifth edn.
19.
S. Pinker, A. Prince, Cognition28, 73 (1988).
20.
A. D. Friederici, Language Comprehension: A Biological Perspective,
A. D. Friederici, ed. (Springer, Berlin, Germany, 1999), chap. 9, pp.
265–304.
21.
Materials and methods are available as supplementary materials.
22.
J. P. Herrera, P. A. Pury, Eur. Phys. J. B63, 135 (2008).
23.
Y. Zhou, X. Zhuang, Biophys. J.91, 4045 (2006).
24.
T. M. Cover, J. A. Thomas, Elements of Information Theory (Wiley
Interscience, Hoboken, NJ, 2006), second edn.
25.
L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking:
Bringing order to the web, Tech. rep., Stanford InfoLab (1999).
http://ilpubs.stanford.edu:8090/422/.
26.
C. Fellbaum, ed., WordNet: An Electronic Lexical Database (Language,
Speech, and Communication) (MIT Press, Cambridge, MA, 1998).
27.
S. A. Sloman, Psychol. Bull.119, 3 (1996).
Acknowledgements
We thank N. Chomsky and S. Pinker for their inputs on several problems of linguistics. We thank X. Sun for discussions on neural networks. We thank X. Wan, R. Yan and D. Zhao for their suggestions on experimental design, during the early stages of this work.
Author contributions: W. E, Y. Z. designed the research. Y.Z. collected multilingual data, developed algorithms, and conducted numerical experiments. W. E, Y. Z. analyzed data and wrote the paper. Competing interests: The authors declare no competing interests.