Linguistic Universals: Language-independent semantic fingerprints

  • 2019-11-23 10:09:43
  • Weinan E, Yajun Zhou

Abstract

Finding out the meaning of words in context, as a central task in the semantic processing of natural languages, exhibits a data-size discrepancy: Machines require a much larger amount of verbal training than average humans before they can interpret information and acquire knowledge. Using a Markov model, we assign language-independent semantic fingerprints to words in a particular document of moderate length, without consulting an external knowledge base or thesaurus. Instead of embedding words into very high-dimensional spaces, we represent each concept by a few dozen parameters, interpretable as algebraic invariants in succinct statistical operations on local environments of individual words. These semantic representations enable a robot reader to both understand short texts in a given language (automated question-answering) and match medium-length texts across different languages (automated word translation). Our semantic fingerprints quantify local meaning of words in 14 representative languages across 5 major language families, suggesting a universal and cost-effective mechanism by which human languages are processed at the semantic level.

 


Weinan E1,2    Yajun Zhou2

Semantic processing (?) ensures accuracy in monolingual communications and minimizes loss in cross-lingual translations. Unlike phonology (?, ?, ?), morphology (?, ?, ?, ?), syntax (?, ?, ?, ?, ?), among other aspects (?) of human languages, the mechanism of semantics is a less-charted territory. Data-hungry algorithms in machine learning achieve impressive success in some tasks of document comprehension (?, ?), through high-dimensional numerical representations of words and phrases (?, ?). To fill the data-size gap between humans and machines, we will devise language-independent semantic fingerprints (LISF) to numerically characterize connectivity and association of individual concepts, even with scant input of verbal information.

1 Department of Mathematics & Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, USA.
2 Beijing Institute of Big Data Research, Beijing 100871, P. R. China.
Corresponding author. E-mail: [email protected] (W.E), [email protected] (Y.Z.)

Like the physical world, there is a hierarchy of length scales in languages. On short scales such as syllables, words, and phrases, human languages do not exhibit a universal pattern related to semantics. Except for a few onomatopoeias, the sounds of words do not affect their meaning (?). Neither do morphological parameters (?) (say, singular/plural, present/past) or syntactic rôles (?) (say, subject/object, active/passive). In short, there are no universal semantic mechanisms at the phonological, lexical or syntactical levels (?). Grammatical “rules and principles” (?, ?), however typologically diverse, play no definitive rôle in determining the inherent meaning of a word.

Motivated by the observations above, we will build our quantitative semantic model on long-range and language-independent textual features. Specifically, we will measure the lengths of text fragments flanked by word patterns of interest (Fig. 1A). Here, a word pattern is a collection of content words that are identical up to morphological parameters and syntactic rôles. A content word signifies definitive concepts (like apple, eat, red), instead of serving purely grammatical or logical functions (like but, of, the). Fragment length statistics will tell us how tightly or loosely one concept is connected to another. This, in turn, will provide us with quantitative criteria for including or excluding different concepts within the same (computationally constructed) semantic field. Such statistical semantic mining will then pave the way for machine comprehension and machine translation.

Usually, one can assume that a reader processes texts at roughly uniform speed [ (?), section 1.1]. So, up to a constant scaling factor, the recurrence times for a word pattern 𝖶i are approximately distributed as nii samples of the effective fragment lengths Lii (Fig. 1A). Here, while counting as in Fig. 1A, we ignore contacts between short-range neighbors, which may involve language-dependent redundancies [ (?), section 1.2].
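The counting rule for recurrences can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that lengths are measured in characters, that tokens are runs of letters and apostrophes, and that a word pattern is supplied as an explicit list of word forms. The function name `recurrence_lengths` is hypothetical. Following the rule stated for Fig. 1A, a gap counts only if it is strictly longer than the longest word in the pattern, and the effective length discounts that longest word.

```python
import re

def recurrence_lengths(text, pattern_words):
    """Effective fragment lengths L_ii between consecutive occurrences
    of one word pattern (assumption: lengths counted in characters)."""
    words = {w.lower() for w in pattern_words}
    longest = max(len(w) for w in words)
    # Character spans of every occurrence of the pattern, in reading order.
    spans = [m.span() for m in re.finditer(r"[A-Za-z']+", text)
             if m.group().lower() in words]
    lengths = []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        gap = next_start - prev_end      # characters between two occurrences
        if gap > longest:                # ignore short-range contacts
            lengths.append(gap - longest)
    return lengths
```

Because whole tokens are compared, a form such as "unhappy" is not counted as an occurrence of a pattern that lists only "happy" and its inflections.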

As a working definition, we consider a word pattern 𝖶i non-topical if its nii counts of effective fragment lengths Lii are exponentially distributed, within 95% margins of error [ (?), section 1.3]. This definition hearkens back to the exponentially distributed recurrence times in a randomly reshuffled text (?), or a memoryless (hence banal) Poisson process (fig. S1B). In contrast, we consider a word pattern 𝖶i topical if its diagonal statistics nii, Lii depart significantly from the Poissonian line (Fig. 1B, blue line). Notably, in Fig. 1B, most data points for topics in Jane Austen’s Pride and Prejudice mark systematic downward departures from the Poissonian line. This suggests that the topical recurrence times follow weighted mixtures of exponential distributions [ (?), section 1.3, fig. S1C,C].
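For an exponential distribution with mean μ, the expected logarithm equals log μ − γ, where γ ≈ 0.5772 is Euler's constant. This is one way to read the unit slope and negative intercept of the Poissonian line in Fig. 1B. The sketch below measures the signed vertical offset of a sample from that line, assuming that the two coordinates in Fig. 1B are the logarithm of the mean length and the mean of the logarithmic length. It does not reproduce the 95% margin-of-error test, whose details are given in the supplementary section cited above. The function name `poisson_departure` is hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def poisson_departure(lengths):
    """Offset of mean(log L) from the Poissonian line log(mean L) - gamma.
    Near zero for exponential samples; negative means a downward departure."""
    n = len(lengths)
    mean_log = sum(math.log(x) for x in lengths) / n
    log_mean = math.log(sum(lengths) / n)
    return mean_log - (log_mean - EULER_GAMMA)
```

A weighted mixture of exponentials with distinct decay rates gives a negative offset, because the logarithm is concave. A sample of identical lengths sits on Jensen's bound, with offset γ.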

The diagonal statistics nii, Lii (Fig. 1A) have enabled us to extract topics automatically (Fig. 1B). The off-diagonal statistics nij, Lij (Fig. 1A) will allow us to determine how strongly one word pattern 𝖶i binds to another word pattern 𝖶j: In an empirical Markov matrix 𝐏=(pij), the long-range transition rate pij is estimated by nij times the geometric mean of 1/Lij [ (?), section 1.3]. Moreover, the spectrum σ(𝐏) (collection of eigenvalues) is approximately invariant against translations of texts (Fig. 1C), which can be explained by a matrix similarity transformation [ (?), section 1.4]. Later on, specializing such spectral invariance to individual topical patterns, we will be able to generate semantic fingerprints through a list of topic-specific and language-independent eigenvalues. Here, we will be particularly interested in recurrence eigenvalues of individual topical patterns [ (?), section 1.4], which correspond to multiple decay rates in the weighted mixtures of exponential distributions.
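The estimate "nij times the geometric mean of 1/Lij" can be written out directly, since the geometric mean of 1/L equals exp(−mean of log L). The sketch below computes only this raw quantity from a list of observed fragment lengths. Any further normalization of 𝐏 is left to the cited supplementary section and is not attempted here. The function name `transition_rate` is hypothetical.

```python
import math

def transition_rate(lengths):
    """Raw estimate of p_ij: the number of long-range transitions n_ij
    times the geometric mean of 1/L_ij over those transitions."""
    n = len(lengths)
    if n == 0:
        return 0.0  # no observed transition from pattern i to pattern j
    geometric_mean_inverse = math.exp(-sum(math.log(x) for x in lengths) / n)
    return n * geometric_mean_inverse
```

For example, two transitions with effective lengths 2 and 8 have geometric mean length 4, so the raw estimate is 2 × 1/4 = 0.5.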

[Fig. 1, panels A to C; plot data omitted. Panel A: example word patterns 𝖶i={happier, happily, happiness, happy} and 𝖶j={marriage, married, marry}, with fragment lengths Lii and Lij marked on a sample text. Panel B: recurrence statistics (axis label: log Lii), with the area forbidden by Jensen’s inequality marked and labels for word patterns such as Eliza(|beth|beth’s), Darcy(|’s), Bennet(|’s|s), happ(ily|iness|y|ier|iest) and danc(e|ed|es|ing). Panel C: cumulative counts versus log|λ(𝐏)| and versus (1/π) arg λ(𝐏), for English, French, Russian and Finnish.]
Fig. 1: Statistical analysis of textual features. (A) Counting long-range transitions between word patterns. A transition from 𝖶i to 𝖶j counts towards long-range statistics if the underlined text fragment in between contains no occurrences of 𝖶i, and lasts strictly longer than the longest word in 𝖶i∪𝖶j. For each long-range transition, the effective fragment length Lij discounts the length of the longest word in 𝖶i∪𝖶j. (B) Recurrence statistics for word patterns in Jane Austen’s Pride and Prejudice, where ⟨·⟩ denotes averages over nii samples of long-range transitions. Data points in gray, green and red have radii 14nii. Labels for proper names and some literary motifs are attached next to the corresponding colored dots. Jensen’s bound (green dashed line) has unit slope and zero intercept. Exponentially distributed recurrence statistics reside on the line of Poissonian banality (blue line), with unit slope and negative intercept. Red (resp. green) dots mark significant downward (resp. upward) departure from the blue line. (C) Distributions of eigenvalues λ of empirical Markov matrices 𝐏, with nearly language-independent modulus |λ(𝐏)| and phase-angle arg λ(𝐏).

Unlike the single exponential decays associated with non-topical recurrence patterns, the multiple exponential decay modes will enable our robot reader to easily discern one topic from another. In general, it is numerically challenging to recover multiple exponential decay modes from a limited number of recurrence time measurements (?). However, in text processing, we can circumvent such difficulties by using the off-diagonal statistics nij and Lij, which provide semantic contexts for individual topical patterns.

To quantitatively define the semantic rôle of a topical pattern 𝖶i, we specify a local, directed, and weighted graph, corresponding to a localized Markov transition matrix 𝐏[𝐢]. To localize, we need to remove edges between two vertices 𝖶i and 𝖶j, when Lij and Lji are “long enough” relative to what one could naïvely expect from nij, nji and Lii, Ljj. Here, for naïve expectation, we approximate the probability that log Lij exceeds a given value by a Gaussian model αij(·) (colored curves in Fig. 2A) whose mean and variance are deducible from nij and Lii [ (?), section 1.4]. The parameters in the Gaussian model are justified by detailed balance on an ergodic Markov chain, and become asymptotically exact if distinct word patterns are statistically independent (such as α13, α24, α31, α34 in Fig. 2A).

Empirically, we find that higher αij scores point to closer affinities between word patterns (Fig. 2A), attributable to kinship (Elizabeth, Jane), courtship (Darcy, Elizabeth), disposition (Darcy, pride) and so on. Our robot reader automatically detects such affinities, without references other than the novel itself. Therefore, we can use the αij scores as guides to numerical approximations of semantic fields, hereafter referred to as semantic cliques.

We invite a topical pattern 𝖶j to the semantic clique 𝒮i (Figs. 2A and B, insets) surrounding 𝖶i, if min{αij(log Lij), αji(log Lji)} > α* for a standard Gaussian threshold α* := (1/√(2π)) ∫_{−1}^{∞} e^{−x²/2} dx ≈ 0.8413. This operation emulates the brainstorming procedure of a human reader, who associates one word with another only when they stay much closer than two randomly picked words, according to his/her impression.
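The threshold and the membership rule can be checked numerically. The sketch below evaluates α* with the error function and applies the two-sided criterion to a table of scores. It assumes the αij scores have already been computed elsewhere, since the mean and variance of the Gaussian model are specified only in the cited supplementary section. The function name `semantic_clique` and the dictionary layout are hypothetical.

```python
import math

# Upper-tail mass of a standard Gaussian beyond -1, about 0.8413.
ALPHA_STAR = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))

def semantic_clique(i, alpha):
    """Patterns j with min(alpha_ij, alpha_ji) > ALPHA_STAR.
    `alpha` maps ordered pairs (i, j) to precomputed scores (hypothetical input)."""
    members = []
    for (a, b), score in alpha.items():
        # A missing reverse score is treated as 0, which excludes the pair.
        if a == i and min(score, alpha.get((b, a), 0.0)) > ALPHA_STAR:
            members.append(b)
    return sorted(members)
```

With illustrative scores (not taken from the novel) of 0.95 and 0.90 for one pair and 0.95 and 0.50 for another, only the first pair passes, because the criterion requires both directions to exceed the threshold.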

[Fig. 2, panels A to C; plot data omitted. Panel A: Gaussian-model curves αij for 𝖶1=Eliza(|beth|beth’s), 𝖶2=Darcy(|’s), 𝖶3=pr(ide|ided|oud|oudly|oudest), 𝖶4=Jane(|’s). Panel B: cumulative counts versus −log|λ(𝐑[𝐢])|.]
(0,0)(0.01337958226662,1)(1.23627718146471,2)(2.56078706446613,3)(2.65671196285956,4)(2.88596406752471,5)(3.05972637203348,6)(3.43206174427130,7); \addplot [const plot,thin,draw=green] plot coordinates (0,0)(0.02241646271159,1)(1.61053743998943,2)(2.06584195318460,3)(2.41340303393332,4)(2.84154690083292,5)(2.98507542500559,6)(2.98507542500559,7)(3.47403721759109,8)(3.47403721759109,9)(3.55871255413353,10)(3.55871255413353,11)(3.70826614565336,12)(3.80643800157878,13)(3.80643800157878,14); \addplot [const plot,thin,draw=red] plot coordinates (0,0)(0.03331149807929,1)(1.50662538164952,2)(2.06396151127533,3)(2.40676779108242,4)(3.01358374902029,5)(3.01358374902029,6)(3.13965009917626,7)(3.13965009917626,8)(3.30291818059154,9)(3.49139131768527,10)(3.97845584868851,11)(3.97845584868851,12); Indo-European Danish    German     Dutch    Spanish    French    Latin     Polish    Russian     Koreanic Korean     Turkic Turkish    Uralic Finnish     Hungarian     Vasconic Basque                                           #Topics: 0     20     40    60    80   100 120 correct close incorrect D E

WikiQA-Q26: How did Anne Frank die? Reference: “Anne Frank” (Wikipedia)

(1) Anne Frank and her sister, Margot , were eventually transferred to the Bergen-Belsen concentration camp , where they died of typhus in March 1945.

(2) Annelies "Anne" Marie Frank (, ?, ; 12 June 1929early March 1945) is one of the most discussed Jewish victims of the Holocaust .

(3) Otto Frank, the only survivor of the family, returned to Amsterdam after the war to find that Anne’s diary had been saved, and his efforts led to its publication in 1947.

(4) As persecutions of the Jewish population increased in July 1942, the family went into hiding in the hidden rooms of Anne’s father, Otto Frank ’s, office building.

(5) The Frank family moved from Germany to Amsterdam in 1933, the year the Nazis gained control over Germany.

Method        MAP      MRR
CNN           0.6190   0.6281
LISF*         0.6086   0.6263
LCLR          0.5993   0.6086
LISF          0.5897   0.6060
PV            0.5110   0.5160
Word Count    0.4891   0.4924
Random Sort   0.3913   0.3990
Fig. 2: Semantic cliques and their applications. (A) Empirical distributions of $\log L_{ij}$ in Pride and Prejudice, shown as gray and colored dots with radii 14nij, compared to the Gaussian model $\alpha_{ij}(\cdot)$ (colored curves parametrized by [(?), equations (1.10)–(1.11)]). The numerical samplings of the $\mathsf{W}_j$’s exhaust all the textual patterns available in the novel, including topical word patterns, non-topical word patterns and function words. Only those textual patterns with over 40 occurrences are displayed as data points. The inset of each frame shows the semantic clique $\mathcal{S}_i$ surrounding topic $\mathsf{W}_i$ (painted in black), color-coded by the $\alpha_{ij}(\log L_{ij})$ score. The areas of the bounding boxes for individual word patterns are proportional to the components of $\boldsymbol{\pi}^{[i]}$ (the equilibrium state of $\mathbf{P}^{[i]}$). (B) Distributions for the magnitudes of eigenvalues (LISF) in the recurrence matrices $\mathbf{R}^{[i]}$, for three concepts from four versions of Pride and Prejudice. The color encoding for languages follows Fig. 1C. The largest $e^{\eta_i}$ magnitudes of eigenvalues are displayed as solid lines, while the remaining terms are shown as dashed lines. The inset of each frame shows the semantic clique $\mathcal{S}_i$, counterclockwise from top-left, in French, Russian and Finnish. (C) Yields from bipartite matching of LISF for topical words between the English original of Pride and Prejudice and its translations into 13 languages out of 5 language families. (D) A construction of the semantic clique 𝒬𝒬 (based on $\mathcal{Q}$ = {Anne, Frank, die}) weighted by the PageRank equilibrium state $\tilde{\boldsymbol{\pi}}$, and subsequent question-answering. The top 5 candidate answers, with punctuation and spacing as given by WikiQA, are shown with font sizes proportional to the entropy production score [(?), equation (1.16)]. Here, the top-scoring sentence with highlighted background is the same as the official answer chosen by the WikiQA team. Like a human reader, our algorithm automatically detects the place “Bergen-Belsen concentration camp”, cause “typhus”, and year “1945” of Anne Frank’s death. (E) Evaluations of our model (LISF and LISF*) on the WikiQA data set, in comparison with established algorithms.

On a local graph with vertices $\mathcal{S}_i=\{\mathsf{W}_{i_1}=\mathsf{W}_i, \mathsf{W}_{i_2}, \dots, \mathsf{W}_{i_{N_i}}\}$, we specify the connectivity of each directed edge by a localized Markov matrix $\mathbf{P}^{[i]}=(p^{[i]}_{jk})_{1\leq j,k\leq N_i}$. This localized Markov matrix is the row-wise normalization of an $N_i\times N_i$ subblock of $\mathbf{P}$ with the same set of vertices as $\mathcal{S}_i$. Resetting the entries $p^{[i]}_{1k}$ and $p^{[i]}_{j1}$ to zero, one arrives at the localized recurrence matrix $\mathbf{R}^{[i]}$. We call $\mathbf{R}^{[i]}$ a recurrence matrix, because one can use it to compute the distribution of recurrence times to the Markov state $\mathsf{W}_i$ in $\mathcal{S}_i$.
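The construction above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the 4-state transition matrix `P`, the clique index list, and all function names are hypothetical, and the center word is assumed to sit at position 0 of the clique. The recurrence-time routine uses the standard first-return decomposition (leave the center, wander among non-center states via the recurrence matrix, then step back).

```python
import numpy as np

def localized_markov(P, clique):
    """Row-normalize the sub-block of P indexed by `clique` (center word first)."""
    sub = P[np.ix_(clique, clique)].astype(float)
    row_sums = sub.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("a clique state has no transitions inside the clique")
    return sub / row_sums

def recurrence_matrix(P_loc):
    """Zero the first row and first column (transitions out of and into the center)."""
    R = P_loc.copy()
    R[0, :] = 0.0
    R[:, 0] = 0.0
    return R

def recurrence_time_distribution(P_loc, n_max):
    """Probability that the first return to the center takes exactly n steps, n = 1..n_max."""
    R = recurrence_matrix(P_loc)
    leave = P_loc[0, :].copy()   # center -> non-center states
    leave[0] = 0.0
    back = P_loc[:, 0].copy()    # non-center states -> center
    back[0] = 0.0
    probs = [float(P_loc[0, 0])]  # immediate return in one step
    state = leave
    for _ in range(2, n_max + 1):
        probs.append(float(state @ back))
        state = state @ R         # one more step while avoiding the center
    return np.array(probs)

# Hypothetical global transition matrix over four word patterns (toy numbers).
P = np.array([
    [0.10, 0.50, 0.30, 0.10],
    [0.40, 0.20, 0.20, 0.20],
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.25, 0.25, 0.25],
])
clique = [0, 1, 2]                # center word first, then its neighbors
P_loc = localized_markov(P, clique)
R = recurrence_matrix(P_loc)
return_probs = recurrence_time_distribution(P_loc, 200)
```

In this toy case the return probabilities sum to one up to rounding, as expected when the center can be reached from every state in the clique.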

Experimentally, we resolve the connectivity of an individual pattern $\mathsf{W}_i$ through the recurrence spectrum $\sigma(\mathbf{R}^{[i]})$ (Fig. 2B). The dominant eigenvalues of $\mathbf{R}^{[i]}$ are concept-specific while remaining nearly language-independent (a localized version of the invariance in Fig. 1C). Such empirical evidence motivates us to define the language-independent semantic fingerprint (LISF) of a word pattern $\mathsf{W}_i$ as a descending list of the magnitudes of eigenvalues $\mathbf{v}_i=(|\lambda_1(\mathbf{R}^{[i]})|,|\lambda_2(\mathbf{R}^{[i]})|,\dots)$, computable from its semantic clique $\mathcal{S}_i$. We zero-pad this vector from the $(e^{\eta_i}+1)$st component onwards, where $\eta_i$ is the entropy production rate (?) of the Markov matrix $\mathbf{P}^{[i]}$, measured in nats per word. Via bipartite matching [(?), section 1.5, fig. S6; figs. S11–S23] of word vectors $\mathbf{v}_i$ across languages, our algorithm translates words from parallel texts at very high precision (Fig. 2C; tables S3–S4).
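A fingerprint of this kind, and a matching step across two languages, can be sketched as follows. Several choices here are assumptions rather than the procedure of the supplementary materials: the entropy production rate is taken to be the standard entropy rate of a stationary Markov chain, the cutoff uses the integer part of $e^{\eta_i}$, the matching cost is the Euclidean distance between fingerprints, and the assignment is solved with SciPy's `linear_sum_assignment`. The function names and toy numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def entropy_rate(P_loc):
    """Entropy rate in nats per step: -sum_j pi_j sum_k p_jk log p_jk (assumed definition)."""
    eigvals, eigvecs = np.linalg.eig(P_loc.T)
    pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])  # left eigenvector for eigenvalue 1
    pi = pi / pi.sum()
    logs = np.zeros_like(P_loc)
    mask = P_loc > 0
    logs[mask] = np.log(P_loc[mask])
    return float(-(pi[:, None] * P_loc * logs).sum())

def lisf(P_loc, length):
    """Descending eigenvalue magnitudes of the recurrence matrix, zeroed past floor(e^eta)."""
    R = P_loc.copy()
    R[0, :] = 0.0
    R[:, 0] = 0.0
    mags = np.sort(np.abs(np.linalg.eigvals(R)))[::-1].copy()
    keep = int(np.floor(np.exp(entropy_rate(P_loc))))  # assumption: integer part of e^eta
    mags[keep:] = 0.0
    out = np.zeros(length)
    n = min(length, len(mags))
    out[:n] = mags[:n]
    return out

def match_fingerprints(fps_a, fps_b):
    """Pair rows of fps_a with rows of fps_b, minimizing the total Euclidean distance."""
    cost = np.linalg.norm(fps_a[:, None, :] - fps_b[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy localized Markov matrix (center word first); rows sum to one.
P_loc = np.array([
    [1 / 9, 5 / 9, 1 / 3],
    [1 / 2, 1 / 4, 1 / 4],
    [2 / 3, 1 / 3, 0.0],
])
fp = lisf(P_loc, length=5)

# Hypothetical fingerprints of three concepts in two languages:
# language B lists the same concepts in a different order, slightly perturbed.
fps_a = np.array([[0.9, 0.5, 0.1], [0.7, 0.2, 0.0], [0.4, 0.3, 0.2]])
fps_b = fps_a[[2, 0, 1]] + 0.01
pairs = match_fingerprints(fps_a, fps_b)
```

Fixing a common `length` for all fingerprints keeps vectors from cliques of different sizes comparable, which is the practical role of the zero-padding in this sketch.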

So far, our semantic cliques $\mathcal{S}_i$ (Figs. 2A and B, insets) pick up words by numerical brainstorming from $\mathsf{W}_i$. These cliques inform us about their center word $\mathsf{W}_i$ through several types of semantic relations, including, but not limited to:

  • Synonyms (pride and vanity in English, orgueil and fierté in French, etc.);

  • Temperaments (Elizabeth, a delightful girl, often laughs, corresponding to French verbs sourire and rire);

  • Co-references (e.g. Darcy as a personification of pride);

  • Causalities (such as pride based on fortune).

In light of this, the semantic cliques $\mathcal{S}_i$ are useful in text comprehension and question answering. We can expand a set of question words $\mathcal{Q}$ into 𝒬𝒬, by bringing together the semantic cliques generated from a reference text by each and every question word [(?), section 1.6].

A sample workflow is shown in Fig. 2D, to illustrate how our rudimentary question-answering machine handles a query. To answer a question, we use a single Wikipedia page (without infoboxes and other structural data) as the only reference document and training source. Like a typical human reader of Wikipedia, our numerical associative reasoning generates a weighted set of nodes 𝒬𝒬 (presented graphically as a thought bubble in Fig. 2D), without the help of external stimuli or knowledge feed. Here, the relative weights [(?), section 1.6] of the nodes of 𝒬𝒬 are computable from the PageRank algorithm (?).
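To give a sense of how PageRank weights on a small word graph can be obtained, here is a minimal power-iteration sketch. It is not the weighting scheme of the supplementary materials: the adjacency matrix, the word list, the damping factor of 0.85 (the conventional default), and the uniform treatment of nodes without outgoing edges are all assumptions made for illustration.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-12, max_iter=1000):
    """Stationary distribution of the damped random walk on a weighted directed graph."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    # Row-normalize; nodes without outgoing edges jump uniformly (assumption).
    T = np.where(out > 0, adj / np.where(out > 0, out, 1.0), 1.0 / n)
    G = damping * T + (1.0 - damping) / n
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = pi @ G
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi

# Hypothetical expanded clique: the first word is linked to and from all others.
words = ["frank", "anne", "die", "typhus"]
adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
weights = pagerank(adj)
ranked_words = [words[k] for k in np.argsort(-weights)]
```

The resulting weights sum to one, so they can be read directly as relative importances of the nodes.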

We then test our semantic model (LISF in Fig. 2E) on all the 1242 questions in the WikiQA data set, each of which is accompanied by at least one correct answer located in a designated Wikipedia page. Our algorithm’s performance is roughly on par with the LCLR and CNN benchmarks (?), improving upon the baseline by a significant margin. This is perhaps remarkable, considering the relatively scant data at our disposal. Unlike the LCLR approach, our numerical discovery of synonyms does not draw on the WordNet database (?) or pre-existing corpora of question-answer pairs. Unlike the CNN method, we do not need pre-trained word2vec embeddings (?) as semantic input.
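The two scores reported in Fig. 2E, mean average precision (MAP) and mean reciprocal rank (MRR), follow their usual information-retrieval definitions. A minimal sketch is given below, assuming that each question comes with its candidate sentences already sorted by model score and labeled 1 (correct) or 0 (incorrect); the function names and the three toy questions are hypothetical and are not WikiQA data.

```python
def average_precision(labels):
    """Mean of precision@k over the ranks k at which a correct answer appears."""
    hits = 0
    total = 0.0
    for rank, is_correct in enumerate(labels, start=1):
        if is_correct:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """Inverse rank of the first correct answer, or 0 if none is present."""
    for rank, is_correct in enumerate(labels, start=1):
        if is_correct:
            return 1.0 / rank
    return 0.0

def map_and_mrr(ranked_labels):
    """Average both scores over all questions."""
    n = len(ranked_labels)
    mean_ap = sum(average_precision(labels) for labels in ranked_labels) / n
    mean_rr = sum(reciprocal_rank(labels) for labels in ranked_labels) / n
    return mean_ap, mean_rr

# Three toy questions; each list is a ranked candidate list with correctness labels.
toy_rankings = [
    [1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1],
]
toy_map, toy_mrr = map_and_mrr(toy_rankings)
```

When a question has a single correct answer, the two scores coincide for that question; they differ only when several candidates are labeled correct.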

Moreover, our algorithm (LISF* in Fig. 2E) performs slightly better on a subset of 990 questions that do not require quantitative cues (How large? How long? How many? How old? What became of? What happened to? What year? and so on). This indicates that our structural model fits associative reasoning better than rule-based reasoning (?), while imitating human behaviour in the presence of limited data.

In our current work, we define semantics through algebraic invariants that are concept-specific and language-independent. To construct such invariants, we develop a stochastic model that assigns a semantic fingerprint (list of recurrence eigenvalues) to each concept via its long-range contexts. Consistently using a single Markov framework, we are able to extract topics (Fig. 1B; figs. S10–S23), translate topics (Figs. 1C, 2B and C; figs. S11–S23) and understand topics (Figs. 2A, D and E), through statistical mining of short and medium-length texts. In view of these three successful applications, we are probably close to a complete set of semantic invariants, after demystifying the long-range behaviour of human languages. Thanks to the independence between semantics and syntax (?), our current model conveniently ignores the non-Markovian syntactic structures which are essential to fluent speech. In the near future, we hope to extend our framework further, to incorporate both Markovian and non-Markovian features across different ranges. The Mathematical Principles of Natural Languages, as we envision, must and will combine the statistical analysis of a Markov model with linguistic properties on shorter time scales that convey morphological (?, ?, ?, ?) and syntactical (?, ?, ?, ?, ?) information.

References and Notes

  • 1. A. D. Friederici, J. Bahlmann, S. Heim, R. I. Schubotz, A. Anwander, Proc. Natl. Acad. Sci. USA 103, 2458 (2006).
  • 2. M. A. Nowak, D. C. Krakauer, Proc. Natl. Acad. Sci. USA 96, 8028 (1999).
  • 3. C. Everett, D. E. Blasí, S. G. Roberts, Proc. Natl. Acad. Sci. USA 112, 1322 (2015).
  • 4. C. Everett, Frontiers in Psychology 8, Article 1285 (2017).
  • 5. S. Pinker, Nature 387, 547 (1997).
  • 6. W. D. Marslen-Wilson, L. K. Tyler, Nature 387, 592 (1997).
  • 7. E. Lieberman, J.-B. Michel, J. Jackson, T. Tang, M. A. Nowak, Nature 449, 713 (2007).
  • 8. M. G. Newberry, C. A. Ahern, R. Clark, J. B. Plotkin, Nature 551, 223 (2017).
  • 9. S. Pinker, Nature 404, 441 (2000).
  • 10. M. A. Nowak, J. B. Plotkin, V. A. A. Jansen, Nature 404, 495 (2000).
  • 11. N. Chomsky, Syntactic Structures (Mouton de Gruyter, Berlin, Germany, 2002), second edn.
  • 12. M. Dunn, S. J. Greenhill, S. C. Levinson, R. D. Gray, Nature 473, 79 (2011).
  • 13. A. D. Friederici, N. Chomsky, R. C. Berwick, A. Moro, J. J. Bolhuis, Nat. Hum. Behav. 1, 713 (2017).
  • 14. Y. Yang, W.-t. Yih, C. Meek, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics, Lisbon, Portugal, 2015).
  • 15. V. Tshitoyan, et al., Nature 571, 95 (2019).
  • 16. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Advances in Neural Information Processing Systems 26 (NIPS, La Jolla, CA, 2013), pp. 3111–3119.
  • 17. S. Arora, Y. Li, Y. Liang, T. Ma, A. Risteski, Transactions of the Association for Computational Linguistics 4, 385 (2016).
  • 18. F. de Saussure, Cours de linguistique générale (Payot, Paris, France, 1949), fifth edn.
  • 19. S. Pinker, A. Prince, Cognition 28, 73 (1988).
  • 20. A. D. Friederici, Language Comprehension: A Biological Perspective, A. D. Friederici, ed. (Springer, Berlin, Germany, 1999), chap. 9, pp. 265–304.
  • 21. Materials and methods are available as supplementary materials.
  • 22. J. P. Herrera, P. A. Pury, Eur. Phys. J. B 63, 135 (2008).
  • 23. Y. Zhou, X. Zhuang, Biophys. J. 91, 4045 (2006).
  • 24. T. M. Cover, J. A. Thomas, Elements of Information Theory (Wiley Interscience, Hoboken, NJ, 2006), second edn.
  • 25. L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web, Tech. rep., Stanford InfoLab (1999). http://ilpubs.stanford.edu:8090/422/.
  • 26. C. Fellbaum, ed., WordNet: An Electronic Lexical Database (Language, Speech, and Communication) (MIT Press, Cambridge, MA, 1998).
  • 27. S. A. Sloman, Psychol. Bull. 119, 3 (1996).

Acknowledgements

We thank N. Chomsky and S. Pinker for their input on several problems of linguistics. We thank X. Sun for discussions on neural networks. We thank X. Wan, R. Yan and D. Zhao for their suggestions on experimental design during the early stages of this work. Author contributions: W. E and Y. Z. designed the research. Y. Z. collected multilingual data, developed algorithms, and conducted numerical experiments. W. E and Y. Z. analyzed data and wrote the paper. Competing interests: The authors declare no competing interests.

Supplementary Materials

Materials and Methods

Figs. S1 to S23

Tables S1 to S12

References (28–79)