A Mathematical Model for Linguistic Universals
Weinan E${}^{1,2\ast}$, Yajun Zhou${}^{2\ast}$
${}^{1}$Department of Mathematics & Program in Applied and Computational Mathematics,
Princeton University, Princeton, NJ 08544, USA
${}^{2}$Beijing Institute of Big Data Research, Beijing 100871, P. R. China
Inspired by chemical kinetics and neurobiology, we propose a mathematical theory for pattern recurrence in text documents, applicable to a wide variety of languages. We present a Markov model at the discourse level for Steven Pinker’s “mentalese”, or chains of mental states that transcend the spoken/written forms. Such (potentially) universal temporal structures of textual patterns lead us to a language-independent semantic representation, or a translationally-invariant word embedding, thereby forming the common ground for both comprehensibility within a given language and translatability between different languages. Applying our model to documents of moderate lengths, without relying on external knowledge bases, we reconcile Noam Chomsky’s “poverty of stimulus” paradox with statistical learning of natural languages.
We human beings distinguish ourselves from other animals (?, ?, ?), in that our brain development (?, ?, ?) enables us to convey sophisticated ideas and to share individual experiences, via languages (?, ?, ?). Texts written in natural languages constitute a major medium that perpetuates our civilizations (?), as a cumulative body of knowledge.
The quantitative mechanism underlying the mental faculties of language has long been a difficult problem for anthropologists, linguists, neurobiologists and psychologists (?, ?, ?, ?, ?), before attracting the attention of computer and data scientists (?, ?, ?, ?, ?, ?), in the recent wave of artificial intelligence. Instead of marveling at the partial success of data-hungry approaches (?, ?, ?, ?) to machine learning, we still crave a cost-effective, interpretable and universal algorithm for understanding natural languages: one that mimics language acquisition and knowledge accumulation during early childhood, based on limited resources, as in Chomsky’s “poverty of stimulus” scenario (?, ?). Without closing this gap in data sizes, one cannot satisfactorily answer nativists’ criticism (?) against empiricists’ statistical models for natural languages.
Rising to the challenges outlined above, we perform a detailed mathematical analysis for computable “linguistic universals”—statistical patterns common to a wide range of human languages.
On the theoretical side, we will present a stochastic “mentalese” model that depicts the timecourse of Markov states behind individual concepts.
On the practical side, we will demonstrate (through automated word translation and question answering) that a word’s meaning can be numerically characterized by moderate-sized Markov neural networks, even when there is relatively scant data input.
Our Markov model explains, up to acceptably small error margins, how our innate language faculties (nature) may help us understand the world, by connecting dots of our past experiences (nurture), irrespective of our mother tongue.
Bridging nature to nurture, our stochastic algorithm for Markov neural semantics reconciles the views of nativists and empiricists.
Heuristic background
Languages differ in their phonemic repertoires (“elementary particles” in Jakobson’s (?) terms), word morphologies (“atoms”) and syntactic structures (“molecules”), corresponding to the three short time scales (phonological processing level, lexical level, and sentence level) in the Friederici hierarchy (?), which are mapped to different brain regions in functional magnetic resonance imaging (fMRI). These three Friederici scales exhibit no universal linguistic patterns and bear no semantic significance.
Ferdinand de Saussure’s foundational work (?) rules out semantic dependence on phonological representation (except for a limited set of onomatopoeias), while the inherent meaning of a word is affected by neither its morphological parameters (say, singular vs. plural, present vs. past) nor its syntactic rôles (say, subject vs. object, active vs. passive).
Based on the foregoing arguments, one might speculate that universal semantic content, or Pinker’s “mentalese” (?), may only exist at the discourse level (“bulk materials”, if we extrapolate Jakobson’s (?) metaphor), namely, on the longest time scale in Friederici’s neurobiological hierarchy (?). In this work, we turn such a qualitative speculation into a quantitative model (?). Concretely speaking, we observe the following statistical features of textual patterns (clusters of words that are morphologically related, see Fig. 1 and Fig. 3B for examples) shared by many languages:
1. The recurrence behavior of most textual patterns is consistent with time series generated by a certain Markov process, on the longest, as opposed to the shortest (?), neuro-linguistic time scale;
2. Recurrence kinetics of a given concept remain nearly independent of the language in which it is expressed;
3. Kinetic data quantify the semantic distance between different textual patterns, thus allowing us to construct semantic fields by statistical computations.
These long-range temporal features of documents written in various languages, in our opinion, point to a universal kinetic mechanism that defines the semantic rôles of individual nodes in a web of words, mathematically and linguistically.
To begin, we show how to numerically construct a Markov matrix from a realistic text document, and how this Markov model enables us to interpret long-range temporal structures that are common to a wide variety of languages.
In our kinetic language model, we assume that (the gists of) texts are generated by a discrete Markov process on a semantic web with $N$ nodes, each of which represents a textual pattern $\mathsf{W}_{k}$, a set of morphologically related content words (?) indexed by an integer $k\in \{1,2,\dots,N\}$, occurring in a given document (?). The stochastic hops between the nodes are governed by an ergodic Markov transition matrix $\mathbf{P}=(p_{ij})_{1\le i,j\le N}$, which (putatively) caricatures the dynamics of mental activities (?) underlying a text, on the time scale of discourse level. We emphasize that our Markov model for the long-range behavior of human languages is independent of Chomsky’s transformational generative grammar (?), the latter of which characterizes short-range syntactic features as hierarchical trees without Markovian structures.
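To make the generative assumption concrete, here is a minimal sketch of sampling a chain of textual-pattern indices from a transition matrix. The 3-node matrix `P`, the function name `sample_pattern_chain`, and the fixed seed are illustrative assumptions, not quantities from the paper; indices are 0-based here, whereas the text uses $k\in\{1,\dots,N\}$.

```python
import random

def sample_pattern_chain(P, start, steps, seed=0):
    """Sample a path of node indices from a row-stochastic matrix P."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    states = list(range(len(P)))
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Draw the next node with probabilities given by row `current` of P.
        nxt = rng.choices(states, weights=P[current], k=1)[0]
        path.append(nxt)
    return path

# Hypothetical ergodic transition matrix on N = 3 nodes (all entries positive).
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]

path = sample_pattern_chain(P, start=0, steps=20)
print(path)
```

Because every entry of this toy matrix is positive, the chain can reach any node from any other, which is one simple way to satisfy the ergodicity assumption.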
One can estimate transition probabilities between textual patterns on short time scales, by simply counting unigrams and bigrams in a large corpus (?). To estimate long-range transition probabilities from documents of moderate lengths (e.g. a literary piece, a Wikipedia page), i.e. to learn despite a “poverty of the stimulus”, we need some makeshift strategies.
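For contrast with the long-range estimate developed next, the short-range counting approach can be sketched as follows. This is a generic maximum-likelihood bigram estimate under the assumption that the text has already been tokenized; the function name `bigram_transition_probs` and the toy sentence are illustrative only.

```python
from collections import Counter

def bigram_transition_probs(tokens):
    """Estimate p(b | a) as count(a, b) / count(a as a bigram's first token)."""
    # Count each token as a "source" only where it has a successor.
    unigrams = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

# Toy input, not taken from any corpus used in the paper.
tokens = "the cat saw the dog and the cat ran".split()
probs = bigram_transition_probs(tokens)

for (a, b), p in sorted(probs.items()):
    print(f"p({b} | {a}) = {p:.3f}")
```

Such counts become reliable only with a large corpus, which is why a document of moderate length calls for a different estimator.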
Given a timecourse of molecular states in a biochemical reaction, we can partially reconstruct kinetic information (?) from the probability distribution for the waiting time between consecutive encounters of the same molecular state. Carrying this waiting time analysis a little further, we put a crude estimate of the transition probability $p_{ij}$ as
$$\check{p}_{ij}:=\frac{n_{ij}e^{-\langle \log L_{ij}\rangle}}{\sum_{k=1}^{N}n_{ik}e^{-\langle \log L_{ik}\rangle}},$$
where $n_{ij}$ counts the number of effective transitions from $\mathsf{W}_{i}$ to $\mathsf{W}_{j}$, and $L_{ij}$ is a statistic that measures the reduced fragment lengths of such transitions (Fig. 1).
On the diagonal, the Gibbs weights $n_{ii}e^{-\langle \log L_{ii}\rangle}$ hearken back to the TF-IDF measure of word importance (?, ?). Off the diagonal, the ensemble average $\langle \log L_{ij}\rangle$ weighs the cost of biochemical activation energy required to jump from $\mathsf{W}_{i}$ to $\mathsf{W}_{j}$, so that the memorability factor $e^{-\langle \log L_{ij}\rangle}$ can be viewed as a naïve estimate for the rate of associative learning per copy number $n_{ij}$, in Hebb’s fire-and-wire process (?). It is worth noting that our estimate of $\check{p}_{ij}$ was based on statistical analysis of the text in situ, without digesting a document (or small parts of it) as a scrambled bag of words, a procedure implemented in conventional algorithms (?, ?, ?).
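As a minimal sketch, assuming the estimate is the row-normalized form of the Gibbs weights $n_{ij}e^{-\langle \log L_{ij}\rangle}$ described above, the empirical matrix could be assembled as follows. How $n_{ij}$ and $L_{ij}$ are extracted from a text is not specified here, so both are taken as given inputs; the function name and the toy numbers are hypothetical.

```python
import math

def empirical_markov_matrix(n, mean_log_L):
    """Row-normalize the weights n[i][j] * exp(-mean_log_L[i][j])."""
    N = len(n)
    P = []
    for i in range(N):
        weights = [n[i][j] * math.exp(-mean_log_L[i][j]) for j in range(N)]
        total = sum(weights)
        if total == 0:
            raise ValueError(f"row {i} has no observed transitions")
        P.append([w / total for w in weights])
    return P

# Hypothetical toy inputs for three textual patterns (not data from the paper).
n = [[4, 2, 1],
     [2, 3, 1],
     [1, 1, 2]]
mean_log_L = [[math.log(10), math.log(20), math.log(50)],
              [math.log(25), math.log(8),  math.log(40)],
              [math.log(50), math.log(40), math.log(16)]]

P_check = empirical_markov_matrix(n, mean_log_L)
for row in P_check:
    print([round(p, 3) for p in row])
```

Frequent transitions over short fragments receive large weights, while rare or long-range transitions are exponentially discounted; normalizing each row turns the weights into transition probabilities.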
The empirical Markov matrix $\check{\mathbf{P}}=(\check{p}_{ij})_{1\le i,j\le N}$ has some desirable properties.