### Abstract

We consider languages generated by weighted context-free grammars. It isshown that the behaviour of large texts is controlled by saddle-point equationsfor an appropriate generating function. We then consider ensembles of grammars,in particular the Random Language Model of E. DeGiuli, Phys. Rev. Lett., 122,128301, 2019. This model is solved in the replica-symmetric ansatz, which isvalid in the high-temperature, disordered phase. It is shown that in the phasein which languages carry information, the replica symmetry must be broken.

### Quick Read (beta)

# Emergence of order in random languages

###### Abstract

We consider languages generated by weighted context-free grammars. It is shown that the behavior of large texts is controlled by saddle-point equations for an appropriate generating function. We then consider ensembles of grammars, in particular the Random Language Model of [1]. This model is solved in the replica-symmetric ansatz, which is valid in the high-temperature, disordered phase. It is shown that in the phase in which languages carry information, the replica symmetry must be broken.

^{†}

^{†}: \jpa

Keywords: context-free grammar, language, replicas

Many complex systems have a generative, or linguistic, aspect. For example, protein structure is written in sequences of amino acids, a language of 20 different symbols. A large body of previous work has investigated the social aspect of linguistic systems, namely that different agents must find consensus regarding the meaning of symbols [2, 3, 4]. A complementary but necessary aspect of any linguistic system concerns the hidden structure within the sequences themselves, independent of communication. The most basic structural property is syntax: the rules that govern how symbols can be combined to create richer structures and thus carry information. In computer science and linguistics, generative grammar has proved to be a valuable formalism to describe syntax, in a generalized sense [5, 6, 7]. A generative grammar consists of an alphabet of hidden symbols, an alphabet of observable symbols, and a set of rules, which allow certain combinations of symbols to be replaced by others. From an initial start symbol S, one progressively applies the rules until only observable symbols remain; any sentence produced this way is said to be grammatical, and the set of all such sentences is called the language of the grammar. The sequence of rule applications is called a derivation. The Chomsky hierarchy distinguishes grammars based on the complexity of the grammatical rules. In this work, we restrict our attention to context-free grammars (CFGs), for which derivations are trees (Figure 1).

There are many theoretical results on the capabilities of CFGs [7]. However, little is known about the statistical properties of large, typical grammars. Recently, there has been increasing interest in approaching the properties of syntax from the point of view of statistical [8, 1] and quantum [9, 10, 11] physics. In [1], a simple model of random languages was proposed, using weighted context-free grammars. (We refer the reader to this work for all necessary details about generative grammars). With numerical simulations it was shown that as a ‘temperature’ ${\u03f5}_{d}$ is lowered, the model has a transition from a ‘disordered’ phase in which the grammar just produces noise, to an ‘ordered’ phase in which the grammar produces sequences with a nontrivial structure. This transition is characterized by emergence of order in the hidden structure, as measured by an order parameter ${Q}_{2}$, defined below. As the temperature is lowered through ${\u03f5}_{*}$, ${Q}_{2}$ rapidly increases. With a simple scaling argument, the location of this transition was understood as the place where energetic fluctuations begin to be comparable to entropy. However, detailed information on the transition was not obtained in [1]. In this work we use the replica method, as well as diagrammatic methods, to study random languages. We will solve the Random Language Model of [1] in the replica-symmetric phase, and show that it quantitatively captures the behavior of ${Q}_{2}$, as shown in Figure 2. We will see that the replica symmetry must be broken in the ordered phase.

## 1 Partition functions

A CFG in Chomsky Normal form is defined by two types of rules: $a\to bc$, where one hidden symbol becomes two hidden symbols, and $a\to B$, where a hidden symbol becomes an observable symbol. In a weighted CFG (WCFG), we assign weights ${M}_{abc}$ and ${O}_{aB}$, respectively, to these rules. With these weights we build a weight for an entire derivation tree, $\mathcal{T}$, as follows. We write $\sigma $ for the hidden variables, and $o$ for the observables. These are indexed by their positions on the tree. We write ${\mathrm{\Omega}}_{\mathcal{T}}$ for the set of internal factors, i.e. factors of the form $a\to bc$, and $\partial {\mathrm{\Omega}}_{\mathcal{T}}$ for the boundary factors, i.e. those associated to $a\to B$ rules. The number of boundary factors is written ${\mathrm{\ell}}_{\mathcal{T}}$, which is also the number of leaves. Since derivations are trees, the number of internal factors is ${\mathrm{\ell}}_{\mathcal{T}}-1$. Write $\mathcal{G}$ for the objects $M$ and $O$, which specify the grammar. A table of commonly appearing symbols is shown in Table 1.

Symbol | Definition | Linguistic Interpretation |
---|---|---|

$a,b,\mathrm{\dots}$ |
alphabet of hidden symbols (non-terminal symbols) |
abstract categories: noun, verb, noun phrase, … |

$S$ | start symbol | ‘sentence’ |

$A,B,\mathrm{\dots}$ | alphabet of observable symbols | words: ‘the’, ‘bear’, ‘walked’, $\mathrm{\dots}$ |

(terminal symbols) | ||

${M}_{abc}$ | weight for rule $a\to bc$ | grammatical weight |

${O}_{aB}$ | weight for rule $a\to B$ | part-of-speech weight |

${\sigma}_{i}$ | hidden symbol on site $i$ | |

${o}_{i}$ | observable symbol on site $i$ | |

${\pi}_{abc}$ | # of occurrences of rule $a\to bc$ | |

${\rho}_{aB}$ | # of occurrences of rule $a\to B$ | |

$N$ | # of hidden symbols | # of abstract categories |

$T$ | # of observable symbols | # of words in lexicon |

$m$ | # of sentences | |

$\mathrm{\ell}$ | # number of leaves | total number of words |

${\mathrm{\ell}}_{0}$ | $\mathrm{\ell}-m$ | |

${L}_{a}$ | downward msg of left-child | |

${L}_{a}^{\u2020}$ | upward msg of left-child | |

${R}_{a}$ | downward msg of right-child | |

${R}_{a}^{\u2020}$ | upward msg of right-child | |

${H}_{a}$ | ${L}_{a}^{\u2020}+{R}_{a}^{\u2020}$ | head message |

$\eta ,\zeta ,\xi $ | Lagrange multipliers to control | |

# of branches, roots, leaves | ||

${\u03f5}_{d}$ | deep temperature | inverse variance of $M$ |

${\u03f5}_{s}$ | surface temperature | inverse variance of $O$ |

${\stackrel{~}{\u03f5}}_{d}$ | ${\u03f5}_{d}/{N}^{3}$ | |

${\stackrel{~}{\u03f5}}_{s}$ | ${\u03f5}_{s}/(NT)$ |

Consider a derivation tree, such as that depicted in Figure 1a. In a weighted context-free grammar, a tree $\mathcal{T}$ with hidden variables ${\sigma}_{i}$ and observables ${o}_{t}$ has a weight

$W(\{{\sigma}_{i},{o}_{t}\}|\mathcal{T},\mathcal{G})={\displaystyle \prod _{\alpha \in {\mathrm{\Omega}}_{\mathcal{T}}}}{M}_{{\sigma}_{{\alpha}_{1}}{\sigma}_{{\alpha}_{2}}{\sigma}_{{\alpha}_{3}}}{\displaystyle \prod _{\alpha \in \partial {\mathrm{\Omega}}_{\mathcal{T}}}}{O}_{{\sigma}_{{\alpha}_{1}}{o}_{{\alpha}_{2}}},$ | (1) |

where each $\alpha =({\alpha}_{1},{\alpha}_{2},{\alpha}_{3})$ is a factor in the order ${\sigma}_{{\alpha}_{1}}\to {\sigma}_{{\alpha}_{2}}{\sigma}_{{\alpha}_{3}}$. This defines a conditional probability measure on configurations

$\mathbb{P}(\{{\sigma}_{i},{o}_{t}\}|\mathcal{T},\mathcal{G})={\displaystyle \frac{W(\{{\sigma}_{i},{o}_{t}\}|\mathcal{T},\mathcal{G})}{Z(\mathcal{T},\mathcal{G})}}$ | (2) |

where

$Z(\mathcal{T},\mathcal{G})={\displaystyle \sum _{\{{\sigma}_{i},{o}_{t}\}}}W(\{{\sigma}_{i},{o}_{t}\}|\mathcal{T},\mathcal{G})$ | (3) |

Eq.(2) specifies the probability of a derivation, given the tree. However, the topology of the tree is itself dynamical. We will consider that trees are chosen as a Bernoulli process: beginning from the root, each hidden variable either becomes an observable, with probability $p$, or branches into two hidden variables, with probability $1-p$. This implies that $\mathbb{P}(\mathcal{T}|\mathcal{G})={W}_{tree}(\mathcal{T})/{Z}_{tree}$ with ${W}_{tree}(\mathcal{T})={p}^{|\partial {\mathrm{\Omega}}_{\mathcal{T}}|}{(1-p)}^{|{\mathrm{\Omega}}_{\mathcal{T}}|}$. The ‘emission probability’ $p$ controls the size of trees ^{1}^{1}
1
${Z}_{tree}=2p/(1+|2p-1|)=1$ for $p>1/2$..
The tree-averaged partition function is then

$\mathbb{Z}(\mathcal{G};\mathrm{\ell})={\displaystyle \sum _{\mathcal{T},\mathrm{\ell}(\mathcal{T})=\mathrm{\ell}}}\mathbb{P}(\mathcal{T}|\mathcal{G})Z(\mathcal{T},\mathcal{G})$ | (4) |

as a function of the number of leaves $\mathrm{\ell}$, and the partition function for $m$ sentences of total length $\mathrm{\ell}$ is

$\mathbb{Z}(\mathcal{G};m,\mathrm{\ell})={\displaystyle \sum _{\{{\mathrm{\ell}}_{i}\},{\sum}_{i=1}^{m}{\mathrm{\ell}}_{i}=\mathrm{\ell}}}{\displaystyle \prod _{i}}\mathbb{Z}(\mathcal{G};{\mathrm{\ell}}_{i})$ | (5) |

We have ${\sum}_{i}|\partial {\mathrm{\Omega}}_{{\mathcal{T}}_{i}}|=\sum {\mathrm{\ell}}_{i}=\mathrm{\ell}$, and ${\sum}_{i}|{\mathrm{\Omega}}_{{\mathcal{T}}_{i}}|=\sum (2{\mathrm{\ell}}_{i}-1)=2\mathrm{\ell}-m$, so that ${\mathbb{Z}}_{tree}\equiv {\prod}_{i}\mathbb{P}({\mathcal{T}}_{i}|\mathcal{G})={p}^{\mathrm{\ell}}{(1-p)}^{2\mathrm{\ell}-m}{Z}_{tree}^{-m}$ just gives a trivial factor. For now we suppress dependence on $m$ and $\mathrm{\ell}$. Note that $\mathbb{Z}(\mathcal{G};m,\mathrm{\ell})$ is the weight for the grand canonical partition function ${\sum}_{m,\mathrm{\ell}\ge 0}\mathbb{Z}(\mathcal{G};m,\mathrm{\ell})$; we will, however, work at fixed $m$ and $\mathrm{\ell}$.

## 2 Energy, Entropy, and Order Parameters

It is convenient to add some auxiliary parameters in order to extract additional observables. First, we note that (1) can be written as

$W(\{{\sigma}_{i},{o}_{t}\}|\mathcal{T},\mathcal{G})={\displaystyle \prod _{a,b,c}}{M}_{abc}^{{\pi}_{abc}}{\displaystyle \prod _{a,B}}{O}_{aB}^{{\rho}_{aB}}$ | (6) |

where ${\pi}_{abc}(\sigma )$ is the usage frequency of rule $a\to bc$ and ${\rho}_{aB}(\sigma ,o)$ is the usage frequency of $a\to B$. Adding external fields ${h}_{abc}$ and ${k}_{aB}$ let us define the energy of a configuration as

$E=-{\displaystyle \sum _{a,b,c}}{\pi}_{abc}\left[{h}_{abc}+\mathrm{log}{M}_{abc}\right]-{\displaystyle \sum _{a,B}}{\rho}_{aB}\left[{k}_{aB}+\mathrm{log}{O}_{aB}\right]$ | (7) |

Then we can generalize (3) to

${Z}_{h,k}(\mathcal{T},\mathcal{G})={\displaystyle \sum _{\{{\sigma}_{i},{o}_{t}\}}}{e}^{-\beta E}$ | (8) |

where we added a bias $\beta $. The original ensemble is recovered for $\beta =1$ and $h=k=0$. We see that the average energy is

$\u27e8E\u27e9=-{\displaystyle \frac{\partial \mathrm{log}Z}{\partial \beta}}$ | (9) |

and it is natural to define the entropy of the grammar as $S=\beta \u27e8E\u27e9+\mathrm{log}Z$.

In [1] it was argued that a natural order parameter for WCFGs is one that measures the extent to which rules are applied uniformly: if all rules $a\to bc$ and $a\to B$ have the same weight, the grammar carries no information, and sentences will be indistinguishable from noise. To measure order in the deep grammar, define first

${Q}_{abc}(\mathcal{G})=\u27e8{\delta}_{{\sigma}_{{\alpha}_{1}},a}\left({N}^{2}{\delta}_{{\sigma}_{{\alpha}_{2}},b}{\delta}_{{\sigma}_{{\alpha}_{3}},c}-1\right)\u27e9={\displaystyle \frac{{N}^{2}}{{\mathrm{\ell}}_{0}}}\u27e8{\pi}_{abc}\u27e9-{\displaystyle \frac{1}{{\mathrm{\ell}}_{0}}}{\displaystyle \sum _{{b}^{\prime},{c}^{\prime}}}\u27e8{\pi}_{a{b}^{\prime}{c}^{\prime}}\u27e9,$ | (10) |

averaged over all interior vertices $\alpha $, and averaged over derivations. The normalization ${\mathrm{\ell}}_{0}=\mathrm{\ell}-m$ for $\mathbb{Z}$. A spin-glass order parameter specific to deep structure is

${Q}_{2}\equiv \overline{{\displaystyle \sum _{a,b,c}}{Q}_{abc}^{2}}={\displaystyle \frac{{N}^{5}({N}^{2}-1)}{{\mathrm{\ell}}_{0}^{2}}}({q}_{0}-{q}_{1})$ | (11) |

where

${q}_{0}=\overline{{\u27e8{\pi}_{abc}\u27e9}^{2}},{q}_{1}=\overline{\u27e8{\pi}_{abc}\u27e9\u27e8{\pi}_{a{b}^{\prime}{c}^{\prime}}\u27e9},$ | (12) |

with ${b}^{\prime}\ne b,{c}^{\prime}\ne c$. Here we used the fact that when $h=0,k=0$, the permutation symmetry is restored upon disorder averaging. We see that

$\u27e8{\pi}_{abc}\u27e9={\displaystyle \frac{1}{\beta}}{\displaystyle \frac{\partial \mathrm{log}Z}{\partial {h}_{abc}}}$ | (13) |

so ${q}_{0}$ and ${q}_{1}$ can be obtained from derivatives of $\overline{\mathrm{log}Z}$ with respect to the field.

## 3 Diagrammatic formulation

We expect that universal properties of weighted context-free languages are contained in the behavior of $\overline{\mathrm{log}\mathbb{Z}}$ when $\mathrm{\ell}$ becomes large, where $\overline{\cdot}$ is an average over grammars. In order to compute this object, we find it convenient to move to an alternative, particle, representation. In particular, we seek a model whose diagrammatic expansion gives the derivation trees we seek to count, with the appropriate weights. This technique has been widely used in the study of 2D gravity [13], and facilitated Kazakov’s solution of the Ising model on random surfaces [14, 15]. Later, it was shown in a simpler setting that this technique could be used to easily obtain results for spin models on random graphs [16, 17].

We begin with a simplified model. Consider the formal integral

$W={|G|}^{-N/2}{\displaystyle \int DU{e}^{-\frac{1}{2}{\sum}_{i,j}{U}_{i}{G}_{ij}^{-1}{U}_{j}}{e}^{\xi {\sum}_{i}{U}_{i}+\eta {\sum}_{i,j,k}{U}_{i}{U}_{j}{U}_{k}{M}_{ijk}}},$ | (14) |

where the measure is normalized such that ${|G|}^{-N/2}\int DU\mathrm{exp}(-\frac{1}{2}{\sum}_{i,j}{U}_{i}{G}_{ij}^{-1}{U}_{j})=1$. Strictly speaking, the integral is only defined by its perturbative expansion in $\eta $; convergence requires that the real part of ${G}^{-1}$ be positive-definite. This expansion generates Feynman diagrams with cubic vertices. Each vertex gets a factor $\eta {M}_{ijk}$, and the expansion with respect to $\xi $ generates sources. By Wick’s theorem, each edge gets a factor ${G}_{ij}$, the propagator. The coefficient in this expansion of ${\xi}^{m}{\eta}^{k}$ thus counts all such diagrams, possibly disconnected, with $k$ vertices and $m$ sources, times an inverse symmetry factor [18] ^{2}^{2}
2
Such factors appear when diagrams, including all their colour indices, have nontrivial symmetries, like reflections. In the disordered case where ${M}_{abc}$ depends on all indices, these symmetry factors will not play a role since typical connected graphs will have no symmetries..

This is a skeleton of what we need to count derivation trees, but there are several elements missing: first, $W$ includes all graphs with cubic vertices, not only trees. Second, even if we could restrict the sum to trees, there is nothing in $W$ to distinguish leaves from roots, or to distinguish the left and right branches from a given hidden node.

One solution to these problems is to use matrices as the integration variables, because their diagrammatic expansion can be arranged to give planar diagrams in an appropriate large $N$ limit [19, 20]. However, for our problem this is overkill. Instead, we will consider a theory of complex scalar fields with colour indices, equivalent to a complex matrix model with matrices of size 1x1. We will have two scalar fields ${L}_{a}$ and ${R}_{a}$, with colour indices $a=1\mathrm{\dots}N$. Consider

$\mathbb{F}(\mathcal{G})={\displaystyle \int DL\int DR{e}^{-\frac{1}{g}{\sum}_{a}\left[{L}_{a}{L}_{a}^{\u2020}+{R}_{a}{R}_{a}^{\u2020}\right]}{e}^{I}}$ | (15) |

where ${}^{\u2020}$ denotes complex conjugate and

$I=\zeta h({L}_{1}+{R}_{1})+\xi {\displaystyle \sum _{a}}{O}_{a}({L}_{a}^{\u2020}+{R}_{a}^{\u2020})+\eta {\displaystyle \sum _{a,b,c}}{M}_{abc}({L}_{a}^{\u2020}+{R}_{a}^{\u2020}){L}_{b}{R}_{c}.$ | (16) |

with ${O}_{a}={\sum}_{B}{O}_{aB}$. The measure $DL={\prod}_{a}dRe[{L}_{a}]dIm[{L}_{a}]/(\pi g)$ is normalized such that $\int DL{e}^{-\frac{1}{g}{\sum}_{a}{L}_{a}{L}_{a}^{\u2020}}=1$, and similarly for $R$.

The propagator is diagonal in the colour indices $a=1\mathrm{\dots}N$ and for each $a$ is such that $\u27e8{L}_{a}^{\u2020}{L}_{a}\u27e9=g$; that is, the Feynman rules are as shown in Figure 3. The diagram corresponding to Figure 1a is shown in Figure 4. We claim that, apart from accidental symmetry factors,

$\mathbb{Z}(\mathcal{G};m,\mathrm{\ell})=m!{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d\zeta}{{\zeta}^{1+m}}}{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d\xi}{{\xi}^{1+\mathrm{\ell}}}}{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d\eta}{{\eta}^{1+\mathrm{\ell}-m}}}\mathbb{F}(\mathcal{G}),$ | (17) |

where ${\oint}^{\prime}=\oint /(2\pi i)$. The proof is as follows.

The perturbative expansion with respect to $\eta $ generates cubic vertices with distinguished heads and $L$ and $R$ ends; the vertices can be placed on the plane such that that their heads point up. The expansion and contour integral for $\xi $ generate $\mathrm{\ell}$ sources of ${L}^{\u2020}$ and ${R}^{\u2020}$, which are the leaves. The expansion and contour integral for $\zeta $ generate $m$ sources of ${L}_{1}$ and ${R}_{1}$, which are the roots. An ${L}_{a}$ can only be connected to a ${L}_{a}^{\u2020}$, and a ${R}_{a}$ can only be connected to a ${R}_{a}^{\u2020}$. We can orient all edges from ${L}_{a}\to {L}_{a}^{\u2020}$ and ${R}_{a}\to {R}_{a}^{\u2020}$, and similarly for the half-edges in the cubic vertices, flowing from head to L and R branches. Then, from each root, we can define paths by following arrows; any path we take will go through some number of cubic vertices and end in a leaf. Therefore there are $m$ connected components, one for each root. Considered as a graph, we can count the number of edges as follows: each source generates half an edge, and each cubic vertex generates $3/2$ edges. The total number of edges is $\frac{1}{2}(m+3(\mathrm{\ell}-m)+\mathrm{\ell})=2\mathrm{\ell}-m$. The difference between the number of vertices and the number of edges is $(m+\mathrm{\ell}+\mathrm{\ell}-m)-(2\mathrm{\ell}-m)=m$, which is the number of connected components. Therefore the graph is a forest.

We thus generate a forest of $m$ planar, rooted, trees, with $\mathrm{\ell}$ leaves in total. The weight of each diagram is

${h}^{m}{g}^{2\mathrm{\ell}-m}{\displaystyle \prod _{a,b,c}}{M}_{abc}^{{\pi}_{abc}}{\displaystyle \prod _{aB}}{O}_{aB}^{{\rho}_{aB}}$ | (18) |

where ${\pi}_{abc}$ is the usage frequency of rule $a\to bc$ and ${\rho}_{aB}$ is the usage frequency of $a\to B$. This expansion counts diagrams with a degeneracy of ${2}^{m}$ since each tree root can be either an $L$ or an $R$. In the expansion, the different connected components are not ordered. We would like to distinguish forests by the order of the trees, so we multiply the result by $m!$. Choosing $g$ and $h$ such that

${(2h)}^{m}{g}^{2\mathrm{\ell}-m}={p}^{\mathrm{\ell}}{(1-p)}^{2\mathrm{\ell}-m}{Z}_{tree}^{-m}$ | (19) |

for all $\mathrm{\ell}$ and $m$ we have our result.

The virtue of working with (17) is that when $\mathrm{\ell}\to \mathrm{\infty},m/\mathrm{\ell}=\alpha =$ constant, the leading behavior can be extracted by a saddle-point analysis [21] ^{3}^{3}
3
This can be seen explicitly by considering the rescaling $L={\mathrm{\ell}}^{1/2}{L}^{\prime},R={\mathrm{\ell}}^{1/2}{R}^{\prime},\eta ={\mathrm{\ell}}^{-1/2}{\eta}^{\prime},\xi ={\mathrm{\ell}}^{1/2}{\xi}^{\prime},\zeta ={\mathrm{\ell}}^{1/2}{\zeta}^{\prime}$. . There is one subtlety. The integration variables are the real and imaginary parts of $L$ and $R$, and the saddle-point equations should be taken with respect to these parts. The solutions to $Re[{L}_{a}]$ and $Im[{L}_{a}]$, which may be complex, are then added to produce ${L}_{a}=Re[{L}_{a}]+iIm[{L}_{a}]$, and similarly for $R$. By linearity, this is equivalent to taking saddle-point equations with respect to ${L}_{a}$ and ${L}_{a}^{\u2020}$, and treating ${L}_{a}$ and ${L}_{a}^{\u2020}$ as independent. It is convenient to write ${H}_{a}={L}_{a}^{\u2020}+{R}_{a}^{\u2020}$ for a head. The saddle-point equations are

${L}_{a}$ | $=g\xi {O}_{a}+g\eta {\displaystyle \sum _{b,c}}{M}_{abc}{L}_{b}{R}_{c}$ | (20) | ||

${L}_{a}^{\u2020}$ | $=gh\zeta {\delta}_{a1}+g\eta {\displaystyle \sum _{{a}^{\prime},b,c}}{M}_{{a}^{\prime}bc}{H}_{a}{\delta}_{ab}{R}_{c}$ | (21) | ||

${R}_{a}$ | $=g\xi {O}_{a}+g\eta {\displaystyle \sum _{b,c}}{M}_{abc}{L}_{b}{R}_{c}$ | (22) | ||

${R}_{a}^{\u2020}$ | $=gh\zeta {\delta}_{a1}+g\eta {\displaystyle \sum _{{a}^{\prime},b,c}}{M}_{{a}^{\prime}bc}{H}_{a}{L}_{b}{\delta}_{ac}$ | (23) | ||

$\mathrm{\ell}-m$ | $=\eta {\displaystyle \sum _{a,b,c}}{M}_{abc}{H}_{a}{L}_{b}{R}_{c}$ | (24) | ||

$\mathrm{\ell}$ | $=\xi {\displaystyle \sum _{a}}{O}_{a}{H}_{a}$ | (25) | ||

$m$ | $=\zeta h({L}_{1}+{R}_{1})$ | (26) |

for all $a$.

(20),(21) and their pairs (22),(23) have an interpretation as recursion equations, which are equivalent to the saddle-point limit of Tutte recursion relations or loop equations in related contexts [22, 23]. They are also related to self-consistent equations derived for spin glasses on trees ^{4}^{4}
4
For example, ${L}_{a}$ is analagous to what is called $\varphi (\{{\sigma}^{a}\})$ in [16] and ${g}_{n}(\{{\sigma}_{a}\})$ in [24]. In these works, $\varphi $ and ${g}_{n}$ are functions of $n$ Ising variables, so that they can take $2n$ different values. Below, we will replicate the ${L}_{a}$ so that they take $nN$ different values; $N=2$ is the Ising case.. Indeed, any ${L}_{a}$ node can either propagate into ${L}_{a}^{\u2020}$ and become a leaf, with weight $g\xi {O}_{a}$, or propagate into ${L}_{a}^{\u2020}$ and become another branch, with weight $g\eta {\sum}_{b,c}{M}_{abc}{L}_{b}{R}_{c}$, including all possibilities. This gives (20). Similarly, any ${L}_{a}^{\u2020}$ node is either the child of a root, with weight $g\zeta h{\delta}_{b1}$, or the child of a branch, with weight $g\eta {\sum}_{{a}^{\prime},a,c}{M}_{{a}^{\prime}ac}({L}_{a}^{\u2020}+{R}_{a}^{\u2020}){R}_{c}$, including all possibilities. This gives (21).

For specific grammars, (20)-(26) can be explicitly analyzed. Indeed, after writing these equations in terms of real variables, these take the form of ‘context-free schema’ as defined in section VII.6.1 in [25]. This class of equations, mainly defined by the fact that coefficients in the right-hand side of (20)-(26) are positive, generically have square root singularities at their radius of convergence [25]. Moreover, they can be solved by iteration.

These equations are also related to cavity equations, or belief propagation equations, in the literature of disordered systems [26]. For any probabilistic model living on a tree, the cavity equations are a closed, self-consistent set of equations that can be used to compute the partition function. The variables of these equations, known as messages, are probability measures on the symbols of the graph. Here the symbols are the hidden symbols, hence each message can be considered a variable with a colour index $a=1\mathrm{\dots}N$. In general, there are two messages per site: one going into the interaction, which here is a branch, and one outgoing. For a WCFG, the interaction at a branch depends only on the values of the hidden symbols there; there is no intrinsic site-to-site disorder. Therefore for large trees, one can seek a solution to the cavity equations that is independent of site, only depending on whether sites are the left or right children of their parent. In this case, the cavity equations become (20)-(26), up to some normalization factors. Therefore the variables $L$ and $R$, which above were introduced as dummy variables to generate the diagrammatic expansion, in fact have their proper interpretation as messages: $L$ is a downward message, while ${L}^{\u2020}$ is an upward message. In an iterative scheme to solve the cavity equations, these variables have additional time indices, hence the dynamical aspect of their name; (20)-(26) are the fixed-point equations.

In what follows, our interest, however, is in extracting the behavior of typical grammars in the case when $N$ is large. For this we need to choose an ensemble.

## 4 Grammar ensembles

We will consider two models. In [1] it was argued that a generic model will have lognormally distributed weights, viz.,

${\mathbb{P}}_{G}(M,O)$ | $\equiv {Z}_{G}^{-1}J{e}^{-{\u03f5}_{d}{s}_{d}}{e}^{-{\u03f5}_{s}{s}_{s}}$ | (27) |

where the deep and surface sparsities ${s}_{d}$ and ${s}_{s}$ are defined by

${s}_{d}={\displaystyle \frac{1}{{N}^{3}}}{\displaystyle \sum _{a,b,c}}{\mathrm{log}}^{2}\left[{\displaystyle \frac{{M}_{abc}}{\overline{M}}}\right],{s}_{s}={\displaystyle \frac{1}{NT}}{\displaystyle \sum _{a,B}}{\mathrm{log}}^{2}\left[{\displaystyle \frac{{O}_{aB}}{\overline{O}}}\right]$ | (28) |

and $J={e}^{-{\sum}_{a,b,c}\mathrm{log}{M}_{abc}-{\sum}_{a,B}\mathrm{log}{O}_{aB}}$. Here $\overline{M}=1/{N}^{2}$ and $\overline{O}=1/T$. A plot of the weights for a range of ${\stackrel{~}{\u03f5}}_{d}={\u03f5}_{d}/{N}^{3}$ is shown in Fig.5a. It is straightforward to show that ${\u03f5}_{d}$ and ${\u03f5}_{s}$ satisfy

$\overline{{s}_{d}}={(2{\stackrel{~}{\u03f5}}_{d})}^{-1},\overline{{s}_{s}}={(2{\stackrel{~}{\u03f5}}_{s})}^{-1}.$ | (29) |

where $\overline{\cdot}$ denotes a grammar average and ${\stackrel{~}{\u03f5}}_{s}={\u03f5}_{s}/(NT)$. A small ‘deep temperature’ ${\u03f5}_{d}$ corresponds to a large deep sparsity. The model (27) was called in [1] the Random Language Model (RLM).

It was shown in [1] that the RLM shows two phases, depending on the value of ${\stackrel{~}{\u03f5}}_{d}$, plus logarithmic corrections. More precisely, Shannon entropies appear to collapse with respect to ${\stackrel{~}{\u03f5}}_{d}{\mathrm{log}}^{k}N$, where $k=1$ or $k=2$ depending on the quantity considered. For ${\stackrel{~}{\u03f5}}_{d}{\mathrm{log}}^{k}N\gtrsim 1$, Shannon entropies are independent of the deep temperature ${\u03f5}_{d}$, and take maximal values, indicating that the grammar does not carry information: despite strictly following the rules of a WCFG, sentences are indistinguishable from random noise. For smaller ${\u03f5}_{d}$, entropies drop, and the grammar carries nontrivial information. It is our goal to extract this transition from (17).

It will turn out to be much simpler to consider an alternative model, where the weights ${M}_{abc}$ and ${O}_{aB}$ are Gaussian, rather than lognormal; matching the mean and variance to those of the RLM we can again use the quantities $\overline{M}$ and ${\stackrel{~}{\u03f5}}_{d}$, and similarly for ${O}_{aB}$. The distribution is plotted in Fig.5b. This model has the unphysical feature that weights have a negative tail; naively, we could imagine that this would be unimportant, since the largest weights are most important, but we will have to revise this statement later. We call this the Gaussian model (GM).

We wish to compute

$\overline{\mathrm{log}\mathbb{Z}(\mathcal{G})}=\overline{{{\displaystyle \frac{\partial \mathbb{Z}{(\mathcal{G})}^{n}}{\partial n}}|}_{n=0}}={{\displaystyle \frac{\partial}{\partial n}}|}_{n=0}\overline{\mathbb{Z}{(\mathcal{G})}^{n}},$ | (30) |

where we used the replica method [27]. The fields $\zeta ,\xi ,\eta ,L$, and $R$ are all replicated, adding an index $i=1\mathrm{\dots}n$. To compute $\overline{\mathbb{F}{(\mathcal{G})}^{n}}$ we need the grammar average

$A(\xi ,\eta ,L,R)$ | $=\overline{{e}^{\xi {\sum}_{i,a}{O}_{a}{H}_{a}^{i}+\eta {\sum}_{i,a,b,c}{M}_{abc}{H}_{a}^{i}{L}_{b}^{i}{R}_{c}^{i}}}$ | |||

$={\displaystyle \prod _{a,B}}\overline{{e}^{\xi {O}_{aB}{\sum}_{i}{H}_{a}^{i}}}{\displaystyle \prod _{a,b,c}}\overline{{e}^{\eta {M}_{abc}{\sum}_{i}{H}_{a}^{i}{L}_{b}^{i}{R}_{c}^{i}}}$ | (31) |

Write ${x}_{abc}={\sum}_{i}{\eta}_{i}{H}_{a}^{i}{L}_{b}^{i}{R}_{c}^{i}$. In the RLM the grammar averages over $M$ are of the form ($m=\mathrm{log}{M}_{abc}/\overline{M}$)

$\sqrt{{\stackrel{~}{\u03f5}}_{d}/\pi}{\displaystyle \int \mathit{d}m{e}^{-{\stackrel{~}{\u03f5}}_{d}{m}^{2}}{e}^{{e}^{m}\overline{M}x}}$ | $=\sqrt{{\stackrel{~}{\u03f5}}_{d}/\pi}{\displaystyle \int \mathit{d}m{e}^{-{\stackrel{~}{\u03f5}}_{d}{m}^{2}}\sum _{q\ge 0}\frac{{\overline{M}}^{q}{x}^{q}}{q!}{e}^{qm}}$ | (32) | ||

$={\displaystyle \sum _{q\ge 0}}{\displaystyle \frac{{\overline{M}}^{q}{x}^{q}}{q!}}{e}^{{q}^{2}/(4{\stackrel{~}{\u03f5}}_{d})}$ | (33) |

A term of order $q$ corresponds to the rule appearing $q$ times. We are interested in a transition due to patterns of repeated rule application between sentences, rather than inside them (this would correspond to a transition deeper in the ordered phase). Therefore a priori we expect that we only need connected terms up to a small finite order ${q}_{*}$. Note that at order ${q}_{*}$, terms involving ${q}_{*}$ different replicas will be present. Resumming gives

$\sum _{q\ge 0}}{\displaystyle \frac{{\overline{M}}^{q}{x}^{q}}{q!}}{e}^{{q}^{2}/(4{\stackrel{~}{\u03f5}}_{d})}=\mathrm{exp}\left({\displaystyle \sum _{q\ge 1}}{a}_{q}{\overline{M}}^{q}{x}^{q}\right)$ | (34) |

with ${a}_{1}={e}^{1/(4{\stackrel{~}{\u03f5}}_{d})},{a}_{2}=\frac{1}{2}({e}^{4/(4{\stackrel{~}{\u03f5}}_{d})}-{e}^{2/(4{\stackrel{~}{\u03f5}}_{d})}),$ etc. From this divergent sum we only need to retain terms up to order $q=n(\mathrm{\ell}-m)$, since the integration over $\eta $ will retain only $\mathrm{\ell}-m$ vertices for each replica.

For practical reasons exact calculations are limited to ${q}_{*}=2$. In this case, we consider all derivations in which a rule can appear at most twice in one derivation tree. Note that rules can still appear arbitrarily many times in the set of $n$ replicas and $m$ sentences. Keeping terms to ${q}_{*}=2$ is equivalent to letting the $M$ be drawn from a Gaussian distribution. For appropriate choice of mean and variance, we can thus fix the GM to be equal to the RLM to this order. In the remainder of this work, we will first find the exact solution of the GM, and then discuss its extension to the RLM.

### 4.1 Gaussian model

Applying the same arguments to the integral over $O$, we have for the GM

$A(\xi ,\eta ,L,R)$ | $={\displaystyle \prod _{a,B}}{e}^{{b}_{1}\overline{O}{\sum}_{i}{\xi}_{i}{H}_{a}^{i}+{b}_{2}{\overline{O}}^{2}{({\sum}_{i}{\xi}_{i}{H}_{a}^{i})}^{2}}{\displaystyle \prod _{a,b,c}}{e}^{{a}_{1}\overline{M}{\sum}_{i}{\eta}_{i}{H}_{a}^{i}{L}_{b}^{i}{R}_{c}^{i}+{a}_{2}{\overline{M}}^{2}{({\sum}_{i}{\eta}_{i}{H}_{a}^{i}{L}_{b}^{i}{R}_{c}^{i})}^{2}}$ | |||

$={e}^{{b}_{1}NT\overline{O}{\sum}_{i}{\xi}_{i}{H}_{*}^{i}+{b}_{2}T{\overline{O}}^{2}{\sum}_{i,j}{\xi}_{i}{\xi}_{j}{Q}_{H}^{ij}+{a}_{1}{N}^{3}\overline{M}{\sum}_{i}{\eta}_{i}{H}_{*}^{i}{L}_{*}^{i}{R}_{*}^{i}+{a}_{2}{\overline{M}}^{2}{\sum}_{i,j}{\eta}_{i}{\eta}_{j}{Q}_{H}^{ij}{Q}_{L}^{ij}{Q}_{R}^{ij}}$ | (35) |

where we introduced ‘magnetization’ vectors and overlap matrices

${H}_{*}^{i}={\displaystyle \frac{1}{N}}{\displaystyle \sum _{a}}{H}_{a}^{i},{L}_{*}^{i}={\displaystyle \frac{1}{N}}{\displaystyle \sum _{a}}{L}_{a}^{i},{R}_{*}^{i}={\displaystyle \frac{1}{N}}{\displaystyle \sum _{a}}{R}_{a}^{i},$ | (36) | ||

${Q}_{H}^{ij}={\displaystyle \sum _{a}}{H}_{a}^{i}{H}_{a}^{j},{Q}_{L}^{ij}={\displaystyle \sum _{a}}{L}_{a}^{i}{L}_{a}^{j},{Q}_{R}^{ij}={\displaystyle \sum _{a}}{R}_{a}^{i}{R}_{a}^{j},$ | (37) |

and ${b}_{1}={e}^{1/(4{\stackrel{~}{\u03f5}}_{s})},{b}_{2}=\frac{1}{2}({e}^{4/(4{\stackrel{~}{\u03f5}}_{s})}-{e}^{2/(4{\stackrel{~}{\u03f5}}_{s})})$. (Recall that ${H}_{a}={L}_{a}^{\u2020}+{R}_{a}^{\u2020}$.). Assembling the above results we find that for the GM,

$\overline{\mathbb{Z}{(\mathcal{G})}^{n}}$ | $={\displaystyle \prod _{i}}\left[m!{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d{\zeta}_{i}}{{\zeta}_{i}^{1+m}}}{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d{\xi}_{i}}{{\xi}_{i}^{1+\mathrm{\ell}}}}{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d{\eta}_{i}}{{\eta}_{i}^{1+\mathrm{\ell}-m}}}{\displaystyle \int D{L}^{i}\int D{R}^{i}{e}^{-\frac{1}{g}{\sum}_{a}[{L}_{a}^{i}{L}_{a}^{i}{}^{\u2020}+{R}_{a}^{i}{R}_{a}^{i}{}^{\u2020}]}}\right]$ | ||

$\mathrm{\hspace{1em}\hspace{1em}}{e}^{{\sum}_{i}{\zeta}_{i}h({L}_{1}^{i}+{R}_{1}^{i})}A(\xi ,\eta ,L,R)$ |

This is now in the form amenable to standard treatment by replicas: the overlap matrices can be introduced as new parameters, and the original variables can be integrated out. We notice that the colour indices play the role usually played by spatial indices in spin glasses [27]. The surprising result is that the model can be exactly integrated, without even making an ansatz on the replica structure, and without taking the large $\mathrm{\ell}$ limit. This integrability can be traced to the fact that the overlap matrices depend only on the real and imaginary parts of $L$ in the canonical way, i.e. through ${L}_{a}^{i}=Re[{L}_{a}^{i}]+iIm[{L}_{a}^{i}]$. This gives rise to a symplectic structure that simplifies the integration over these variables. The derivation is sketched in Appendix A. The final result is

$\overline{\mathrm{log}\mathbb{Z}}$ | $=m\mathrm{log}h+(2\mathrm{\ell}-m)\mathrm{log}g+{S}_{\mathrm{\ell},m}+\mathrm{\ell}\mathrm{log}(T\overline{O}{b}_{1})+{\mathrm{\ell}}_{0}\mathrm{log}({N}^{2}\overline{M}{a}_{1})$ | (38) | ||

$+\mathrm{\ell}f({x}_{s})+{\mathrm{\ell}}_{0}f({x}_{d})$ |

with ${\mathrm{\ell}}_{0}=\mathrm{\ell}-m$,

$f(x)={\displaystyle \int \frac{dt}{\sqrt{2\pi}}{e}^{-\frac{1}{2}{t}^{2}}\mathrm{log}\left[1+xt\right]},$ | (39) |

and

${S}_{\mathrm{\ell},m}=\mathrm{log}{\mathrm{\ell}}_{0}!-\mathrm{log}\mathrm{\ell}!+\mathrm{log}(1+\mathrm{\ell}+{\mathrm{\ell}}_{0})!-\mathrm{log}(1+2{\mathrm{\ell}}_{0})!$ | (40) |

Here ${x}_{d}={N}^{-3/2}\sqrt{{e}^{1/(2{\stackrel{~}{\u03f5}}_{d})}-1}$ and ${x}_{s}={(NT)}^{-1/2}\sqrt{{e}^{1/(2{\stackrel{~}{\u03f5}}_{s})}-1}$.

The elements of $\overline{\mathrm{log}\mathbb{Z}}$ are as follows. Those involving $h$ and $g$ give $\mathrm{log}{\mathbb{Z}}_{tree}$, the contribution to $\mathrm{log}\mathbb{Z}$ that weights configurations by the number of sentences and leaves; these are precisely those required from 18. ${S}_{\mathrm{\ell},m}$ is entropic, and apparently counts the number of ways to partition a total string of length $\mathrm{\ell}$ into $m$ sentences. Terms with ${b}_{j}$ and ${a}_{j}$ are energetic, since they depend on the grammar weight distribution. Those involving ${b}_{1}$ and ${a}_{1}$ capture the change in the mean occupancy as temperature is varied. The function $f(x)$ captures the non-trivial effects of correlation between different symbols. This function, which can written in terms of hypergeometric functions, is plotted in Fig.6. It develops an imaginary part for $x\sim \mathcal{O}(1)$, indicating that the unphysical negative probability states are becoming important. For large $N$ the condition ${x}_{d}\lesssim \mathcal{O}(1)$ is equivalent to

${\stackrel{~}{\u03f5}}_{d}\mathrm{log}N\gtrsim 1/6,$ | (41) |

and similarly the condition ${x}_{s}\lesssim \mathcal{O}(1)$ is equivalent to ${\stackrel{~}{\u03f5}}_{s}\mathrm{log}NT\gtrsim 1/2$. These inequalities fix the regime in which the GM is physical.

### 4.2 RLM

We now return to the full model. As discussed above, we cannot obtain the exact solution; however, since a saddle-point method is justified for large $\mathrm{\ell}$, we can consider different ansatze on the form of the solution. Two are natural: (i) the colour-symmetric ansatz ${L}_{a}^{i}={L}^{i},{R}_{a}^{i}={R}^{i}$ $\forall a$, and the (ii) replica-symmetric ansatz ${L}_{a}^{i}={L}_{a},{R}_{a}^{i}={R}_{a}$ $\forall i$. After some calculations similar to those for the GM, we eventually find for either (i) or (ii) the same form (38), except that $f\equiv 0$. Besides the ansatze on the form of $L$ and $R$, we assume that $\mathrm{\ell}$ is large and that the replica limit $n\to 0$ can be taken perturbatively, i.e. keeping terms $\mathcal{O}(n)$ in the action, as in the usual approach [27] ^{5}^{5}
5
Consistency with the GM suggests that we should be able to recover (38) with $f\equiv 0$ for that model, without necessarily taking the limit ${\stackrel{~}{\u03f5}}_{d}\to \mathrm{\infty},{\stackrel{~}{\u03f5}}_{s}\to \mathrm{\infty}$, in which case the $f$ terms are trivially unimportant. Indeed, if instead of taking $n\to 0$ in (49), we look for a saddle-point with large $\mathrm{\ell}$, the saddle-point perturbative in $n$ gives exactly (38) with $f\equiv 0$. This indicates that the function $f$ is nonperturbative in the replica limit.. We now analyze (38) with $f\equiv 0$, with the understanding that this holds in the replica-symmetric regime, whose range of validity is to be determined.

It is convenient to separate $F=-\overline{\mathrm{log}\mathbb{Z}}$ into its entropic and energetic contributions. This can be done exactly because the RLM has a scaling symmetry when the bias $\beta $ is included. Indeed, it is not hard to show that the partition function satisfies the scaling property $\overline{{\mathbb{Z}}^{n}}(\beta ,{\stackrel{~}{\u03f5}}_{d},\overline{M},{\stackrel{~}{\u03f5}}_{s},\overline{O})=\overline{{\mathbb{Z}}^{n}}(1,{\stackrel{~}{\u03f5}}_{d}/{\beta}^{2},{\overline{M}}^{\beta},{\stackrel{~}{\u03f5}}_{s}/{\beta}^{2},{\overline{O}}^{\beta})$ (in abuse of earlier notation). The $\beta -$dependent part of $F$ is

${F}_{\beta}=-\beta \mathrm{\ell}\mathrm{log}\overline{O}-\beta {\mathrm{\ell}}_{0}\mathrm{log}\overline{M}-{\displaystyle \frac{{\beta}^{2}\mathrm{\ell}}{4{\stackrel{~}{\u03f5}}_{s}}}-{\displaystyle \frac{{\beta}^{2}{\mathrm{\ell}}_{0}}{4{\stackrel{~}{\u03f5}}_{d}}}$ | (42) |

so that at $\beta =1$, $E=-\mathrm{\ell}\mathrm{log}\overline{O}-{\mathrm{\ell}}_{0}\mathrm{log}\overline{M}-\frac{\mathrm{\ell}}{2{\stackrel{~}{\u03f5}}_{s}}-\frac{{\mathrm{\ell}}_{0}}{2{\stackrel{~}{\u03f5}}_{d}}$ and the replica-symmetric entropy is

${S}_{RS}={\mathrm{\ell}}_{0}\mathrm{log}(g{N}^{2}/h)+\mathrm{\ell}\mathrm{log}(gTh)-{\displaystyle \frac{\mathrm{\ell}}{4{\stackrel{~}{\u03f5}}_{s}}}-{\displaystyle \frac{{\mathrm{\ell}}_{0}}{4{\stackrel{~}{\u03f5}}_{d}}}+{S}_{\mathrm{\ell},m}$ | (43) |

The entropy cannot be negative; this gives necessary, though perhaps not sufficient, conditions on the regime where the replica-symmetric ansatz is applicable. For simplicity, consider the case $\mathrm{\ell}\to \mathrm{\infty},m/\mathrm{\ell}=\alpha \ll 1$ ($\mathrm{\ell}/m$ is the typical length of a tree; we let it be large). Then one can determine that ${S}_{\mathrm{\ell},m}=\mathcal{O}(\alpha \mathrm{\ell})$ and, for the simulated case where the emission probability is close to $p=1/2$, $g=h\approx 1/\sqrt{8}$. The condition ${S}_{RS}>0$ is approximately equivalent to

$\frac{1}{4{\stackrel{~}{\u03f5}}_{d}}}+{\displaystyle \frac{1}{4{\stackrel{~}{\u03f5}}_{s}}}\lesssim \mathrm{log}({N}^{2}T/8)$ | (44) |

Our main concern is the emergence of deep structure, which does not depend on what happens at the surface of the tree. In the limit ${\stackrel{~}{\u03f5}}_{s}\to \mathrm{\infty}$, this becomes ${\stackrel{~}{\u03f5}}_{d}\mathrm{log}({N}^{2}T/8)\gtrsim 1/4$, very similar to the regime in which the GM is physical, 41.

### 4.3 Order parameter

Finally, we return to the order parameter ${Q}_{2}$ that measures deep structure. Let us first give a heuristic derivation of its value in the RLM. We need to compute ${q}_{0}=\overline{{\u27e8{\pi}_{abc}\u27e9}^{2}}$ and ${q}_{1}=\overline{\u27e8{\pi}_{abc}\u27e9\u27e8{\pi}_{a{b}^{\prime}{c}^{\prime}}\u27e9}$, where ${\pi}_{abc}$ is the occupancy of the rule $a\to bc$. Using (7) and (13) in the diagrammatic representation, one can see that ${\pi}_{abc}={e}^{{h}_{abc}}\eta {M}_{abc}{H}_{a}{L}_{b}{R}_{c}$. The ${\pi}_{abc}$ satisfy the sum rule ${\sum}_{abc}{\pi}_{abc}={\mathrm{\ell}}_{0}$ and so have a mean value $\overline{\pi}={\mathrm{\ell}}_{0}/{N}^{3}$. The occupancies are positively correlated with the grammar weights, since rules with higher weights are sampled more frequently. A crude estimate is then $\u27e8{\pi}_{abc}\u27e9/\overline{\pi}\sim {M}_{abc}/(\overline{M}{a}_{1})$ (the mean value of a weight is $\overline{M}{a}_{1}$). This leads to the estimate

${Q}_{2}$ | $\sim {\displaystyle \frac{{N}^{5}({N}^{2}-1)}{{\mathrm{\ell}}_{0}^{2}}}{\displaystyle \frac{{\mathrm{\ell}}_{0}^{2}}{{N}^{6}{\overline{M}}^{2}{a}_{1}^{2}}}\overline{{M}_{abc}^{2}-{M}_{abc}{M}_{a{b}^{\prime}{c}^{\prime}}}$ | |||

$={\displaystyle \frac{{N}^{2}-1}{N}}\left({e}^{1/(2{\stackrel{~}{\u03f5}}_{d})}-1\right),$ | (45) |

which indicates that order increases as ${\u03f5}_{d}$ is lowered simply because the weight variance increases. ${Q}_{2}$ can be computed more precisely using replicas. After a long computation, assuming the replica symmetric ansatz and taking large $\mathrm{\ell}$, one eventually finds exactly the same result, (45). Thus this simple expression is in fact the genuine replica symmetric result; it is plotted in Fig.2, where it is compared with numerical data ^{6}^{6}
6
These data have been obtained by the same methods as described in [1]. Here we have simulated many more samples ($\mathrm{\ell}\sim {10}^{5}$ compared to $\mathrm{\ell}\sim {10}^{3}$ in that work) in order to resolve the large ${\u03f5}_{d}$ part of the curve.. In the large $\mathrm{\ell}$ limit, it matches quantitatively, without fitting parameters, above ${\stackrel{~}{\u03f5}}_{d}\mathrm{log}N\gtrsim 1$. For smaller ${\stackrel{~}{\u03f5}}_{d}$, the data asymptote, as they must, while the replica-symmetric prediction diverges.

## 5 Conclusion

We showed that the partition function for weighted context-free grammars has a convenient diagrammatic representation. For individual grammars, the behavior of a large text $\mathrm{\ell}\gg 1$ is governed by saddle-point equations, which resemble belief-propagation equations [26].

We then considered two ensembles of grammars, which are equivalent in a large temperature limit. The Gaussian model (GM) was solved exactly, and shown to become unphysical for ${\stackrel{~}{\u03f5}}_{d}\mathrm{log}N\lesssim 1/6$. For the random language model (RLM), previously simulated in [1], the partition function was computed in the replica-symmetric ansatz; the entropy becomes negative at low temperature, again depending essentially on the quantity ${\stackrel{~}{\u03f5}}_{d}\mathrm{log}N$. Finally, the order parameter ${Q}_{2}$ was computed in the replica-symmetric ansatz. The prediction quantitatively agrees with simulations above ${\stackrel{~}{\u03f5}}_{d}\mathrm{log}N\gtrsim 1$. These results indicate that replica-symmetry must be broken in the nontrivial low-temperature phase.

The RLM bears some similarity to a spin-glass on the Bethe lattice, a difficult problem that is still not fully understood [28, 29]. Indeed, both problems can be generated by a diagrammatic method [17], and in both problems one finds that overlaps of all orders are needed to compute the partition function [30, 28, 24, 29]. However, for the spin-glass, one can perform an expansion around the mean-field limit [30, 28, 24], which is the Sherrington-Kirkpatrick model solved by Parisi [27]. Naïvely, the analogue to the SK model would be the Gaussian model, which we solved above. However, we showed that this model does not break the replica symmetry. This is related to a gauge symmetry in the diagrammatic formulation. It is therefore an open question whether there is a more primitive model that captures the essence of random languages in the low-temperature phase, and remains solvable.

Finally, we have focussed here on context-free grammars, for which derivations are trees. The next level up in the Chomsky hierarchy are context-sensitive grammars. A theorem of Kuroda [31] says that it is sufficient to add rules of the form $ab\to cd$ to those above to model all context-sensitive grammars. Clearly this will add a quartic vertex to our (16), which is not in itself a difficulty. However, well-formed derivations must be represented by planar diagrams, so that the order of symbols is preserved in the derivation. Generating random planar graphs that are not trees requires matrices as integration variables; this strongly suggests that general grammars require the full machinery of complex matrix models.

## References

- [1] DeGiuli E 2019 Phys. Rev. Lett. 122 128301
- [2] Loreto V and Steels L 2007 Nature Physics 3 758
- [3] Loreto V, Baronchelli A, Mukherjee A, Puglisi A and Tria F 2011 Journal of Statistical Mechanics: Theory and Experiment 2011 P04006 1742–5468
- [4] Burridge J 2017 Physical Review X 7 031008
- [5] Carnie A 2013 Syntax: A generative introduction (John Wiley & Sons, Ltd.)
- [6] Chomsky N 2002 Syntactic structures (Berlin: Walter de Gruyter)
- [7] Hopcroft J E, Motwani R and Ullman J D 2007 Introduction to automata theory, languages, and computation 3rd ed (Boston, Ma: Pearson)
- [8] Lin H W and Tegmark M 2017 Entropy 19 299
- [9] Piattelli-Palmarini M and Vitiello G 2015 arXiv preprint arXiv:1506.08663
- [10] Gallego A J and Orus R 2017 arXiv preprint arXiv:1708.01525
- [11] Pestun V and Vlassopoulos Y 2017 arXiv preprint arXiv:1710.10248
- [12] Searls D B 2002 Nature 420 211
- [13] Di Francesco P, Ginsparg P and Zinn-Justin J 1995 Physics Reports 254 1–133
- [14] Kazakov V 1986 Physics Letters A 119 140–144
- [15] Boulatov D and Kazakov V 1987 Physics Letters B 186 379–384 0370–2693
- [16] Bachas C, De Calan C and Petropoulos P 1994 Journal of Physics A: Mathematical and General 27 6121
- [17] Baillie C, Janke W, Johnston D and Plecháč P 1995 Nuclear Physics B 450 730–752
- [18] Bessis D, Itzykson C and Zuber J B 1980 Advances in Applied Mathematics 1 109–157
- [19] t’Hooft G 1974 Nuclear Physics. B 72 461–473
- [20] Brezin E, Itzykson C, Parisi G and Zuber J 1978 Commun. math. Phys 59 35–51
- [21] Le Guillou J C and Zinn-Justin J 2012 Large-order behaviour of perturbation theory vol 7 (Elsevier)
- [22] Di Francesco P 2006 2D quantum gravity, matrix models and graph combinatorics (Springer) pp 33–88
- [23] Eynard B 2016 Counting surfaces (Springer)
- [24] Parisi G and Tria F 2002 The European Physical Journal B-Condensed Matter and Complex Systems 30 533–541
- [25] Flajolet P and Sedgewick R 2009 Analytic combinatorics (cambridge University press)
- [26] Mezard M and Montanari A 2009 Information, physics, and computation (Oxford University Press)
- [27] Mézard M, Parisi G and Virasoro M 1987 Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications vol 9 (World Scientific Publishing Company)
- [28] Mézard M and Parisi G 2001 The European Physical Journal B-Condensed Matter and Complex Systems 20 217–233
- [29] Parisi G 2017 Journal of Statistical Physics 167 515–542 0022–4715
- [30] De Dominicis C and Goldschmidt Y 1989 Journal of Physics A: Mathematical and General 22 L775
- [31] Kuroda S Y 1964 Information and Control 7 207–223
- [32] Mackey D S and Mackey N 2003 On the determinant of symplectic matrices (Manchester Centre for Computational Mathematics)

## Appendix A Solution of Gaussian model

We introduce ${Q}_{L}^{ij}={\sum}_{a}{L}_{a}^{i}{L}_{a}^{j}$ as a new variable with a corresponding momentum ${\mathrm{\Lambda}}_{L}^{ij}$, and similarly for ${Q}_{R}^{ij}={\sum}_{a}{R}_{a}^{i}{R}_{a}^{j}$ and ${P}^{ij}={\sum}_{a}{L}_{a}^{i}{R}_{a}^{j}$, with conjugate momenta ${\mathrm{\Lambda}}_{R}$ and ${\mathrm{\Lambda}}_{P}$, respectively. Let us write ${x}_{a}^{i}=Re[{L}_{a}^{i}],{y}_{a}^{i}=Im[{L}_{a}^{i}]$. The variables $\{{x}_{a}^{i},{y}_{a}^{i}\}$ are Gaussian, with a coupling matrix diagonal in colour. For each $a$, the coupling matrix is a $2n\times 2n$ matrix acting on $({x}_{a}^{1},\mathrm{\dots}{x}_{a}^{n},{y}_{a}^{1},\mathrm{\dots},{y}_{a}^{n})$,

$Z=\left[\begin{array}{cc}\hfill \widehat{\delta}-g{\widehat{\mathrm{\Lambda}}}_{L}\hfill & \hfill -gi{\widehat{\mathrm{\Lambda}}}_{L}\hfill \\ \hfill -gi{\widehat{\mathrm{\Lambda}}}_{L}\hfill & \hfill \widehat{\delta}+g{\widehat{\mathrm{\Lambda}}}_{L}\hfill \end{array}\right],$ | (46) |

where $\widehat{\delta}$ is the $n\times n$ identity matrix. It is easily verified that $Z$ is complex symplectic: $J={Z}^{t}\cdot J\cdot Z$, where

$J=\left[\begin{array}{cc}\hfill 0\hfill & \hfill \widehat{\delta}\hfill \\ \hfill -\widehat{\delta}\hfill & \hfill 0\hfill \end{array}\right]$ | (47) |

This implies that ${Z}^{-1}={J}^{-1}\cdot {Z}^{t}\cdot J$ and, less obviously, $|Z|=1$ [32]. Hence after integrating out $\{{L}_{a}^{i}\}$ there is no nontrivial entropic term from $\mathrm{log}|Z|$, nor does there appear the inverse of ${\mathrm{\Lambda}}_{L}$, as would naively be expected. In fact, after integrating out $\{{L}_{a}^{i}\}$ and $\{{R}_{a}^{i}\}$ the action remains linear in ${\mathrm{\Lambda}}_{L}$, ${\mathrm{\Lambda}}_{R}$, and ${\mathrm{\Lambda}}_{P}$. Hence these can be immediately integrated out and we find that

${Q}_{L}^{ij}=N{L}_{*}^{i}{L}_{*}^{j},{Q}_{R}^{ij}=N{R}_{*}^{i}{R}_{*}^{j},{P}^{ij}=N{L}_{*}^{i}{R}_{*}^{j}$ | (48) |

We find

$\overline{\mathbb{Z}{(\mathcal{G})}^{n}}$ | $={\displaystyle \prod _{i}}\left[m!{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d{\zeta}_{i}}{{\zeta}_{i}^{1+m}}}{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d{\xi}_{i}}{{\xi}_{i}^{1+\mathrm{\ell}}}}{\displaystyle {\oint}^{\prime}}{\displaystyle \frac{d{\eta}_{i}}{{\eta}_{i}^{1+\mathrm{\ell}-m}}}{\displaystyle \int \mathit{d}{L}_{*}^{i}\int \mathit{d}{R}_{*}^{i}{e}^{-\frac{N}{g}{\sum}_{a}[{L}_{*}^{i}{L}_{*}^{i}{}^{\u2020}+{R}_{*}^{i}{R}_{*}^{i}{}^{\u2020}]}}\right]$ | ||

$\mathrm{\hspace{1em}\hspace{1em}}{e}^{{\sum}_{i}{\zeta}_{i}h({L}_{*}^{i}+{R}_{*}^{i})}{e}^{{\stackrel{~}{b}}_{1}{\sum}_{i}{\xi}_{i}{H}_{*}^{i}+{\stackrel{~}{b}}_{2}{\left({\sum}_{i}{\xi}_{i}{H}_{*}^{i}\right)}^{2}+{\stackrel{~}{a}}_{1}{\sum}_{i}{\eta}_{i}{H}_{*}^{i}{L}_{*}^{i}{R}_{*}^{i}+{\stackrel{~}{a}}_{2}{\left({\sum}_{i}{\eta}_{i}{H}_{*}^{i}{L}_{*}^{i}{R}_{*}^{i}\right)}^{2}}$ |

with ${\stackrel{~}{b}}_{j}=NT{\overline{O}}^{j}{b}_{j},{\stackrel{~}{a}}_{j}={N}^{3}{\overline{M}}^{j}{a}_{j}$, $j=1,2$. The quadratic terms can be linearized with a Hubbard-Stratonovich transformation, after which the integrals over $\zeta ,\xi $, and $\eta $ are simple. The result is

$\overline{\mathbb{Z}{(\mathcal{G})}^{n}}$ | $={\displaystyle \frac{1}{2\pi}}{\displaystyle \int \mathit{d}p{e}^{-\frac{1}{2}{p}^{2}}\int \mathit{d}q{e}^{-\frac{1}{2}{q}^{2}}{\left[\int \mathit{d}L\int \mathit{d}RK(L,R,p,q)\right]}^{n}}$ | (49) |

with

$K(L,R,p,q)$ | $={\displaystyle \frac{{h}^{m}{(\pi g/N)}^{-2}}{\mathrm{\ell}!(\mathrm{\ell}-m)!}}{\left({\stackrel{~}{b}}_{1}+p\sqrt{2{\stackrel{~}{b}}_{2}}\right)}^{\mathrm{\ell}}{\left({\stackrel{~}{a}}_{1}+q\sqrt{2{\stackrel{~}{a}}_{2}}\right)}^{\mathrm{\ell}-m}$ | |||

$\mathrm{\hspace{1em}\hspace{1em}}{e}^{-\frac{N}{g}\left[L{L}^{\u2020}+R{R}^{\u2020}\right]}{\left({L}^{\u2020}+{R}^{\u2020}\right)}^{2\mathrm{\ell}-m}{(LR)}^{\mathrm{\ell}-m}$ | (50) |

The calculation is thus reduced to an effective single-colour problem. We have

$\frac{{h}^{m}{(\pi g/N)}^{-2}}{\mathrm{\ell}!(\mathrm{\ell}-m)!}}{\displaystyle \int \mathit{d}L\int \mathit{d}R{e}^{-\frac{N}{g}\left[L{L}^{\u2020}+R{R}^{\u2020}\right]}{\left({L}^{\u2020}+{R}^{\u2020}\right)}^{2\mathrm{\ell}-m}{(LR)}^{\mathrm{\ell}-m}$ | |||

$\mathrm{\hspace{1em}\hspace{1em}}={\displaystyle \frac{{h}^{m}}{\mathrm{\ell}!(\mathrm{\ell}-m)!}}{(g/N)}^{2\mathrm{\ell}-m}{\displaystyle \sum _{k=0}^{m}}\left({\displaystyle \genfrac{}{}{0pt}{}{m}{k}}\right)(\mathrm{\ell}-m+k)!(\mathrm{\ell}-k)!$ | |||

$\mathrm{\hspace{1em}\hspace{1em}}={\displaystyle \frac{{h}^{m}}{\mathrm{\ell}!}}{(g/N)}^{2\mathrm{\ell}-m}(\mathrm{\ell}-m)!{\displaystyle \frac{(1+2\mathrm{\ell}-m)!}{(1+2\mathrm{\ell}-2m)!}}$ | (51) |

so that the final result is

$\overline{\mathrm{log}\mathbb{Z}}$ | $=m\mathrm{log}h+(2\mathrm{\ell}-m)\mathrm{log}(g/N)+{S}_{\mathrm{\ell},m}+\mathrm{\ell}\mathrm{log}{\stackrel{~}{b}}_{1}+(\mathrm{\ell}-m)\mathrm{log}{\stackrel{~}{a}}_{1}$ | |||

$+\mathrm{\ell}f\left(\sqrt{2{\stackrel{~}{b}}_{2}/{\stackrel{~}{b}}_{1}^{2}}\right)+(\mathrm{\ell}-m)f\left(\sqrt{2{\stackrel{~}{a}}_{2}/{\stackrel{~}{a}}_{1}^{2}}\right)$ | (52) |

with

$f(x)={\displaystyle \int \frac{dt}{\sqrt{2\pi}}{e}^{-\frac{1}{2}{t}^{2}}\mathrm{log}\left[1+xt\right]}$ | (53) |

and ${S}_{\mathrm{\ell},m}=\mathrm{log}(\mathrm{\ell}-m)!-\mathrm{log}\mathrm{\ell}!+\mathrm{log}(1+2\mathrm{\ell}-m)!-\mathrm{log}(1+2\mathrm{\ell}-2m)!$. Letting ${x}_{d}=\sqrt{2{\stackrel{~}{a}}_{2}/{\stackrel{~}{a}}_{1}^{2}}={N}^{-3/2}\sqrt{{e}^{1/(2{\stackrel{~}{\u03f5}}_{d})}-1}$ and ${x}_{s}={(NT)}^{-1/2}\sqrt{{e}^{1/(2{\stackrel{~}{\u03f5}}_{s})}-1}$, we have the result shown in the main text.