Abstract
Recurrent neural network grammars (RNNG) are generative models of languagewhich jointly model syntax and surface structure by incrementally generating asyntax tree and sentence in a topdown, lefttoright order. Supervised RNNGsachieve strong language modeling and parsing performance, but require anannotated corpus of parse trees. In this work, we experiment with unsupervisedlearning of RNNGs. Since directly marginalizing over the space of latent treesis intractable, we instead apply amortized variational inference. To maximizethe evidence lower bound, we develop an inference network parameterized as aneural CRF constituency parser. On language modeling, unsupervised RNNGsperform as well their supervised counterparts on benchmarks in English andChinese. On constituency grammar induction, they are competitive with recentneural language models that induce tree structures from words through attentionmechanisms.
Quick Read (beta)
Unsupervised Recurrent Neural Network Grammars
Abstract
Recurrent neural network grammars (RNNG) are generative models of language which jointly model syntax and surface structure by incrementally generating a syntax tree and sentence in a topdown, lefttoright order. Supervised RNNGs achieve strong language modeling and parsing performance, but require an annotated corpus of parse trees. In this work, we experiment with unsupervised learning of RNNGs. Since directly marginalizing over the space of latent trees is intractable, we instead apply amortized variational inference. To maximize the evidence lower bound, we develop an inference network parameterized as a neural CRF constituency parser. On language modeling, unsupervised RNNGs perform as well their supervised counterparts on benchmarks in English and Chinese. On constituency grammar induction, they are competitive with recent neural language models that induce tree structures from words through attention mechanisms.
section§#2#1#3 \crefformatsubsection§#2#1#3 \crefformatsubsubsection§#2#1#3 \usetikzlibrarypatterns \algrenewcommand\algorithmicindent1em \algrenewcommand\[email protected]
Unsupervised Recurrent Neural Network Grammars
Yoon Kim${}^{\mathrm{\u2020}}$ Alexander M. Rush${}^{\mathrm{\u2020}}$ Lei Yu${}^{\mathrm{\u25c7}}$ Adhiguna Kuncoro${}^{\mathrm{\u2021}\mathrm{,}\mathrm{\u25c7}}$ Chris Dyer${}^{\mathrm{\u25c7}}$ Gábor Melis${}^{\mathrm{\u25c7}}$ ${}^{\u2020}$Harvard University ${}^{\u2021}$University of Oxford ${}^{\u25c7}$DeepMind {yoonkim,srush}@seas.harvard.edu {leiyu,akuncoro,cdyer,melisgl}@google.com
1 Introduction
^{†}^{†} Work done while the first author was an intern at DeepMind. Code available at https://github.com/harvardnlp/urnngRecurrent neural network grammars (RNNGs) (Dyer et al., 2016) model sentences by first generating a nested, hierarchical syntactic structure which is used to construct a context representation to be conditioned upon for upcoming words. Supervised RNNGs have been shown to outperform standard sequential language models, achieve excellent results on parsing (Dyer et al., 2016; Kuncoro et al., 2017), better encode syntactic properties of language (Kuncoro et al., 2018), and correlate with electrophysiological responses in the human brain (Hale et al., 2018). However, these all require annotated syntactic trees for training. In this work, we explore unsupervised learning of recurrent neural network grammars for language modeling and grammar induction.
The standard setup for unsupervised structure learning is to define a generative model ${p}_{\theta}(\mathbf{x},\mathbf{z})$ over observed data $\mathbf{x}$ (e.g. sentence) and unobserved structure $\mathbf{z}$ (e.g. parse tree, partofspeech sequence), and maximize the log marginal likelihood $\mathrm{log}{p}_{\theta}(\mathbf{x})=\mathrm{log}{\sum}_{\mathbf{z}}{p}_{\theta}(\mathbf{x},\mathbf{z})$. Successful approaches to unsupervised parsing have made strong conditional independence assumptions (e.g. contextfreeness) and employed auxiliary objectives (Klein and Manning, 2002) or priors (Johnson et al., 2007). These strategies imbue the learning process with inductive biases that guide the model to discover meaningful structures while allowing tractable algorithms for marginalization; however, they come at the expense of language modeling performance, particularly compared to sequential neural models that make no independence assumptions.
Like RNN language models, RNNGs make no independence assumptions. Instead they encode structural bias through operations that compose linguistic constituents. The lack of independence assumptions contributes to the strong language modeling performance of RNNGs, but make unsupervised learning challenging. First, marginalization is intractable. Second, the biases imposed by the RNNG are relatively weak compared to those imposed by models like PCFGs. There is little pressure for nontrivial tree structure to emerge during unsupervised RNNG (URNNG) learning.
In this work, we explore a technique for handling intractable marginalization while also injecting inductive bias. Specifically we employ amortized variational inference (Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014) with a structured inference network. Variational inference lets us tractably optimize a lower bound on the log marginal likelihood, while employing a structured inference network encourages nontrivial structure. In particular, a conditional random field (CRF) constituency parser (Finkel et al., 2008; Durrett and Klein, 2015), which makes significant independence assumptions, acts as a guide on the generative model to learn meaningful trees through regularizing the posterior (Ganchev et al., 2010).
We experiment with URNNGs on English and Chinese and observe that they perform well as language models compared to their supervised counterparts and standard neural LMs. In terms of grammar induction, they are competitive with recentlyproposed neural architectures that discover treelike structures through gated attention (Shen et al., 2018). Our results, along with other recent work on joint language modeling/structure learning with deep networks (Shen et al., 2018, 2019; Wiseman et al., 2018; Kawakami et al., 2018), suggest that it is possible to learn generative models of language that model the underlying data well (i.e. assign high likelihood to heldout data) and at the same time induce meaningful linguistic structure.
2 Unsupervised Recurrent Neural Network Grammars
We use $\mathbf{x}=[{x}_{1},\mathrm{\dots},{x}_{T}]$ to denote a sentence of length $T$, and $\mathbf{z}\in {\mathcal{Z}}_{T}$ to denote an unlabeled binary parse tree over a sequence of length $T$, represented as a binary vector of length $2T1$. Here 0 and 1 correspond to shift and reduce actions, explained below.^{1}^{1} 1 The cardinality of ${\mathcal{Z}}_{T}\subset {\{0,1\}}^{2T1}$ is given by the $(T1)$th Catalan number, ${\mathcal{Z}}_{T}=\frac{(2T2)!}{T!(T1)!}$. Figure 1 presents an overview of our approach.
2.1 Generative Model
An RNNG defines a joint probability distribution ${p}_{\theta}(\mathbf{x},\mathbf{z})$ over sentences $\mathbf{x}$ and parse trees $\mathbf{z}$. We consider a simplified version of the original RNNG (Dyer et al., 2016) by ignoring constituent labels and only considering binary trees. The RNNG utilizes an RNN to parameterize a stack data structure (Dyer et al., 2015) of partiallycompleted constituents to incrementally build the parse tree while generating terminals. Using the current stack representation, the model samples an action (shift or reduce): shift generates a terminal symbol, i.e. word, and shifts it onto the stack,^{2}^{2} 2 A better name for shift would be generate (as in Dyer et al. (2016)), but we use shift to emphasize similarity with the shiftreduce parsing. reduce pops the last two elements off the stack, composes them, and shifts the composed representation onto the stack.
Formally, let $S=[(\mathrm{\U0001d7ce},\mathrm{\U0001d7ce})]$ be the initial stack. Each item of the stack will be a pair, where the first element is the hidden state of the stack LSTM, and the second element is an input vector, described below. We use $top(S)$ to refer to the top pair in the stack. The $push$ and $pop$ operations are defined imperatively in the usual way. At each time step, the next action ${z}_{t}$ (shift or reduce) is sampled from a Bernoulli distribution parameterized in terms of the current stack representation. Letting $({\mathbf{h}}_{\text{prev}},{\mathbf{g}}_{\text{prev}})=top(S)$, we have
$${z}_{t}\sim Bernoulli({p}_{t}),{p}_{t}=\sigma ({\mathbf{w}}^{\top}{\mathbf{h}}_{\text{prev}}+b).$$ 
Subsequent generation depend on ${z}_{t}$:

•
If ${z}_{t}=0$ (shift), the model first generates a terminal symbol via sampling from a categorical distribution whose parameters come from an affine transformation and a softmax,
$$x\sim softmax({\mathrm{\mathbf{W}\mathbf{h}}}_{\text{prev}}+\mathbf{b}).$$ Then the generated terminal is shifted onto the stack using a stack LSTM,
${\mathbf{h}}_{\text{next}}=LSTM({\mathbf{e}}_{x},{\mathbf{h}}_{\text{prev}}),$ $push(S,({\mathbf{h}}_{\text{next}},{\mathbf{e}}_{x})),$ where ${\mathbf{e}}_{x}$ is the word embedding for $x$.

•
If ${z}_{t}=1$ (reduce), we pop the last two elements off the stack,
$({\mathbf{h}}_{r},{\mathbf{g}}_{r})=pop(S),({\mathbf{h}}_{l},{\mathbf{g}}_{l})=pop(S),$ and obtain a new representation that combines the left/right constituent representations using a tree LSTM (Tai et al., 2015; Zhu et al., 2015),
$${\mathbf{g}}_{\text{new}}=TreeLSTM({\mathbf{g}}_{l},{\mathbf{g}}_{r}).$$ Note that we use ${\mathbf{g}}_{l}$ and ${\mathbf{g}}_{r}$ to obtain the new representation instead of ${\mathbf{h}}_{l}$ and ${\mathbf{h}}_{r}$.^{3}^{3} 3 The update equations for the tree LSTM (and the stack LSTM) also involve cell states in addition to the hidden states. To reduce notational clutter we do not explicitly show the cell states and instead subsume them into $\mathbf{g}$. If one (or both) of the inputs to the tree LSTM is a word embedding, the associated cell state is taken to be zero. See Tai et al. (2015) for the exact parameterization. We then update the stack using ${\mathbf{g}}_{\text{new}}$,
$({\mathbf{h}}_{\text{prev}},{\mathbf{g}}_{\text{prev}})=top(S),$ ${\mathbf{h}}_{\text{new}}=LSTM({\mathbf{g}}_{\text{new}},{\mathbf{h}}_{\text{prev}}),$ $push(S,({\mathbf{h}}_{\text{new}},{\mathbf{g}}_{\text{new}})).$
The generation process continues until an endofsentence symbol is generated. The parameters $\theta $ of the generative model are $\mathbf{w},b,\mathbf{W},\mathbf{b}$, and the parameters of the stack/tree LSTMs. For a sentence $\mathbf{x}=[{x}_{1},\mathrm{\dots},{x}_{T}]$ of length $T$, the binary parse tree is given by the binary vector $\mathbf{z}=[{z}_{1},\mathrm{\dots},{z}_{2T1}]$.^{4}^{4} 4 As it stands, the support of $\mathbf{z}$ is ${\{0,1\}}^{2T1}$, all binary vectors of length $2T1$. To restrict our distribution to ${\mathcal{Z}}_{T}$ (binary vectors which describe valid trees), we constrain ${z}_{t}$ to be valid at each time step, which amounts to deterministically choosing ${z}_{t}=0$ (shift) if there are fewer than two elements (not counting the initial zero tuple) on the stack. The joint log likelihood decomposes as a sum of terminal/action log likelihoods,
$\mathrm{log}{p}_{\theta}($  $$  
$$  (1) 
where $$ refers to all actions before generating the $t$th word, and similarly $$ refers to all words generated before taking the $j$th action. For brevity, from here on we will use $\mathrm{log}{p}_{\theta}(\mathbf{x}\mathbf{z})$ to refer to the first term (terminal log likelihood) and $$ to refer to the second term (action log likelihood) in the above decomposition.^{5}^{5} 5 The action log likelihood is the sum of log conditional priors, which is obviously different from the unconditional log prior $\mathrm{log}{p}_{\theta}(\mathbf{z})=\mathrm{log}{\sum}_{\mathbf{x}}{p}_{\theta}(\mathbf{x},\mathbf{z})$.
In the supervised case where groundtruth $\mathbf{z}$ is available, we can straightforwardly perform gradientbased optimization to maximize the joint log likelihood $\mathrm{log}{p}_{\theta}(\mathbf{x},\mathbf{z})$. In the unsupervised case, the standard approach is to maximize the log marginal likelihood,
$$\mathrm{log}{p}_{\theta}(\mathbf{x})=\mathrm{log}\sum _{{\mathbf{z}}^{\prime}\in {\mathcal{Z}}_{T}}{p}_{\theta}(\mathbf{x},{\mathbf{z}}^{\prime}).$$ 
However this summation is intractable because ${z}_{t}$ fully depends on all previous actions $[{z}_{1},\mathrm{\dots},{z}_{t1}]$. Even if this summation were tractable, it is not clear that meaningful latent structures would emerge given the lack of explicit independence assumptions in the RNNG (e.g. it is clearly not contextfree). We handle these issues with amortized variational inference.
2.2 Amortized Variational Inference
Amortized variational inference (Kingma and Welling, 2014) defines a trainable inference network $\varphi $ that parameterizes ${q}_{\varphi}(\mathbf{z}\mathbf{x})$, a variational posterior distribution, in this case over parse trees $\mathbf{z}$ given the sentence $\mathbf{x}$. This distribution is used to form an evidence lower bound (ELBO) on the log marginal likelihood,
$$ELBO(\theta ,\varphi ;\mathbf{x})={\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}\frac{{p}_{\theta}(\mathbf{x},\mathbf{z})}{{q}_{\varphi}(\mathbf{z}\mathbf{x})}\right].$$ 
We maximize the ELBO with respect to both model parameters $\theta $ and inference network parameters $\varphi $. The ELBO is still intractable to calculate exactly, but this formulation will allow us to obtain unbiased gradient estimators based on Monte Carlo sampling.
Observe that rearranging the ELBO gives the following optimization problem,
$$\underset{\theta ,\varphi}{\mathrm{max}}\mathrm{log}{p}_{\theta}(\mathbf{x})KL[{q}_{\varphi}(\mathbf{z}\mathbf{x})\parallel {p}_{\theta}(\mathbf{z}\mathbf{x})].$$ 
Thus, $\varphi $ is trained to match the variational posterior ${q}_{\varphi}(\mathbf{z}\mathbf{x})$ to the true posterior ${p}_{\theta}(\mathbf{z}\mathbf{x})$, but $\theta $ is also trained to match the true posterior to the variational posterior. Indeed, there is some evidence to suggest that generative models trained with amortized variational inference (i.e. variational autoencoders) learn posterior distributions that are close to the variational family (Cremer et al., 2018).
We can use this to our advantage with an inference network that injects inductive bias. We propose to do this by using a contextfree model for the inference network, in particular, a neural CRF parser (Durrett and Klein, 2015). This choice can seen as a form of posterior regularization that limits posterior flexibility of the overly powerful RNNG generative model.^{6}^{6} 6 While it has a similar goal, this formulation differs the from posterior regularization as formulated by Ganchev et al. (2010), which constrains the distributional family via linear constraints on posterior expectations. In our case, the conditional independence assumptions in the CRF lead to a curved exponential family where the vector of natural parameters has fewer dimensions than the vector of sufficient statistics of the full exponential family. This curved exponential family is a subset of the marginal polytope of the full exponential family, but it is an intersection of both linear and nonlinear manifolds, and therefore cannot be characterized through linear constraints over posterior expectations.${}^{,}$^{7}^{7} 7 In preliminary experiments, we also attempted to learn latent trees with a transitionbased parser (which does not make explicit independence assumptions) that looks at the entire sentence. However we found that under this setup, the inference network degenerated into a local minimum whereby it always generated leftbranching trees despite various optimization strategies. Williams et al. (2018) observe a similar phenomenon in the context of learning latent trees for classification tasks. However Li et al. (2019) find that it is possible use a transitionbased parser as the inference network for dependency grammar induction, if the inference network is constrained via posterior regularization (Ganchev et al., 2010) based on universal syntactic rules (Naseem et al., 2010).
The parameterization of span scores is similar to recent works (Wang and Chang, 2016; Stern et al., 2017; Kitaev and Klein, 2018): we add position embeddings to word embeddings and run a bidirectional LSTM over the input representations to obtain the forward $[{\overrightarrow{\mathbf{h}}}_{1},\mathrm{\dots},{\overrightarrow{\mathbf{h}}}_{T}]$ and backward $[{\overleftarrow{\mathbf{h}}}_{1},\mathrm{\dots},{\overleftarrow{\mathbf{h}}}_{T}]$ hidden states. The score ${s}_{ij}\in \mathbb{R}$ for a constituent spanning ${x}_{i}$ to ${x}_{j}$ is given by,
$${s}_{ij}=MLP([{\overrightarrow{\mathbf{h}}}_{j+1}{\overrightarrow{\mathbf{h}}}_{i};{\overleftarrow{\mathbf{h}}}_{i1}{\overleftarrow{\mathbf{h}}}_{j}]).$$ 
Letting $\mathbf{B}$ be the binary matrix representation of a tree (${\mathbf{B}}_{ij}=1$ means there is a constituent spanning ${x}_{i}$ and ${x}_{j}$), the CRF parser defines a distribution over binary trees via the Gibbs distribution,
$${q}_{\varphi}(\mathbf{B}\mathbf{x})=\frac{1}{{Z}_{T}(\mathbf{x})}\mathrm{exp}\left(\sum _{i\le j}{\mathbf{B}}_{ij}{s}_{ij}\right),$$ 
where ${Z}_{T}(\mathbf{x})$ is the partition function,
$${Z}_{T}(\mathbf{x})=\sum _{{\mathbf{B}}^{\prime}\in {\mathcal{B}}_{T}}\mathrm{exp}\left(\sum _{i\le j}{\mathbf{B}}_{ij}^{\prime}{s}_{ij}\right),$$ 
and $\varphi $ denotes the parameters of the inference network (i.e. the bidirectional LSTM and the MLP). Calculating ${Z}_{T}(\mathbf{x})$ requires a summation over an exponentiallysized set ${\mathcal{B}}_{T}\subset {\{0,1\}}^{T\times T}$, the set of all binary trees over a length $T$ sequence. However we can perform the summation in $O({T}^{3})$ using the inside algorithm (Baker, 1979), shown in Algorithm 2.2. This computation is itself differentiable and amenable to gradientbased optimization. Finally, letting $f:{\mathcal{B}}_{T}\to {\mathcal{Z}}_{T}$ be the bijection between the binary tree matrix representation and a sequence of shift/reduce actions, the inference network defines a distribution over ${\mathcal{Z}}_{T}$ via ${q}_{\varphi}(\mathbf{z}\mathbf{x})\triangleq {q}_{\varphi}({f}^{1}(\mathbf{z})\mathbf{x})$. {algorithm}[t] {algorithmic}[1] \ProcedureInside$s$ \Commentscores ${s}_{ij}$ for $i\le j$ \For$i:=1$ to $T$ \Commentlength1 spans \State$\beta [i,i]=\mathrm{exp}({s}_{ii})$ \EndFor\For$\mathrm{\ell}:=1$ to $T1$ \Commentspan length \For$i:=1$ to $T\mathrm{\ell}$ \Commentspan start \State$j=i+\mathrm{\ell}$ \Commentspan end \State$\beta [i,j]={\sum}_{k=i}^{j1}\mathrm{exp}({s}_{ij})\cdot \beta [i,k]\cdot \beta [k+1,j]$ \EndFor\EndFor\Statereturn $\beta [1,T]$\Commentreturn partition function ${Z}_{T}(\mathbf{x})$ \EndProcedure
2.3 Optimization
For optimization, we use the following variant of the ELBO,
$${\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mathbf{x})}[\mathrm{log}{p}_{\theta}(\mathbf{x},\mathbf{z})]+\mathbb{H}[{q}_{\varphi}(\mathbf{z}\mathbf{x})],$$ 
where $\mathbb{H}[{q}_{\varphi}(\mathbf{z}\mathbf{x})]={\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mathbf{x})}[\mathrm{log}{q}_{\varphi}(\mathbf{z}\mathbf{x})]$ is the entropy of the variational posterior. A Monte Carlo estimate for the gradient with respect to $\theta $ is
$${\nabla}_{\theta}ELBO(\theta ,\varphi ;\mathbf{x})\approx \frac{1}{K}\sum _{k=1}^{K}{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(\mathbf{x},{\mathbf{z}}^{(k)}),$$ 
with samples ${\mathbf{z}}^{(1)},\mathrm{\dots},{\mathbf{z}}^{(K)}$ from ${q}_{\varphi}(\mathbf{z}\mathbf{x})$. Sampling uses the intermediate values calculated during the inside algorithm to sample split points recursively (Goodman, 1998; Finkel et al., 2006), as shown in Algorithm 2.3. The gradient with respect to $\varphi $ involves two parts. The entropy term $\mathbb{H}[{q}_{\varphi}(\mathbf{z}\mathbf{x})]$ can be calculated exactly in $O({T}^{3})$, again using the intermediate values from the inside algorithm (see Algorithm 3.1).^{8}^{8} 8 We adapt the algorithm for calculating tree entropy in PCFGs from Hwa (2000) to the CRF case. Since each step of this dynamic program is differentiable, we can obtain the gradient ${\nabla}_{\varphi}\mathbb{H}[{q}_{\varphi}(\mathbf{z}\mathbf{x})]$ using automatic differentation.^{9}^{9} 9 ${\nabla}_{\varphi}\mathbb{H}[{q}_{\varphi}(\mathbf{z}\mathbf{x})]$ can also be computed using the insideoutside algorithm and a secondorder expectation semiring (Li and Eisner, 2009), which has the same asymptotic runtime complexity but generally better constants. An estimator for the gradient with respect to ${\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mathbf{x})}[\mathrm{log}{p}_{\theta}(\mathbf{x},\mathbf{z})]$ is obtained via the score function gradient estimator (Glynn, 1987; Williams, 1992),
${\nabla}_{\varphi}$  ${\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mathbf{x})}[\mathrm{log}{p}_{\theta}(\mathbf{x},\mathbf{z})]$  
$={\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mathbf{x})}[\mathrm{log}{p}_{\theta}(\mathbf{x},\mathbf{z}){\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}(\mathbf{z}\mathbf{x})]$  
$\approx {\displaystyle \frac{1}{K}}{\displaystyle \sum _{k=1}^{K}}\mathrm{log}{p}_{\theta}(\mathbf{x},{\mathbf{z}}^{(k)}){\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}({\mathbf{z}}^{(k)}\mathbf{x}).$ 
The above estimator is unbiased but typically suffers from high variance. To reduce variance, we use a control variate derived from an average of the other samples’ joint likelihoods (Mnih and Rezende, 2016), yielding the following estimator,
$$\frac{1}{K}\sum _{k=1}^{K}(\mathrm{log}{p}_{\theta}(\mathbf{x},{\mathbf{z}}^{(k)}){r}^{(k)}){\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}({\mathbf{z}}^{(k)}\mathbf{x}),$$ 
where ${r}^{(k)}=\frac{1}{K1}{\sum}_{j\ne k}\mathrm{log}{p}_{\theta}(\mathbf{x},{\mathbf{z}}^{(j)})$. This control variate worked better than alternatives such as estimates of baselines from an auxiliary network (Mnih and Gregor, 2014; Deng et al., 2018) or a language model (Yin et al., 2018).
[t] {algorithmic}[1] \ProcedureSample$\beta $ \Comment$\beta $ from running Inside(s) \State$\mathbf{B}=\mathrm{\U0001d7ce}$ \Commentbinary matrix representation of tree \State$Q=[(1,T)]$ \Commentqueue of constituents \While$Q$ is not empty \State$(i,j)=pop(Q)$ \State$\tau ={\sum}_{k=i}^{j1}\beta [i,k]\cdot \beta [k+1,j]$ \For$k:=i$ to $j1$ \Commentget distribution over splits \State${w}_{k}=(\beta [i,k]\cdot \beta [k+1,j])/\tau $ \EndFor\State$k\sim Cat([{w}_{i},\mathrm{\dots},{w}_{j1}])$ \Commentsample a split point \State${\mathbf{B}}_{i,k}=1,{\mathbf{B}}_{k+1,j}=1$ \Commentupdate $\mathbf{B}$ \If$k>i$ \Commentif left child has width $>$ 1 \State$push(Q,(i,k))$ \Commentadd to queue \EndIf\If$$ \Commentif right child has width $>$ 1 \State$push(Q,(k+1,j))$ \Commentadd to queue \EndIf\EndWhile\State$\mathbf{z}=f(\mathbf{B})$ \Comment $f:{\mathcal{B}}_{T}\to {\mathcal{Z}}_{T}$ maps matrix representation of tree to sequence of actions. \Statereturn $\mathbf{z}$ \EndProcedure
3 Experimental Setup
3.1 Data
For English we use the Penn Treebank (Marcus et al., 1993, PTB) with splits and preprocessing from Dyer et al. (2016) which retains punctuation and replaces singleton words with Berkeley parser’s mapping rules, resulting in a vocabulary of 23,815 word types.^{10}^{10} 10 https://github.com/clab/rnng Notably this is much larger than the standard PTB LM setup from Mikolov et al. (2010) which uses 10K types.^{11}^{11} 11 Both versions of the PTB data can be obtained from http://demo.clab.cs.cmu.edu/cdyer/ptblm.tar.gz. Also different from the LM setup, we model each sentence separately instead of carrying information across sentence boundaries, as the RNNG is a generative model of sentences. Hence our perplexity numbers are not comparable to the PTB LM results (Melis et al., 2018; Merity et al., 2018; Yang et al., 2018).
Since the PTB is rather small, and since the URNNG does not require annotation, we also test our approach on a subset of the one billion word corpus (Chelba et al., 2013). We randomly sample 1M sentences for training and 2K sentences for validation/test, and limit the vocabulary to 30K word types. While still a subset of the full corpus (which has 30M sentences), this dataset is two orders of magnitude larger than PTB. Experiments on Chinese utilize version 5.1 of the Chinese Penn Treebank (CTB) (Xue et al., 2005), with the same splits as in Chen and Manning (2014). Singleton words are replaced with a single $\u27e8$unk$\mathrm{\u27e9}$ token, resulting in a vocabulary of 17,489 word types. {algorithm}[t] {algorithmic}[1] \ProcedureEntropy$\beta $ \Comment$\beta $ from running Inside(s) \For$i:=1$ to $T$ \Commentinitialize entropy table \State$H[i,i]=0$ \EndFor\For$l:=1$ to $T1$ \Commentspan length \For$i:=1$ to $Tl$ \Commentspan start \State$j=i+l$ \Commentspan end \State$\tau ={\sum}_{u=i}^{j1}\beta [i,u]\cdot \beta [u+1,j]$ \For$u:=i$ to $j1$ \State${w}_{u}=(\beta [i,u]\cdot \beta [u+1,j])/\tau $ \EndFor\State$H[i,j]={\sum}_{u=i}^{j1}(H[i,u]+H[u+1,j]$ \State$\mathrm{log}{w}_{u})\cdot w{}_{u}$ \EndFor\EndFor\Statereturn $H[1,T]$\Commentreturn tree entropy $\mathbb{H}[{q}_{\varphi}(\mathbf{z}\mathbf{x})]$ \EndProcedure
3.2 Training and Hyperparameters
The stack LSTM has two layers with input/hidden size equal to 650 and dropout of 0.5. The tree LSTM also has 650 units. The inference network uses a onelayer bidirectional LSTM with 256 hidden units, and the MLP (to produce span scores ${s}_{ij}$ for $i\le j$) has a single hidden layer with a $ReLU$ nonlinearity followed by layer normalization (Ba et al., 2016) and dropout of 0.5. We share word embeddings between the generative model and the inference network, and also tie weights between the input/output word embeddings (Press and Wolf, 2016).
Optimization of the model itself required standard techniques for avoiding posterior collapse in VAEs.^{12}^{12} 12 Posterior collapse in our context means that ${q}_{\varphi}(\mathbf{z}\mathbf{x})$ always produced trivial (always left or right branching) trees. We warmup the ELBO objective by linearly annealing (per batch) the weight on the conditional prior $$ and the entropy $\mathbb{H}[{q}_{\varphi}(\mathbf{z}\mathbf{x})]$ from 0 to 1 over the first two epochs (see equation (1) for definition of $$). This is analogous to KLannealing in VAEs with continuous latent variables (Bowman et al., 2016; Sønderby et al., 2016). We train for 18 epochs (enough for convergence for all models) with a batch size of 16 and $K=8$ samples for the Monte Carlo gradient estimators. The generative model is optimized with SGD with learning rate equal to 1, except for the affine layer that produces a distribution over the actions, which has learning rate 0.1. Gradients of the generative model are clipped at 5. The inference network is optimized with Adam (Kingma and Ba, 2015) with learning rate 0.0001, ${\beta}_{1}=0.9,{\beta}_{2}=0.999$, and gradient clipping at 1. As Adam converges significantly faster than SGD (even with a much lower learning rate), we stop training the inference network after the first two epochs. Initial model parameters are sampled from $\mathcal{U}[0.1,0.1]$. The learning rate starts decaying by a factor of 2 each epoch after the first epoch at which validation performance does not improve, but this learning rate decay is not triggered for the first eight epochs to ensure adequate training. We use the same hyperparameters/training setup for both PTB and CTB. For experiments on (the subset of) the one billion word corpus, we use a smaller dropout rate of 0.1. The baseline RNNLM also uses the smaller dropout rate.
All models are trained with an endofsentence token, but for perplexity calculation these tokens are not counted to be comparable to prior work (Dyer et al., 2016; Kuncoro et al., 2017; Buys and Blunsom, 2018). To be more precise, the inference network does not make use of the endofsentence token to produce parse trees, but the generative model is trained to generate the endofsentence token after the final reduce operation.
3.3 Baselines
We compare the unsupervised RNNG (URNNG) against several baselines: (1) RNNLM, a standard RNN language model whose size is the same as URNNG’s stack LSTM; (2) Parsing Reading Predict Network (PRPN) (Shen et al., 2018), a neural language model that uses gated attention layers to embed soft treelike structures into a neural network (and among the current stateoftheart in grammar induction from words on the full corpus); (3) RNNG with trivial trees (left branching, right branching, random); (4) supervised RNNG trained on unlabeled, binarized gold trees.^{13}^{13} 13 We use right branching binarization—Matsuzaki et al. (2005) find that differences between various binarization schemes have marginal impact. Our supervised RNNG therefore differs the original RNNG, which trains on nonbinarized trees and does not ignore constituent labels. Note that the supervised RNNG also trains a discriminative parser ${q}_{\varphi}(\mathbf{z}\mathbf{x})$ (alongside the generative model ${p}_{\theta}(\mathbf{x},\mathbf{z})$) in order to sample parse forests for perplexity evaluation (i.e. importance sampling). This discriminative parser has the same architecture as URNNG’s inference network. For all models, we perform early stopping based on validation perplexity.
4 Results and Discussion
PTB  CTB  
Model  PPL  ${F}_{1}$  PPL  ${F}_{1}$ 
RNNLM  93.2  –  201.3  – 
PRPN (default)  126.2  32.9  290.9  32.9 
PRPN (tuned)  96.7  41.2  216.0  36.1 
Left Branching Trees  100.9  10.3  223.6  12.4 
Right Branching Trees  93.3  34.8  203.5  20.6 
Random Trees  113.2  17.0  209.1  17.4 
URNNG  90.6  40.7  195.7  29.1 
RNNG  88.7  68.1  193.1  52.3 
RNNG $\to $ URNNG  85.9  67.7  181.1  51.9 
Oracle Binary Trees  –  82.5  –  88.6 
4.1 Language Modeling
Table 1 shows perplexity for the different models on PTB/CTB. As a language model URNNG outperforms an RNNLM and is competitive with the supervised RNNG.^{14}^{14} 14 For RNNG and URNNG we estimate the log marginal likelihood (and hence, perplexity) with $K=1000$ importanceweighted samples, $\mathrm{log}{p}_{\theta}(\mathbf{x})\approx \mathrm{log}\left(\frac{1}{K}{\sum}_{k=1}^{K}\frac{\mathrm{log}p(\mathbf{x},{\mathbf{z}}^{(k)})}{{q}_{\varphi}({\mathbf{z}}^{(k)}\mathbf{x})}\right)$. During evaluation only, we also flatten ${q}_{\varphi}(\mathbf{z}\mathbf{x})$ by dividing span scores ${s}_{ij}$ by a temperature term $2.0$ before feeding it to the CRF. The left branching baseline performs poorly, implying that the strong performance of URNNG/RNNG is not simply due to the additional depth afforded by the tree LSTM composition function (a left branching tree, which always performs reduce when possible, is the “deepest” model). The right branching baseline is essentially equivalent to an RNNLM and hence performs similarly. We found PRPN with default hyperparameters (which obtains a perplexity of 62.0 in the PTB setup from Mikolov et al. (2010)) to not perform well, but tuning hyperparameters improves performance.^{15}^{15} 15 Using the code from https://github.com/yikangshen/PRPN, we tuned model size, initialization, dropout, learning rate, and use of batch normalization. The supervised RNNG performs well as a language model, despite being trained on the joint (rather than marginal) likelihood objective.^{16}^{16} 16 RNNG is trained to maximize $\mathrm{log}{p}_{\theta}(\mathbf{x},\mathbf{z})$ while URNNG is trained to maximize (a lower bound on) the language modeling objective $\mathrm{log}{p}_{\theta}(\mathbf{x})$. This indicates that explicit modeling of syntax helps generalization even with richlyparameterized neural models. Encouraged by these observations, we also experiment with a hybrid approach where we train a supervised RNNG first and continue finetuning the model (including the inference network) on the URNNG objective (RNNG $\to $ URNNG in Table 1).^{17}^{17} 17 We finetune for 10 epochs and use a smaller learning rate of 0.1 for the generative model. This approach results in nontrivial perplexity improvements, and suggests that it is potentially possible to improve language models with supervision on parsed data.
In Figure 2 we show perplexity by sentence length. We find that a standard language model (RNNLM) is better at modeling short sentences, but underperforms models that explicitly take into account structure (RNNG/URNNG) when the sentence length is greater than 10. Table 2 (top) compares our results against prior work on this version of the PTB, and Table 2 (bottom) shows the results on a 1M sentence subset of the one billion word corpus, which is two orders of magnitude larger than PTB. On this larger dataset URNNG still improves upon the RNNLM. We also trained an RNNG (and RNNG $\to $ URNNG) on this dataset by parsing the training set with the selfattentive parser from Kitaev and Klein (2018).^{18}^{18} 18 To parse the training set we use the benepar_en2 model from https://github.com/nikitakit/selfattentiveparser, which obtains an ${F}_{1}$ score of 95.17 on the PTB test set. These models improve upon the RNNLM but not the URNNG, potentially highlighting the limitations of using predicted trees for supervising RNNGs.
PTB  PPL 

KN 5gram (Dyer et al., 2016)  169.3 
RNNLM (Dyer et al., 2016)  113.4 
Original RNNG (Dyer et al., 2016)  102.4 
Stackonly RNNG (Kuncoro et al., 2017)  101.2 
GatedAttention RNNG (Kuncoro et al., 2017)  100.9 
Generative Dep. Parser (Buys and Blunsom, 2015)  138.6 
RNNLM (Buys and Blunsom, 2018)  100.7 
Sup. Syntactic NLM (Buys and Blunsom, 2018)  107.6 
Unsup. Syntactic NLM (Buys and Blunsom, 2018)  125.2 
PRPN${}^{\u2020}$ (Shen et al., 2018)  96.7 
This work:  
RNNLM  93.2 
URNNG  90.6 
RNNG  88.7 
RNNG $\to $ URNNG  85.9 
1M Sentences  PPL 
PRPN${}^{\u2020}$ (Shen et al., 2018)  77.7 
RNNLM  77.4 
URNNG  71.8 
RNNG${}^{\u2021}$  72.9 
RNNG${}^{\u2021}$ $\to $ URNNG  72.0 
4.2 Grammar Induction
Table 1 also shows the ${F}_{1}$ scores for grammar induction. Note that we induce latent trees directly from words on the full dataset.^{19}^{19} 19 Past work on grammar induction usually train/evaluate on short sentences and also assume access to gold POS tags (Klein and Manning, 2002; Smith and Eisner, 2004; Bod, 2006). However more recent works train directly on words (Jin et al., 2018; Shen et al., 2018; Drozdov et al., 2019). For RNNG/URNNG we obtain the highest scoring tree from ${q}_{\varphi}(\mathbf{z}\mathbf{x})$ through the Viterbi inside (i.e. CKY) algorithm.^{20}^{20} 20 Alternatively, we could estimate $\mathrm{arg}{\mathrm{max}}_{\mathbf{z}}{p}_{\theta}(\mathbf{z}\mathbf{x})$ by sampling parse trees from ${q}_{\varphi}(\mathbf{z}\mathbf{x})$ and using ${p}_{\theta}(\mathbf{x},\mathbf{z})$ to rerank the output, as in Dyer et al. (2016). We calculate unlabeled ${F}_{1}$ using evalb, which ignores punctuation and discards trivial spans (widthone and sentence spans).^{21}^{21} 21 Available at https://nlp.cs.nyu.edu/evalb/. We evaluate with COLLINS.prm parameter file and LABELED option equal to 0. We observe that the setup for grammar induction varies widely across different papers: lexicalized vs. unlexicalized; use of punctuation vs. not; separation of train/test sets; counting sentencelevel spans for evaluation vs. ignoring them; use of additional data; length cutoff for training/evaluation; corpuslevel ${F}_{1}$ vs. sentencelevel ${F}_{1}$; and, more. In our survey of twenty or so papers, almost no two papers were identical in their setup. Such variation makes it difficult to meaningfully compare models across papers. Hence, we report grammar induction results mainly for the models and baselines considered in the present work. Since we compare ${F}_{1}$ against the original, nonbinarized trees (per convention), ${F}_{1}$ scores of models using oracle binarized trees constitute the upper bounds.
We confirm the replication study of Htut et al. (2018) and find that PRPN is a strong model for grammar induction. URNNG performs on par with PRPN on English but PRPN does better on Chinese; both outperform right branching baselines. Table 3 further analyzes the learned trees and shows the ${F}_{1}$ score of URNNG trees against other trees (left), and the recall of URNNG/PRPN trees against ground truth constituents (right). We find that trees induced by URNNG and PRPN are quite different; URNNG is more sensitive to SBAR and VP, while PRPN is better at identifying NP. While left as future work, this naturally suggests a hybrid approach wherein the intersection of constituents from URNNG and PRPN is used to create a corpus of partially annotated trees, which can be used to guide another model, e.g. via posterior regularization (Ganchev et al., 2010) or semisupervision (Hwa, 1999). Finally, Table 4 compares our results using the same evaluation setup as in Drozdov et al. (2019), which differs considerably from our setup.
Tree  PTB  CTB 

Gold  40.7  29.1 
Left  9.2  8.4 
Right  68.3  51.2 
Self  92.3  87.3 
RNNG  55.4  47.1 
PRPN  41.0  47.2 
Label  URNNG  PRPN 

SBAR  74.8$\%$  28.9$\%$ 
NP  39.5$\%$  63.9$\%$ 
VP  76.6$\%$  27.3$\%$ 
PP  55.8$\%$  55.1$\%$ 
ADJP  33.9$\%$  42.5$\%$ 
ADVP  50.4$\%$  45.1$\%$ 
${F}_{1}$  $+$PP  

PRPNUP${}^{\u2021}$  39.8  45.4 
PRPNLM${}^{\u2021}$  42.8  42.4 
ONLSTM${}^{\u2021}$ (Shen et al., 2019)  49.4  $$ 
DIORA${}^{\u2021}$ (Drozdov et al., 2019)  49.6  56.2 
PRPN (tuned)  49.0  49.9 
URNNG  52.4  52.4 
4.3 Distributional Metrics
Table 5 shows some standard metrics related to the learned generative model/inference network. The “reconstruction” perplexity based on ${\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mathbf{x})}[\mathrm{log}{p}_{\theta}(\mathbf{x}\mathbf{z})]$ is much lower than regular perplexity, and further, the KullbackLeibler divergence between the conditional prior and the variational posterior, given by
$$ 
is highly nonzero. (See equation (1) for definitions of $\mathrm{log}{p}_{\theta}(\mathbf{x}\mathbf{z})$ and $$). This indicates that the latent space is being used in a meaningful way and that there is no posterior collapse (Bowman et al., 2016). As expected, the entropy of the variational posterior is much lower than the entropy of the conditional prior, but there is still some uncertainty in the posterior.
PTB  CTB  

RNNG  URNNG  RNNG  URNNG  
PPL  88.7  90.6  193.1  195.7 
Recon. PPL  74.6  73.4  183.4  151.9 
KL  7.10  6.13  11.11  8.91 
Prior Entropy  7.65  9.61  9.48  15.13 
Post. Entropy  1.56  2.28  6.23  5.75 
Unif. Entropy  26.07  26.07  30.17  30.17 
4.4 Syntactic Evaluation
We perform a syntactic evaluation of the different models based on the setup from Marvin and Linzen (2018): the model is given two minimally different sentences, one grammatical and one ungrammatical, and must identify the grammatical sentence by assigning it higher probability.^{22}^{22} 22 We modify the publicly available dataset from https://github.com/BeckyMarvin/LM_syneval to only keep sentence pairs that did not have any unknown words with respect to our vocabulary, resulting in 80K sentence pairs for evaluation. Further, we train on a much smaller corpus, and hence our results are not directly comparable. Table 6 shows the accuracy results. Overall the supervised RNNG significantly outperforms the other models, indicating opportunities for further work in unsupervised modeling. While the URNNG does slightly outperform an RNNLM, the distribution of errors made from both models are similar, and thus it is not clear whether the outperformance is simply due to better perplexity or learning different structural biases.
4.5 Limitations
There are several limitations to our approach. For one, the URNNG takes considerably more time/memory to train than a standard language model due to the $O({T}^{3})$ dynamic program in the inference network, multiple samples to obtain lowvariance gradient estimators, and dynamic computation graphs that make efficient batching nontrivial.^{23}^{23} 23 The main time bottleneck is the dynamic compution graph, since the dynamic programming algorithm can be batched (however the latter is a significant memory bottleneck). We manually batch the shift and reduce operation as much as possible, though recent work on autobatching (Neubig et al., 2017) could potentially make this easier/faster. The model is sensitive to hyperparameters and required various optimization strategies (e.g. separate optimizers for the inference network and the generative model) to avoid posterior collapse. Finally, the URNNG also seemed to rely heavily on punctuation to identify constituents and we were unable to improve upon a rightbranching baseline when training the URNNG on a version of PTB where punctuation is removed.^{24}^{24} 24 Many prior works that induce trees directly from words often employ additional heuristics based on punctuation (Seginer, 2007; Ponvert et al., 2011; Spitkovsky et al., 2013; Parikh et al., 2014), as punctuation (e.g. comma) is usually a reliable signal for start/end of constituent spans. The URNNG still has to learn to rely on punctuation, similar to recent works such as depthbounded PCFGs (Jin et al., 2018) and DIORA (Drozdov et al., 2019). In contrast, PRPN (Shen et al., 2018) and Ordered Neurons (Shen et al., 2019) induce trees by directly training on corpus without punctuation. We also reiterate that punctuation is used during training but ignored during evaluation (except in Table 4).
RNNLM  PRPN  RNNG  URNNG  

PPL  93.2  96.7  88.7  90.6 
Overall  62.5$\%$  61.9$\%$  69.3$\%$  64.6$\%$ 
$$ Subj.  63.5$\%$  63.7$\%$  89.4$\%$  67.2$\%$ 
$$ Obj. Rel.  62.6$\%$  61.0$\%$  67.6$\%$  65.7$\%$ 
$$ Refl.  60.7$\%$  68.8$\%$  57.3$\%$  60.5$\%$ 
$$ NPI  58.7$\%$  39.5$\%$  46.8$\%$  55.0$\%$ 
5 Related Work
There has been much work on incorporating tree structures into deep models for syntaxaware language modeling, both for unconditional (Emami and Jelinek, 2005; Buys and Blunsom, 2015; Dyer et al., 2016) and conditional (Yin and Neubig, 2017; AlvarezMelis and Jaakkola, 2017; Rabinovich et al., 2017; Aharoni and Goldberg, 2017; Eriguchi et al., 2017; Wang et al., 2018; G$\overline{\text{u}}$ et al., 2018) cases. These approaches generally rely on annotated parse trees during training and maximizes the joint likelihood of sentencetree pairs. Prior work on combining language modeling and unsupervised tree learning typically embed soft, treelike structures as hidden layers of a deep network (Cho et al., 2014; Chung et al., 2017; Shen et al., 2018, 2019). In contrast, Buys and Blunsom (2018) make Markov assumptions and perform exact marginalization over latent dependency trees. Our work is also related to the recent line of work on learning latent trees as part of a deep model through supervision on other tasks, typically via differentiable structured hidden layers (Kim et al., 2017; Bradbury and Socher, 2017; Liu and Lapata, 2018; Tran and Bisk, 2018; Peng et al., 2018; Niculae et al., 2018; Liu et al., 2018), policy gradientbased approaches (Yogatama et al., 2017; Williams et al., 2018; Havrylov et al., 2019), or differentiable relaxations (Choi et al., 2018; Maillard and Clark, 2018).
The variational approximation uses amortized inference (Kingma and Welling, 2014; Mnih and Gregor, 2014; Rezende et al., 2014), in which an inference network is used to obtain the variational posterior for each observed $\mathbf{x}$. Since our inference network is structured (i.e., a CRF), it is also related to CRF autoencoders (Ammar et al., 2014) and structured VAEs (Johnson et al., 2016; Krishnan et al., 2017), which have been used previously for unsupervised (Cai et al., 2017; Drozdov et al., 2019; Li et al., 2019) and semisupervised (Yin et al., 2018; Corro and Titov, 2019) parsing.
6 Conclusion
It is an open question as to whether explicit modeling of syntax significantly helps neural models. Strubell et al. (2018) find that supervising intermediate attention layers with syntactic heads improves semantic role labeling, while Shi et al. (2018) observe that for text classification, syntactic trees only have marginal impact. Our work suggests that at least for language modeling, incorporating syntax either via explicit supervision or as latent variables does provide useful inductive biases and improves performance.
Finally, in modeling child language acquisition, the complex interaction of the parser and the grammatical knowledge being acquired is the object of much investigation (Trueswell and Gleitman, 2007); our work shows that apparently grammatical constraints can emerge from the interaction of a constrained parser and a more general grammar learner, which is an intriguing but underexplored hypothesis for explaining human linguistic biases.
Acknowledgments
We thank the members of the DeepMind language team for helpful feedback. YK is supported by a Google Fellowship. AR is supported by NSF Career 1845664.
References
 Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Towards StringtoTree Neural Machine Translation. In Proceedings of ACL.
 AlvarezMelis and Jaakkola (2017) David AlvarezMelis and Tommi S. Jaakkola. 2017. Treestructured Decoding with DoublyRecurrent Neural Networks. In Proceedings of ICLR.
 Ammar et al. (2014) Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional Random Field Autoencoders for Unsupervised Structured Prediction. In Proceedings of NIPS.
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. In Proceedings of NIPS.
 Baker (1979) James K. Baker. 1979. Trainable Grammars for Speech Recognition. In Proceedings of the Spring Conference of the Acoustical Society of America.
 Bod (2006) Rens Bod. 2006. An AllSubtrees Approach to Unsupervised Parsing. In Proceedings of ACL.
 Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyal, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of CoNLL.
 Bradbury and Socher (2017) James Bradbury and Richard Socher. 2017. Towards Neural Machine Translation with Latent Tree Attention. In Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing.
 Buys and Blunsom (2015) Jan Buys and Phil Blunsom. 2015. Generative Incremental Dependency Parsing with Neural Networks. In Proceedings of ACL.
 Buys and Blunsom (2018) Jan Buys and Phil Blunsom. 2018. Neural Syntactic Generative Models with Exact Marginalization. In Proceedings of NAACL.
 Cai et al. (2017) Jiong Cai, Yong Jiang, and Kewei Tu. 2017. CRF Autoencoder for Unsupervised Dependency Parsing. In Proceedings of EMNLP.
 Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv:1312.3005.
 Chen and Manning (2014) Danqi Chen and Christopher D. Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of EMNLP.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: EncoderDecoder Approaches. In Proceedings of Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
 Choi et al. (2018) Jihun Choi, Kang Min Yoo, and Sang goo Lee. 2018. Learning to Compose TaskSpecific Tree Structures. In Proceedings of AAAI.
 Chung et al. (2017) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical Multiscale Recurrent Neural Networks. In Proceedings of ICLR.
 Corro and Titov (2019) Caio Corro and Ivan Titov. 2019. Differentiable PerturbandParse: SemiSupervised Parsing with a Structured Variational Autoencoder. In Proceedings of ICLR.
 Cremer et al. (2018) Chris Cremer, Xuechen Li, and David Duvenaud. 2018. Inference Suboptimality in Variational Autoencoders. In Proceedings of ICML.
 Deng et al. (2018) Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M. Rush. 2018. Latent Alignment and Variational Attention. In Proceedings of NIPS.
 Drozdov et al. (2019) Andrew Drozdov, Patrick Verga, Mohit Yadev, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised Latent Tree Induction with Deep InsideOutside Recursive AutoEncoders. In Proceedings of NAACL.
 Durrett and Klein (2015) Greg Durrett and Dan Klein. 2015. Neural CRF Parsing. In Proceedings of ACL.
 Dyer et al. (2015) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. TransitionBased Dependency Parsing with Stack Long ShortTerm Memory. In Proceedings of ACL.
 Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent Neural Network Grammars. In Proceedings of NAACL.
 Emami and Jelinek (2005) Ahmad Emami and Frederick Jelinek. 2005. A Neural Syntactic Language Model. Machine Learning, 60:195–227.
 Eriguchi et al. (2017) Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to Parse and Translate Improves Neural Machine Translation. In Proceedings of ACL.
 Finkel et al. (2008) Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, Featurebased, Conditional Random Field Parsing. In Proceedings of ACL.
 Finkel et al. (2006) Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines. In Proceedings of EMNLP.
 Ganchev et al. (2010) Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior Regularization for Structured Latent Variable Models. Journal of Machine Learning Research, 11:2001–2049.
 Glynn (1987) Peter Glynn. 1987. Likelihood Ratio Gradient Estimation: An Overview. In Proceedings of Winter Simulation Conference.
 Goodman (1998) Joshua Goodman. 1998. Parsing InsideOut. PhD thesis, Harvard University.
 G$\overline{\text{u}}$ et al. (2018) Jetic G$\overline{\text{u}}$, Hassan S. Shavarani, and Anoop Sarkar. 2018. Topdown Tree Structured Decoding with Syntactic Connections for Neural Machine Translation and Parsing. In Proceedings of EMNLP.
 Hale et al. (2018) John Hale, Chris Dyer, Adhiguna Kuncoro, and Jonathan R. Brennan. 2018. Finding Syntax in Human Encephalography with Beam Search. In Proceedings of ACL.
 Havrylov et al. (2019) Serhii Havrylov, Germán Kruszewski, and Armand Joulin. 2019. Cooperative Learning of Disjoint Syntax and Semantics. In Proceedings of NAACL.
 Htut et al. (2018) Phu Mon Htut, Kyunghyun Cho, and Samuel R. Bowman. 2018. Grammar Induction with Neural Language Models: An Unusual Replication. In Proceedings of EMNLP.
 Hwa (1999) Rebecca Hwa. 1999. Supervised Grammar Induction Using Training Data with Limited Constituent Information. In Proceedings of ACL.
 Hwa (2000) Rebecca Hwa. 2000. Sample Selection for Statistical Grammar Induction. In Proceedings of EMNLP.
 Jin et al. (2018) Lifeng Jin, Finale DoshiVelez, Timothy Miller, William Schuler, and Lane Schwartz. 2018. Unsupervised Grammar Induction with Depthbounded PCFG. In Proceedings of TACL.
 Johnson et al. (2007) Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Bayesian Inference for PCFGs via Markov chain Monte Carlo. In Proceedings of NAACL.
 Johnson et al. (2016) Matthew Johnson, David K. Duvenaud, Alex Wiltschko, Ryan P. Adams, and Sandeep R. Datta. 2016. Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference. In Proceedings of NIPS.
 Kawakami et al. (2018) Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2018. Unsupervised Word Discovery with Segmental Neural Language Models. arXiv:1811.09353.
 Kim et al. (2017) Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured Attention Networks. In Proceedings of ICLR.
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR.
 Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. AutoEncoding Variational Bayes. In Proceedings of ICLR.
 Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency Parsing with a SelfAttentive Encoder. In Proceedings of ACL.
 Klein and Manning (2002) Dan Klein and Christopher Manning. 2002. A Generative ConstituentContext Model for Improved Grammar Induction. In Proceedings of ACL.
 Krishnan et al. (2017) Rahul G. Krishnan, Uri Shalit, and David Sontag. 2017. Structured Inference Networks for Nonlinear State Space Models. In Proceedings of AAAI.
 Kuncoro et al. (2017) Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What Do Recurrent Neural Network Grammars Learn About Syntax? In Proceedings of EACL.
 Kuncoro et al. (2018) Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs Can Learn SyntaxSensitive Dependencies Well, But Modeling Structure Makes Them Better. In Proceedings of ACL.
 Li et al. (2019) Bowen Li, Jianpeng Cheng, Yang Liu, and Frank Keller. 2019. Dependency Grammar Induction with a Neural Variational Transitionbased Parser. In Proceedings of AAAI.
 Li and Eisner (2009) Zhifei Li and Jason Eisner. 2009. First and SecondOrder Expectation Semirings with Applications to MinimumRisk Training on Translation Forests. In Proceedings of EMNLP.
 Liu et al. (2018) Yang Liu, Matt Gardner, and Mirella Lapata. 2018. Structured Alignment Networks for Matching Sentences. In Proceedings of EMNLP.
 Liu and Lapata (2018) Yang Liu and Mirella Lapata. 2018. Learning Structured Text Representations. In Proceedings of TACL.
 Maillard and Clark (2018) Jean Maillard and Stephen Clark. 2018. Latent Tree Learning with Differentiable Parsers: ShiftReduce Parsing and Chart Parsing. In Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP.
 Marcus et al. (1993) Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.
 Marvin and Linzen (2018) Rebecca Marvin and Tal Linzen. 2018. Targeted Syntactic Evaluation of Language Models. In Proceedings of EMNLP.
 Matsuzaki et al. (2005) Takuya Matsuzaki, Yusuke Miyao, and Junichi Tsujii. 2005. Probabilistic CFG with Latent Annotations. In Proceedings of ACL.
 Melis et al. (2018) Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the State of the Art of Evaluation in Neural Language Models. In Proceedings of ICLR.
 Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and Optimizing LSTM Language Models. In Proceedings of ICLR.
 Mikolov et al. (2010) Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent Neural Network Based Language Model. In Proceedings of INTERSPEECH.
 Mnih and Gregor (2014) Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. In Proceedings of ICML.
 Mnih and Rezende (2016) Andriy Mnih and Danilo J. Rezende. 2016. Variational Inference for Monte Carlo Objectives. In Proceedings of ICML.
 Naseem et al. (2010) Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using Universal Linguistic Knowledge to Guide Grammar Induction. In Proceedings of EMNLP.
 Neubig et al. (2017) Graham Neubig, Yoav Goldberg, and Chris Dyer. 2017. Onthefly Operation Batching in Dynamic Computation Graphs. arXiv:1705.07860.
 Niculae et al. (2018) Vlad Niculae, André F. T. Martins, and Claire Cardie. 2018. Towards Dynamic Computation Graphs via Sparse Latent Structure. In Proceedings of EMNLP.
 Parikh et al. (2014) Ankur P. Parikh, Shay B. Cohen, and Eric P. Xing. 2014. Spectral Unsupervised Parsing with Additive Tree Metrics. In Proceedings of ACL.
 Peng et al. (2018) Hao Peng, Sam Thomson, and Noah A. Smith. 2018. Backpropagating through Structured Argmax using a SPIGOT. In Proceedings of ACL.
 Ponvert et al. (2011) Elis Ponvert, Jason Baldridge, and Katrin Erk. 2011. Simpled Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Methods. In Proceedings of ACL.
 Press and Wolf (2016) Ofir Press and Lior Wolf. 2016. Using the Output Embedding to Improve Language Models. In Proceedings of EACL.
 Rabinovich et al. (2017) Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract Syntax Networks for Code Generation and Semantic Parsing. In Proceedings of ACL.
 Rezende et al. (2014) Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of ICML.
 Seginer (2007) Yoav Seginer. 2007. Fast Unsupervised Incremental Parsing. In Proceedings of ACL.
 Shen et al. (2018) Yikang Shen, Zhouhan Lin, ChinWei Huang, and Aaron Courville. 2018. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of ICLR.
 Shen et al. (2019) Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. In Proceedings of ICLR.
 Shi et al. (2018) Haoyue Shi, Hao Zhou, Jiaze Chen, and Lei Li. 2018. On Treebased Neural Sentence Modeling. In Proceedings of EMNLP.
 Smith and Eisner (2004) Noah A. Smith and Jason Eisner. 2004. Annealing Techniques for Unsupervised Statistical Language Learning. In Proceedings of ACL.
 Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. Ladder Variational Autoencoders. In Proceedings of NIPS.
 Spitkovsky et al. (2013) Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2013. Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction. In Proceedings of EMNLP.
 Stern et al. (2017) Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A Minimal SpanBased Neural Constituency Parser. In Proceedings of ACL.
 Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. LinguisticallyInformed SelfAttention for Semantic Role Labeling. In Proceedings of EMNLP.
 Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From TreeStructured Long ShortTerm Memory Networks. In Proceedings of ACL.
 Tran and Bisk (2018) Ke Tran and Yonatan Bisk. 2018. Inducing Grammars with and for Neural Machine Translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation.
 Trueswell and Gleitman (2007) John C. Trueswell and Lila R. Gleitman. 2007. Learning to Parse and its Implications for Language Acquisition. The Oxford Handbook of Psycholinguistics.
 Wang and Chang (2016) Wenhui Wang and Baobao Chang. 2016. Graphbased Dependency Parsing with Bidirectional LSTM. In Proceedings of ACL.
 Wang et al. (2018) Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. A Treebased Decoder for Neural Machine Translation. In Proceedings of EMNLP.
 Williams et al. (2018) Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do Latent Tree Learning Models Identify Meaningful Structure in Sentences? In Proceedings of TACL.
 Williams (1992) Ronald J. Williams. 1992. Simple Statistical Gradientfollowing Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8.
 Wiseman et al. (2018) Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning Neural Templates for Text Generation. In Proceedings of EMNLP.
 Xue et al. (2005) Naiwen Xue, Fei Xia, Fu dong Chiou, and Marta Palmer. 2005. The Penn Chinese Treebank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering, 11:207–238.
 Yang et al. (2018) Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the Softmax Bottleneck: A HighRank RNN Language Model. In Proceedings of ICLR.
 Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for GeneralPurpose Code Generation. In Proceedings of ACL.
 Yin et al. (2018) Pengcheng Yin, Chunting Zhou, Junxian He, and Graham Neubig. 2018. StructVAE: Treestructured Latent Variable Models for Semisupervised Semantic Parsing. In Proceedings of ACL.
 Yogatama et al. (2017) Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to Compose Words into Sentences with Reinforcement Learning. In Proceedings of ICLR.
 Zhu et al. (2015) Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long ShortTerm Memory Over Tree Structures. In Proceedings of ICML.