Abstract
We introduce three memoryaugmented Recurrent Neural Networks (MARNNs) andexplore their capabilities on a series of simple language modeling tasks whosesolutions require stackbased mechanisms. We provide the first demonstration ofneural networks recognizing the generalized Dyck languages, which express thecore of what it means to be a language with hierarchical structure. Ourmemoryaugmented architectures are easy to train in an endtoend fashion andcan learn the Dyck languages over as many as six parenthesispairs, in additionto two deterministic palindrome languages and the stringreversal transductiontask, by emulating pushdown automata. Our experiments highlight the increasedmodeling capacity of memoryaugmented models over simple RNNs, while inflectingour understanding of the limitations of these models.
Quick Read (beta)
MemoryAugmented Recurrent Neural Networks
Can Learn Generalized Dyck Languages
Abstract
We introduce three memoryaugmented Recurrent Neural Networks (MARNNs) and explore their capabilities on a series of simple language modeling tasks whose solutions require stackbased mechanisms. We provide the first demonstration of neural networks recognizing the generalized Dyck languages, which express the core of what it means to be a language with hierarchical structure. Our memoryaugmented architectures are easy to train in an endtoend fashion and can learn the Dyck languages over as many as six parenthesispairs, in addition to two deterministic palindrome languages and the stringreversal transduction task, by emulating pushdown automata. Our experiments highlight the increased modeling capacity of memoryaugmented models over simple RNNs, while inflecting our understanding of the limitations of these models.
MemoryAugmented Recurrent Neural Networks
Can Learn Generalized Dyck Languages
Mirac Suzgun^{1} Sebastian Gehrmann^{1} Yonatan Belinkov^{12} Stuart M. Shieber^{1} ^{1} Harvard John A. Paulson School of Engineering and Applied Sciences ^{2} MIT Computer Science and Artificial Intelligence Laboratory Cambridge, MA, USA {[email protected],{gehrmann,belinkov,shieber}@seas}.harvard.edu
1 Introduction
Recurrent Neural Networks (RNNs) have proven to be an effective and powerful model choice for capturing longdistance dependencies and complex representations in sequential tasks, such as language modeling (Mikolov et al., 2010; Sundermeyer et al., 2012), machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014), and speech recognition (Graves et al., 2013). In theory, RNNs with rational state weights and infinite numeric precision are known to be computationally universal models (Siegelmann and Sontag, 1994, 1995). Yet, in practice, the computational power of RNNs with finite numeric precision is still unknown. Hence, the classes of languages that can be learned, empirically or theoretically, by RNNs with finite numeric precision are still to be discovered.
A natural question arises, then, as to what extent RNN models can learn languages that exemplify important formal properties found in natural languages, such properties as longdistance dependencies, counting, hierarchy, and repetition.
Along these lines, Gers and Schmidhuber (2001); Weiss et al. (2018); Suzgun et al. (2019a) have demonstrated that Long ShortTerm Memory (LSTM; Hochreiter and Schmidhuber (1997)), a popular variant of RNNs, can develop counting mechanisms to recognize simple strictly contextfree and contextsensitive languages, such as ${a}^{n}{b}^{n}$ and ${a}^{n}{b}^{n}{c}^{n}$, as evidenced by analysis of the hidden state values.^{1}^{1} 1 From an automatatheoretic perspective, such languages are expressible with simple oneturn counter machines. By contrast, Weiss et al. have shown that Gated Recurrent Units (GRUs; Cho et al. (2014)), another popular variant of RNNs, cannot perform this type of counting and provided an explanation for some of the difference in performance between LSTMs and GRUs.
Merrill (2019) studied the theoretical expressiveness of various realtime neural networks with finite precision under asymptotic conditions, showing that RNNs and GRUs can capture regular languages whereas LSTMs can further recognize a subset of realtime counter languages. And empirically, Suzgun et al. (2019b) demonstrated that LSTM networks can learn to perform dynamic counting, as exemplified by the wellbalanced parenthesis (Dyck) language ${\mathcal{D}}_{1}$ as well as the shuffles of multiple ${\mathcal{D}}_{1}$ languages.
Counting in realtime is an important property, differentiating some of the language classes in the Chomsky hierarchy, and echoes of it appear in natural languages as well, for instance, in the requirement that the number of arguments of a set of verbs match their subcategorization requirements. But counting does not exhaust the kinds of structural properties that may be apposite for natural language. Chomsky (1957) emphasizes the hierarchical structure found in natural languages, for instance, in the nested matching of “both …and” and “either …or”. Indeed, this kind of nestedmatching phenomenon forms the essence of the strictly contextfree languages (CFLs). Here simple (realtime) counters are not sufficient; a stack is required. The formallanguagetheory reflex of this phenomenon is found most sparely in the Dyck languages ${\mathcal{D}}_{n}$ of wellnested strings over $n$ pairs of brackets, where $n>1$. (We refer to these as the ${\mathcal{D}}_{>1}$ languages.)
The centrality of this nested stack structure in characterizing the class of contextfree languages can be seen in various ways. (i) The automatatheoretic analog of contextfree grammars, the pushdown automaton, is defined by its use of a stack (Chomsky, 1962). (ii) Chomsky and Schützenberger (1963) famously showed that all contextfree languages are homomorphic images of regularintersected ${\mathcal{D}}_{n}$ languages.^{2}^{2} 2 In particular, ${\mathcal{D}}_{2}$ is sufficient for this purpose (Magniez et al., 2014; Suzgun et al., 2019b). (iii) The hardest CFL of Greibach (1973) and the hardest deterministic CFL of Sudborough (1976) are built using Dycklanguagestyle matching. For these reasons, we think of the ${\mathcal{D}}_{>1}$ languages as expressing the core of what it means to be a contextfree language with hierarchical structure, even if it is not itself a universal CFL. This property of the Dyck languages accounts for the heavy focus on the them in prior work (Deleu and Dureau, 2016; Bernardy, 2018; Sennhauser and Berwick, 2018; Skachkova et al., 2018; Hao et al., 2018; Zaremba et al., 2016; Suzgun et al., 2019a; Yu et al., 2019; Hahn, 2019) as well as in this work. It would thus be notable for finite precision neural networks to learn languages, like the ${\mathcal{D}}_{>1}$ languages and other languages requiring a stack, if we want these neural architectures to be able to manifest hierarchical structures.
In this paper, we introduce three enhanced RNN models that consist of recurrent layers and external memory structures, namely stackaugmented RNNs (StackRNNs), stackaugmented LSTMs (StackLSTMs), and Baby Neural Turing Machines (BabyNTMs), and show that they can effectively learn to recognize some ${\mathcal{D}}_{>1}$ languages from limited data by emulating deterministic pushdown automata. Previous studies used simple RNN models (Bernardy, 2018; Sennhauser and Berwick, 2018; Suzgun et al., 2019b; Yu et al., 2019) and memoryaugmented architectures (Hao et al., 2018) to attempt to learn ${\mathcal{D}}_{2}$ under different training platforms; however, none of them were able to obtain good performance on this task. We thus present the first demonstration that a memoryaugmented neural network (MARNN) can learn ${\mathcal{D}}_{>1}$ languages. Moreover, we evaluate the learning capabilities of our architectures on six tasks whose solutions require the employment of stackbased approaches, namely learning the ${\mathcal{D}}_{2}$, ${\mathcal{D}}_{3}$ and ${\mathcal{D}}_{6}$ languages, recognizing the deterministic palindrome language ($w\mathrm{\#}{w}^{R})$ and the deterministic homomorphic palindrome language ($w\mathrm{\#}\phi ({w}^{R})$), and performing the stringreversal transduction task ($w{\mathrm{\#}}^{w}\Rightarrow {\mathrm{\#}}^{w}{w}^{R}$). Our results reflect the better modeling capacity of our MARNNs over the standard RNN and LSTM models in capturing hierarchical representations, in addition to providing an insightful glimpse of the promise of these models for realworld naturallanguage processing tasks.^{3}^{3} 3 Our code is available at https://github.com/suzgunmirac/marnns.
2 Related Work
2.1 Learning Formal Languages Using Neural Networks
Using neural network architectures to recognize formal languages has been a central computational task in gaining an understanding of their expressive ability for application to naturallanguageprocessing tasks. Elman (1991) marked the beginning of such methodological investigations and devised an artificial language learning platform where Simple Recurrent Networks (SRNs) (Elman, 1990), were trained to learn the hierarchical and recursive relationships between clauses. An analysis of the hidden state dynamics revealed that the models learned internal representations that encoded information about the grammatical structure and dependencies of the synthetic language. Later, Das et al. (1992) introduced the first RNN model with an external stack, the Recurrent Neural Network Pushdown Automaton (NNPDA), to learn simple deterministic contextfree grammars.
Following Elman’s work, many studies used SRNs (Steijvers, 1996; Tonkes and Wiles, 1997; Hölldobler et al., 1997; Rodriguez and Wiles, 1998; Bodén et al., 1999; Bodén and Wiles, 2000; Rodriguez, 2001) and stackbased RNNs (Das et al., 1993; Zeng et al., 1994) to recognize simple contextfree and contextsensitive counter languages, including ${a}^{n}{b}^{n}$, ${a}^{n}{b}^{n}c{b}^{m}{a}^{m}$, ${a}^{n}{b}^{m}{B}^{m}{A}^{n}$, ${a}^{n+m}{b}^{n}{c}^{m}$, ${a}^{n}{b}^{n}{c}^{n}$, ${(b{a}^{n})}^{m}$, and ${\mathcal{D}}_{1}$. Nonetheless, none of the SRNbased models were able to generalize far beyond their training set. Some of these studies also focused on understanding and visualizing the internal representations learned by the hidden units of the networks, as well as the computational capabilities and limitations of the models.
In contrast, Gers and Schmidhuber (2001), Schmidhuber et al. (2002), and Gers et al. (2002), showed that their (LSTM) networks could not only competently learn two strictly contextfree languages, ${a}^{n}{b}^{n}$ and ${a}^{n}{b}^{m}{B}^{m}{A}^{n}$, and one strictly contextsensitive language ${a}^{n}{b}^{n}{c}^{n}$, but also generalize far beyond the training datasets.
2.2 MemoryAugmented Neural Networks
Recently, memoryaugmented architectures have been considered for language modeling tasks: Joulin and Mikolov (2015) proposed a differentiable stack structure controlled by an RNN to infer algorithmic patterns that require some combination of counting and memorization. Though their model could learn ${a}^{n}{b}^{n}$, ${a}^{n}{b}^{n}{c}^{n}$, ${a}^{n}{b}^{n}{c}^{n}{d}^{n}$, ${a}^{n}{b}^{2n}$, ${a}^{n}{b}^{m}{c}^{n+m}$, it did not exceed the performance of a standard LSTM on a language modeling task. Inspired by the early architecture design of NNPDA, Grefenstette et al. (2015) introduced LSTM models equipped with unbounded differentiable memory structures, such as stacks, queues, and doublelinked lists, and explored their computational power on synthetic transduction tasks. In their experiments, their NeuralStack and NeuralQueue architectures outperformed the standard LSTM architectures. Neither of these studies on stackaugmented neural architectures inspected the internal representations learned by the recurrent hidden layers or investigated the performance of their models on the Dyck language.
Graves et al. (2014) introduced the Neural Turing Machine (NTM), which consists of a neural network (which can be either feedforward or recurrent) together with a differentiable external memory, and demonstrated its successful performance on a series of simple algorithmic tasks, such as copying, repeated copying, and sorting. At each time step, an NTM can interact with the external memory via its differentiable attention mechanisms and determine its output using the information from the current hidden state together with the filtered context from the external memory. It is evident from the design differences that the degree of freedom of NTMs is much greater than that of stackaugmented recurrent networks. However, this freedom comes at a price: The different ways in which we can attend to the memory to read and write at each time step make the training of the neural models challenging. Since the publication of the original NTM paper, there have been a number of studies addressing instability issues of the NTM architecture, or more broadly memoryaugmented recurrent network models, and proposing new architecture designs. We refer to Zaremba and Sutskever (2015); Kurach et al. (2015); Yang (2016); Graves et al. (2016); Gulcehre et al. (2018) for such proposals.
2.3 Investigations of the Dyck Languages
Deleu and Dureau (2016) used NTMs to capture longdistance dependencies in ${\mathcal{D}}_{1}$. Their examination showed that NTMs indeed learn to emulate stack representations and generalize to longer sequences. However, a model need not be equipped with a stack to recognize this simplest Dyck language in a standard learning environment; counting is sufficient for an automaton to capture ${\mathcal{D}}_{1}$ (Suzgun et al., 2019b; Yu et al., 2019).
In assessing the ability of recurrent neural networks to process deep and longdistance dependencies, Skachkova et al. (2018) and Sennhauser and Berwick (2018) conducted experiments on the Dyck languages to see whether LSTMs could learn nested structures. The former sought to predict the single correct closing parenthesis, given a Dyck word without its final closing symbol. Although LSTMs performed almost perfectly in this completion task, one cannot draw any definitive conclusion about whether these models really learn the Dyck languages, since even counter automata can achieve perfect accuracy on this task.^{4}^{4} 4 For instance, a model can learn to separately count the number of left and right parentheses for each of the $n$ distinct pairs and predict the closing parenthesis for the pair for which the counter is nonzero. Suzgun et al. (2019b) and Yu et al. (2019) also discuss the drawbacks of Skachkova et al.’s learning task and argue that the task is insufficient for illustrating that a network can learn ${\mathcal{D}}_{2}$.
Similarly, Bernardy (2018) used three different recurrent networks, namely LSTM, GRU, and RUSS, and combinations thereof, to predict the next possible parenthesis at each time step, assuming that it is a closing parenthesis. His RUSS model is a purposedesigned model containing recurrent units with stacklike states and appears to generalize well to deeper and longer sequences. However, as the author mentions, the specificity of the RUSS architecture disqualifies it as a practical model choice for realworld language modeling tasks.
Hao et al. (2018) studied the interpretability of Neural Stack models (Grefenstette et al., 2015) in a number of simple language modeling tasks, including parenthesis prediction, string reversal, and XOR evaluation. Though their Neural Stacks exhibited intuitive stack behaviors on their contextfree transduction tasks and performed almost as well as the standard LSTM models, the authors noted that their stackaugmented models were more difficult to train than the traditional LSTMs.
More recently, Suzgun et al. (2019b) corroborated the theoretical findings of Weiss et al. (2018) by showing that RNN, GRU, and LSTM models could perform dynamic counting by recognizing ${\mathcal{D}}_{1}$ as well as shuffles of multiple ${\mathcal{D}}_{1}$ languages by emulating simple $k$counter machines, while being incapable of recognizing ${\mathcal{D}}_{2}$.
Adopting the experimental framework of Sennhauser and Berwick (2018) and the data generation procedure of Skachkova et al. (2018), Yu et al. (2019) conducted experiments on ${\mathcal{D}}_{2}$ under different training schemes and objectives using relatively large bidirectional LSTM models. Their recurrent networks failed to generalize well beyond the scope of their training data to learn ${\mathcal{D}}_{2}$ under the closingparenthesis completion and sequencetosequence settings.^{5}^{5} 5 We note that the authors attempted to generate the shortest proper sequence of closing parentheses given a prefix of a Dyck word under the sequencetosequence framework. This task is different from the previous tasks and requires a generative model.
Finally, Hahn (2019) used ${\mathcal{D}}_{2}$ to explore the theoretical limitations of selfattention architectures (Vaswani et al., 2017). He demonstrated that selfattention models, even when equipped with infinite precision, cannot capture ${\mathcal{D}}_{2}$, unless the number of layers or attention heads increases with the length of the input sequence.
In summary, recognizing the Dyck languages has been an important probing task for understanding the ability of neural networks to capture hierarchical information. Thus far, none of the recurrent neural networks have been shown to capture ${\mathcal{D}}_{>1}$. This present work, therefore, provides the first demonstration of RNNbased models learning ${\mathcal{D}}_{>1}$, in particular, ${\mathcal{D}}_{2}$, ${\mathcal{D}}_{3}$, and ${\mathcal{D}}_{6}$, in addition to other difficult contextfree languages.
3 Models
In this section, we describe the mathematical formulations of our memoryaugmented RNNs.The inspiration for our stackaugmented neural architectures came from the pushdown automaton, an abstract machine capable of recognizing contextfree languages. Similar stackbased neural networks, however, have also been proposed by others (Pollack, 1991; Das et al., 1992; Joulin and Mikolov, 2015; Grefenstette et al., 2015). Our models differ from them in their theoretical simplicity and empirical success. Our BabyNTM, on the other hand, can be considered as a simplification of the original NTM architecture (Graves et al., 2014): As opposed to using softattention mechanisms to read and write to the external memory, we make deterministic decisions and always read content from and write to the first entry of the memory, thereby making the learning process easier while retaining universal expressivity.
Notation
We will assume the following notation:

•
$x={x}_{1},\mathrm{\dots},{x}_{T}$: The input sequence of onehot vectors, with the $i$th token ${x}_{i}$.

•
${y}_{i}$: The output associated with ${x}_{i}$.

•
$\mathbf{W}$: The learnable weights of the model.

•
$\mathbf{b}$: The learnable bias terms of the model.

•
${\mathbf{h}}_{i}$: The $i$th hidden state representation.

•
$D$: The dim. of the input and output samples.

•
$H$: The dim. of the hidden state of the model.

•
$M$: The dim. of the external stack/memory.
3.1 StackRNN
Before we begin describing our StackRNN, recall the formulation of a standard RNN:
${\mathbf{h}}_{t}$  $=\text{tanh}({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{b}}_{ih}+{\mathbf{W}}_{hh}{\mathbf{h}}_{(t1)}+{\mathbf{b}}_{hh})$  
${y}_{t}$  $=f({\mathbf{W}}_{y}{\mathbf{h}}_{t})$ 
where ${x}_{t}\in {\mathbb{R}}^{D}$ is the input, ${\mathbf{h}}_{t}\in {\mathbb{R}}^{H}$ the hidden state, ${y}_{t}\in {\mathbb{R}}^{D}$ the output at time $t$, ${\mathbf{W}}_{y}\in {\mathbb{R}}^{D\times H}$ the linear output layer, and $f$ a transformation.
While designing our StackRNN, we come across two important questions: (i) Where and how should we place the stack in the neural network, and (ii) how should we design the stack so that we can backpropagate errors through the stack at the time of training? Regarding (i), we place the stack in such a way that it interacts with the hidden layers at each time step. The benefit of this approach is that errors made in future stages of the model affect and backpropagate through not only the hidden states but also the stack states. Regarding (ii), we construct a differentiable stack structure. Figure 1 provides a visualization of the StackRNN. Its formulation is:
${\stackrel{~}{\mathbf{h}}}_{(t1)}$  $={\mathbf{h}}_{(t1)}+{\mathbf{W}}_{sh}{s}_{(t1)}^{(0)}$  
${\mathbf{h}}_{t}$  $=\text{tanh}({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{b}}_{ih}+{\mathbf{W}}_{hh}{\stackrel{~}{\mathbf{h}}}_{(t1)}+{\mathbf{b}}_{hh})$  
${y}_{t}$  $=\sigma ({\mathbf{W}}_{y}{\mathbf{h}}_{t})$  
${a}_{t}$  $=\text{softmax}({\mathbf{W}}_{a}{\mathbf{h}}_{t})$  
${n}_{t}$  $=\sigma ({\mathbf{W}}_{n}{\mathbf{h}}_{t})$  
${s}_{t}^{(0)}$  $={a}_{t}^{(0)}{n}_{t}+{a}_{t}^{(1)}{s}_{(t1)}^{(1)}$  
${s}_{t}^{(i)}$  $={a}_{t}^{(0)}{s}_{(t1)}^{(i1)}+{a}_{t}^{(1)}{s}_{(t1)}^{(i+1)}$ 
where ${s}_{t}={s}_{t}^{(0)}{s}_{t}^{(1)}\mathrm{\cdots}{s}_{t}^{(k)}$ is the stack configuration at time step $t$, with ${s}_{t}^{(0)}$ the topmost stack element; ${\mathbf{W}}_{sh}\in {\mathbb{R}}^{H\times M}$, ${\mathbf{W}}_{y}\in {\mathbb{R}}^{D\times H}$, ${\mathbf{W}}_{a}\in {\mathbb{R}}^{2\times H}$, and ${\mathbf{W}}_{n}\in {\mathbb{R}}^{M\times H}$ are all learnable linear weights of the model.
At each time step, we combine the topmost stack element ${s}_{(t1)}^{(0)}$ with the previous hidden state ${\mathbf{h}}_{(t1)}$ via a linear mapping to produce an intermediate hidden state ${\stackrel{~}{\mathbf{h}}}_{(t1)}$. We then use ${\stackrel{~}{\mathbf{h}}}_{(t1)}$, together with the input, to generate the current hidden state ${\mathbf{h}}_{t}$, from which both the output at that time step ${y}_{t}$ and the weights of the PUSH (${a}_{t}^{(0)}$) and POP (${a}_{t}^{(1)}$) operations by the stack controller are determined simultaneously. Here ${a}_{t}\in {\mathbb{R}}^{2}$ is a probability distribution over the two operations. Finally, we update the stack elements in such a way that the elements become the weighted linear interpolation of both possible stack operations. We can, therefore, consider the elements in the stack as variables in superposition states.
We highlight the following differences between our StackRNN and the StackRNN by Joulin and Mikolov (2015), as further explicated in the appendix. First, their model does not contain the term ${\stackrel{~}{\mathbf{h}}}_{(t1)}$ and it updates ${\mathbf{h}}_{t}$ as follows:
${\mathbf{h}}_{t}$  $=\sigma ({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{W}}_{hh}{\mathbf{h}}_{(t1)}+{\mathbf{W}}_{sh}{s}_{(t1)}^{(0:k)})$ 
where ${\mathbf{W}}_{sh}\in {\mathbb{R}}^{H\times k}$ and ${s}_{(t1)}^{(0:k)}$ the $k$topmost elements of the stack at time $t1$. But a simple analysis of our StackRNN formulation divulges that ${s}_{(t1)}^{(0)}$ depends on both ${\mathbf{W}}_{hh}$ and ${\mathbf{W}}_{sh}$ in our formulation, whereas it only depends on ${\mathbf{W}}_{sh}$ in Joulin and Mikolov’s formulation. Furthermore, their architecture takes the sigmoid of the linear combination of ${x}_{t}$, ${\mathbf{h}}_{(t1)}$, and ${s}_{(t1)}^{(0:k)}$, in addition to excluding the bias terms, to update ${\mathbf{h}}_{t}$.
3.2 StackLSTM
The StackLSTM is similar to the StackRNN but contains additional components of the standard LSTM architecture by Hochreiter and Schmidhuber (1997). In this model, we update the hidden state of the model according to the standard LSTM equations, that is ${\mathbf{h}}_{t}=\text{LSTM}({x}_{t},{\stackrel{~}{\mathbf{h}}}_{(t1)})$.
3.3 BabyNTM
The BabyNTM is both an extension of the StackRNN and a simplification of the original NTM. While the StackRNN contains an unbounded stack mechanism, it can perform only two basic operations on the stack, namely the PUSH and POP actions. In the BabyNTM architecture, we fix the size of the external memory but provide more freedom to the model: While the interaction between the controller and the memory in the design of the BabyNTM is mostly similar to that of the StackRNN, we allow five operations on the memory at each time step to update its contents: ROTATERIGHT, ROTATELEFT, NOOP, POPLEFT, and POPRIGHT. Suppose that the current memory configuration $\mathbf{M}$ is $[a,b,c,d,e]$, where ${\mathbf{M}}^{(i)}\in \mathbb{R}$. Then the operations produce the following configurations at the next time step:
ROTATERIGHT  $:[e,a,b,c,d].$  
ROTATELEFT  $:[b,c,d,e,a].$  
NOOP  $:[a,b,c,d,e].$  
POPRIGHT  $:[0,a,b,c,d].$  
POPLEFT  $:[b,c,d,e,0].$ 
If we think of the memory as a set $\mathbf{M}$ sitting on an $n$dimensional Euclidean space ${\mathbb{R}}^{n}$, we can then think of these operations as $n\times n$ matrices. From an algebraic point of view, we can realize these actions on the memory as leftmonoid actions on a set, since matrix multiplication is associative and the matrix corresponding to the operation NOOP serves the role of an identity element in our computations. Below is the formulation of the BabyNTM architecture:
${\stackrel{~}{\mathbf{h}}}_{(t1)}$  $={\mathbf{h}}_{(t1)}+{\mathbf{W}}_{m}{\mathbf{M}}_{(t1)}^{(0)}$  
${\mathbf{h}}_{t}$  $=\text{tanh}({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{b}}_{ih}+{\mathbf{W}}_{hh}{\stackrel{~}{\mathbf{h}}}_{(t1)}+{\mathbf{b}}_{hh})$  
${y}_{t}$  $=\sigma ({\mathbf{W}}_{y}{\mathbf{h}}_{t})$  
${a}_{t}$  $=\text{softmax}({\mathbf{W}}_{a}{\mathbf{h}}_{t})$  
${n}_{t}$  $=\sigma ({\mathbf{W}}_{n}{\mathbf{h}}_{t})$  
${\mathbf{M}}_{t}$  $={\displaystyle \sum _{i=1}^{N}}{a}_{t}^{(i)}\cdot \left[{\mathrm{\mathbf{O}\mathbf{P}}}^{(i)}\right]{\mathbf{M}}_{t1}$  
${\mathbf{M}}_{t}^{(0)}$  $={\mathbf{M}}_{t}^{(0)}+{n}_{t}$ 
where ${\mathbf{M}}_{t}$ denotes the memory configuration at time step $t$, ${n}_{t}$ the value of the element to be inserted to the first entry of the memory at time step $t$, ${\mathrm{\mathbf{O}\mathbf{P}}}^{(i)}$ the matrix corresponding to the $i$th action on the memory and ${a}_{t}^{(i)}$ the weight of that action at time $t$, and all $\mathbf{W}$’s learnable matrices of the model.^{6}^{6} 6 As before, the memory here can be considered as the linear superposition of the results of all the memory operations.
3.4 Softmax Functions
The softmax function in the calculation of ${a}_{t}$ in all these models enables us to map the values of the vector ${\mathbf{W}}_{a}{\mathbf{h}}_{t}$ to a categorical probability distribution. We investigate the effect of more deterministic decisions about the stack/memory operations on the robustness of our model. A natural approach is to employ a softmax function with varying temperature $\tau $:
$\text{softmaxtemp}({x}_{i},\tau )={\displaystyle \frac{\mathrm{exp}({x}_{i}/\tau )}{{\sum}_{j=1}^{N}\mathrm{exp}({x}_{j}/\tau )}}$ 
The softmaxtemp function behaves exactly like the standard softmax function when the temperature value $\tau $ equals $1$. As the temperature increases, softmaxtemp produces more uniform categorical class probabilities, whereas as the temperature decreases, the function outputs more discrete probabilities, like a onehot encoding.
Sample  $([])$  $abc\mathrm{\#}zyx$  

Input  $($  $[$  $]$  $)$  $a$  $b$  $c$  $\mathrm{\#}$  $z$  $y$  $x$  
Output  $(/[/)$  $(/[/]$  $(/[/)$  $(/[$  $a/b/c/\mathrm{\#}$  $a/b/c/\mathrm{\#}$  $a/b/c/\mathrm{\#}$  $z$  $y$  $x$  $\u22a3$ 
Furthermore, Jang et al. (2016) proposed an efficient and differentiable approximation to sampling from a discrete categorical distribution using a reparameterization trick:
$\text{Gumbelsoftmaxtemp}({x}_{i},\{{g}_{1},\mathrm{\dots},{g}_{N}\},\tau )$  
$={\displaystyle \frac{\mathrm{exp}((\mathrm{log}{x}_{i}+{g}_{i})/\tau )}{{\sum}_{j=1}^{N}\mathrm{exp}((\mathrm{log}{x}_{j}+{g}_{j})/\tau )}}$ 
where ${g}_{i}\stackrel{\text{i.i.d.}}{\sim}\text{Gumbel}(0,1)$. As an alternative to the softmax function with varying temperature, one might want to use the Gumbelsoftmax sampling method. In cases where we have more than two operations on the stack/memory, it might be tempting to prefer the Gumbelsoftmax sampling approach for the calculation of ${a}_{t}$ values. We experiment with these alternatives below.
4 Experimental Setup
To evaluate the performance of the MARNNs, we conducted experiments on six computational tasks whose solutions require the formation of stack structures. In all the experiments, we used both standard and memoryaugmented RNNs to explore the differences in their performances, in addition to the StackRNN model by Joulin and Mikolov (2015), and repeated each experiment $10$ times. Furthermore, we aimed to investigate softmax functions with varying temperature in our MARNNs and thus employed $12$ models with different configurations—two vanilla recurrent models, three MARNNs with three different softmax functions, and one StackRNN by Joulin and Mikolov (2015)—for the six tasks.
4.1 The Sequence Prediction Task
Following Gers and Schmidhuber (2001), we trained the networks as follows: At each time step, we presented one input symbol to the network and then asked the model to predict the set of next possible symbols in the language, based on the current symbol, the prior hidden states, and the stack. We used a onehot representation to encode the inputs and a $k$hot representation to encode the outputs. Table 1 provides example inputoutput pairs for two of the experiments.
In all the experiments, the objective was to minimize the meansquared error of the sequence predictions. We used an output threshold criterion of $0.5$ for the sigmoid layer (${y}_{t}=\sigma (\cdot )$) to indicate which symbols were predicted by the model. Finally, we turned this sequence prediction task into a sequence classification task by accepting a sequence if the model predicted all of its output values correctly and rejecting it otherwise.
4.2 Training Details
In contrast to the models in Joulin and Mikolov (2015); Grefenstette et al. (2015); Hao et al. (2018); Yu et al. (2019), our architectures are economical: Unless otherwise stated, the models are all singlelayer networks with $8$ hidden units. In all the experiments, the entries of the memory were set to be onedimensional, while the size of the memory in the BabyNTMs was fixed to $104$ (since the length of the longest sequence in all the tasks was $100$). We used the Adam optimizer (Kingma and Ba, 2014) and trained our models for three epochs.
5 Learning the Dyck Languages
As described in the introduction, the Dyck languages ${\mathcal{D}}_{>1}$ provide an ideal testbed for exploring the ability of recurrent neural networks to capture the core properties of the contextfree languages, their hierarchical modeling ability. None of the previous studies were able to learn the Dyck languages, with the exception of ${\mathcal{D}}_{1}$ (which can be captured using a simple onecounter machine). The main motivation of this paper was to introduce new neural architectures that could recognize ${\mathcal{D}}_{2}$ and other difficult contextfree languages.
Training Set  Test Set  
Models  Min  Max  Med  Mean  Min  Max  Med  Mean 
Vanilla RNN  3.32  12.78  6.41  7.11  0  0  0  0 
Vanilla LSTM  36.16  62.80  53.24  52.38  0.28  4.10  1.02  1.39 
StackRNN by J&M (2015)  0  100  100  70.50  0  100  100  70.00 
StackRNN+Softmax  100  100  100  100  99.96  100  100  99.99 
StackRNN+SoftmaxTemp  100  100  100  100  99.92  100  100  99.99 
StackRNN+GumbelSoftmax  3.44  100  99.98  90.32  0  100  99.96  89.96 
StackLSTM+Softmax  62.52  100  100  95.69  2.78  100  98.25  87.51 
StackLSTM+SoftmaxTemp  46.70  100  100  94.67  0.80  100  99.73  89.84 
StackLSTM+GumbelSoftmax  50.26  100  99.94  94.97  0.70  99.94  99.33  88.68 
BabyNTM+Softmax  2.56  100  100  75.80  0  100  99.91  68.73 
BabyNTM+SoftmaxTemp  1.16  100  99.88  72.43  0  100  96.97  68.23 
BabyNTM+GumbelSoftmax  5.66  100  99.88  89.39  0  99.90  99.54  86.85 
5.1 The ${\mathcal{D}}_{2}$ Language
We trained the StackRNN, StackLSTM, and BabyNTM architectures with slightly different memorycontroller configurations, in addition to standard RNNs, to learn ${\mathcal{D}}_{2}$.
A probabilistic contextfree grammar for ${\mathcal{D}}_{2}$ can be written as follows:
$S\to \{\begin{array}{cc}(S)\hfill & \text{with probability}\frac{p}{2}\hfill \\ \hfill & \text{with probability}\frac{p}{2}\hfill \\ SS\hfill & \text{with probability}q\hfill \\ \epsilon \hfill & \text{with probability}1(p+q)\hfill \end{array}$ 
where $$ and $$.
Setting $p=\frac{1}{2}$ and $q=\frac{1}{4}$, we generated $5000$ distinct Dyck words, whose lengths were bounded to $[2,50]$, for the training sets. Similarly, we generated $5000$ distinct words whose lengths were bounded to $[52,100]$ for the test sets. Hence, there was no overlap between the training and test sets. Test set performance requires generalization well past the training set lengths. As it can be seen in the length and maximum depth distributions of the training and test sets for one of the ${\mathcal{D}}_{2}$ experiments in Figure 2, the test samples contained longer dependencies than the training sample.
Table 2 lists the performances of the vanilla and memoryaugmented recurrent models on the training and test sets for the Dyck language. Our empirical results highlight the dramatic performance difference between the memoryaugmented recurrent networks and vanilla recurrent networks: We note that almost all our stack/memoryaugmented architectures achieved full accuracy on the test set, which contained longer and deeper sequences than the training set, while the vanilla RNNs and LSTMs failed to generalize with below $5\%$ accuracy. We further observe that the StackRNN proposed by Joulin and Mikolov (2015) performed nearly as well as our models, though ours performed better than theirs on average.
When evaluated based on their empirical median and mean percentwise performances, the StackRNNs appear to be slightly more successful than the StackLSTMs and the BabyNTMs. Both the StackRNN+Softmax and StackRNN+SoftmaxTemp obtained perfect accuracy on the test sets $8$ out of $10$ times, whereas the best StackLSTM variant, StackLSTM+SoftmaxTemp, achieved perfect accuracy only $3$ out of $10$ times. Nevertheless, we acknowledge that most of our stackaugmented models were able to successfully generalize well beyond the training data. ^{7}^{7} 7 We additionally note that, contrary to our initial expectation, using a softmax activation function with varying temperature did not improve the performance of our memoryaugmented neural models in general. However, the networks might actually benefit from temperaturebased softmax functions in the presence of more categorical choices, because currently the models have only a very limited number of memory operations.
Training Set  Test Set  
Models  Min  Max  Med  Mean  Min  Max  Med  Mean 
Vanilla RNN  0.82  14.88  11.19  9.52  0  0  0  0 
Vanilla LSTM  24.16  39.76  31.55  32.58  0  0.16  0.02  0.04 
StackRNN by J&M (2015)  9.02  100  98.17  79.32  0  100  91.29  66.72 
StackRNN+Softmax  7.80  100  100  81.75  0  100  100  80.00 
StackRNN+SoftmaxTemp  37.64  99.98  95.74  81.95  0.06  98.18  67.32  52.49 
StackRNN+GumbelSoftmax  1.78  100  44.55  50.71  0  99.98  21.94  43.65 
StackLSTM+Softmax  33.98  100  92.25  77.97  0.04  99.94  61.54  55.49 
StackLSTM+SoftmaxTemp  37.64  99.98  95.74  81.95  0.06  98.18  67.32  52.49 
StackLSTM+GumbelSoftmax  25.74  99.98  78.21  72.01  0  99.2  27.08  42.17 
BabyNTM+Softmax  4.60  100  84.29  60.63  0  100  23.44  44.51 
BabyNTM+SoftmaxTemp  6.40  100  16.44  39.97  0  100  0.51  27.46 
BabyNTM+GumbelSoftmax  0.76  100  11.70  43.42  0  99.9  0  38.76 
Figure 3 provides a visualization of the strengths of the memory operations and the change in the values of the entries of the memory component of one of our memoryaugmented models (a BabyNTM+Softmax with $8$ hidden units) at each time step when the model was presented a sample in ${\mathcal{D}}_{2}$. The BabyNTM appears to be using its ROTATERIGHT and POPRIGHT operations for the open parentheses ‘$($’ and ‘$[$’, respectively, and POPLEFT operation for both of the closing parentheses ‘$)$’ and ‘$]$’, thereby emulating a simple PDAlike mechanism. A careful inspection of the memory entries of the BabyNTM indicates that the model utilizes a special marker with a distinct value ($\sim 0.45$ in our example) to distinguish an empty stack configuration from a processed stack configuration. On the other hand, the memory alone does not dictate the output values: The hidden states of the model govern the overall behavior and embody a finitestate control, as shown in the formulation of the BabyNTM.^{8}^{8} 8 The visualizations for the other memoryaugmented models were qualitatively similar, though some networks learned more complex representations. We further emphasize that the dimensions of the external stack and memory entries in our MARNNs were set up to be onedimensional for visualization purposes, but we additionally experimented with higher dimensional memory structures and observed that such additions often increased the overall performances of the models, especially the performance of the BabyNTMs.
Training Set  Test Set  
Models  Min  Max  Med  Mean  Min  Max  Med  Mean 
Vanilla RNN  21.19  25.71  23.53  23.39  0  0.02  0  0 
Vanilla LSTM  32.47  41.62  37.35  37.05  0  0.06  0  0.01 
StackRNN by J&M (2015)  99.47  100  100  99.94  97.60  100  99.99  99.70 
StackRNN+Softmax  99.92  100  100  99.99  99.32  100  99.99  99.85 
StackRNN+SoftmaxTemp  36.83  100  98.54  80.09  0  100  78.44  60.88 
StackRNN+GumbelSoftmax  20.90  99.98  99.93  91.62  0  99.92  99.50  87.69 
StackLSTM+Softmax  98.48  100  99.99  99.79  91.12  100  99.18  98.23 
StackLSTM+SoftmaxTemp  36.83  100  98.54  80.09  0  100  78.44  60.88 
StackLSTM+GumbelSoftmax  36.20  99.94  67.50  68.62  0  99.90  24.61  44.47 
BabyNTM+Softmax  99.94  100  100  99.99  99.00  100  99.97  99.87 
BabyNTM+SoftmaxTemp  86.12  100  100  98.15  8.56  100  99.91  88.40 
BabyNTM+GumbelSoftmax  22.56  99.98  99.86  75.46  0  99.86  99.23  63.49 
5.2 The ${\mathcal{D}}_{3}$ and ${\mathcal{D}}_{6}$ Languages
We further conducted experiments on the ${\mathcal{D}}_{3}$ and ${\mathcal{D}}_{6}$ languages to evaluate the ability of our memoryaugmented architectures to encode more complex hierarchical representations. The training and test corpora were generated in the same style as the previous task; however, we included $15,000$ samples in the training set for ${\mathcal{D}}_{6}$, due to its complexity.
As shown in Table 3, the StackRNN+Softmax model had the best performance among all the neural networks on the ${\mathcal{D}}_{3}$ learning task, obtaining perfect accuracy in eight out of ten trials. Following the StackRNNs, the StackLSTMs and BabyNTMs, on average, achieved around $50\%$ and $37\%$ accuracy on the test set, respectively. On the other hand, the StackRNN by Joulin and Mikolov (2015) could generalize better than most of our models in terms of its median and mean scores, albeit still not better than our StackRNN.
Figure 4 illustrates the behavior of one of our BabyNTMs as the model is presented a long sequence in ${\mathcal{D}}_{3}$. It is remarkable to witness how the memoryaugmented model makes use of its external memory to learn a sequence of actions to recognize a sample in ${\mathcal{D}}_{3}$. Similar to the behavior of the previous model in Figure 3, the RNN controller of the BabyNTM model in this instance appears to be using the differentiable memory as a stacklike structure and inserting distinct values to the memory at different time steps. Furthermore, we note the presence of special markers ($0.22$ in the first half and $0.21$ in the second half – both colored blue) in the memory: These idiosyncratic memory elements marking the bottom of the used portion of the stack enable the model to know when to predict only the set of open parentheses.
In contrast, the overall performance of the MARNNs for ${\mathcal{D}}_{6}$ was much lower than for ${\mathcal{D}}_{2}$ and ${\mathcal{D}}_{3}$. For instance, none of our models could obtain full accuracy on the training or test sets; the maximum score our models could achieve was $60.38\%$. We wondered whether increasing the dimension of the memory would remedy the problem. Table 4 summarizes our new results with the same architectures containing $12$ hidden units and $5$dimensional augmented stack/memory. We saw a significant increase in the performance of our models: In $60$ out of $90$ trials, our enhanced MARNNs achieved almost perfect ($\ge 99\%$) accuracy on the test set.
6 Learning Palindrome Languages
Our previous results established that the MARNNs can learn Dyck languages, which represent the core of the CFL class. We note that the Dyck languages incorporate a notion of palindrome: The intersection of ${\mathcal{D}}_{n}$ with ${p}^{*}{\overline{p}}^{*}$ leads to a definition of a specific type of a homomorphic palindrome language $w\phi ({w}^{R})$, where ${}^{*}$ is the Kleene star, $\phi $ a homomorphism given by ${p}_{i}\mapsto {\overline{p}}_{i}$, and ${w}^{R}$ the reversal of $w$. Therefore, we would expect our models to be able to learn various deterministic versions of palindrome languages.
Training Set  Test Set  

Models  Min  Max  Med  Mean  Min  Max  Med  Mean 
Vanilla RNN  0  0  0  0  0  0  0  0 
Vanilla LSTM  0  5.22  2.23  2.39  0  0  0  0 
StackRNN by J&M (2015)  0  100  46.04  49.13  0  100  50.42  50.00 
StackRNN+Softmax  0  100  99.99  60.00  0  100  100  60.00 
StackRNN+SoftmaxTemp  0  100  100  70.00  0  100  100  70.00 
StackRNN+GumbelSoftmax  0  100  17.10  43.42  0  100  16.98  43.39 
StackLSTM+Softmax  0  100  100  61.36  0  100  100  60.00 
StackLSTM+SoftmaxTemp  0  100  100  70.07  0  100  100  70.00 
StackLSTM+GumbelSoftmax  0  100  100  80.20  0  100  99.98  79.99 
BabyNTM+Softmax  0  100  67.16  53.43  0  100  66.55  53.31 
BabyNTM+SoftmaxTemp  0  100  99.99  60.00  0  100  100  60.00 
BabyNTM+GumbelSoftmax  0  100  60.43  52.09  0  100  61.30  52.26 
6.1 Homomorphic Palindrome Language
Our first target of exploration is the deterministic homomorphic palindrome language, the language of words $w\mathrm{\#}\phi ({w}^{R})$ where $w\in {\{a,b,c\}}^{*}$, $\mathrm{\#}$ is a symbol serving to mark the center of the palindrome, and $\phi $ maps $a$ to $x$, $b$ to $y$, and $c$ to $z$. We use the notion of recognition from the previous section, predicting at each symbol the set of all possible following symbols. Viewed as a transduction, this amounts to the following task:
$w\mathrm{\#}\phi ({w}^{R})$  $\Rightarrow {(a/b/c/\mathrm{\#})}^{w}\phi ({w}^{R})\u22a3$ 
The training set for this task contained $5000$ unique samples of length varying from $2$ to $50$, and the test set contained $5000$ unique samples of length varying from $52$ to $100$. We remark that there was no overlap between the training and test sets, just as in the case of the Dyck language tasks.
Table 5 lists the performances of the vanilla and memoryaugmented models on the deterministic homomorphic palindrome language. We highlight the success of our models once again: While our MARNNs often performed with perfect accuracy on the training and test sets, the standard recurrent models could not predict even one sample in the test set correctly. Overall, most of the variants of the StackRNN/LSTM and BabyNTM models seem to have learned how to emulate pushdown automata: They learned to push certain values into their memory or stack whenever they read a character from the $\{a,b,c\}$ alphabet and then started popping them one by one after they would see $\mathrm{\#}$, and at the last step, the model predicted the end of the sequence token $\u22a3$. The models did not perform equally well though: When evaluated on their mean and median percentwise performances, for instance, the StackLSTMs were found to generalize better than the StackRNNs and the BabyNTMs in this task. Further, it is hard to make a conclusive statement about whether employing a softmax function with varying temperature in our MARNNs had any benefit. Nevertheless, the StackLSTMs+GumbelSoftmax performed slightly better than the other models in terms of their mean percentages on the test sets.
Training Set  Test Set  
Models  Min  Max  Med  Mean  Min  Max  Med  Mean 
Vanilla RNN  0.06  0.90  0.46  0.46  0  0  0  0 
Vanilla LSTM  0.68  5.08  3.62  3.50  0  0  0  0 
StackRNN by J&M (2015)  0.16  100  50.29  50.19  0  100  50.00  50.00 
StackRNN+Softmax  0.38  100  100  77.39  0  100  100  76.65 
StackRNN+SoftmaxTemp  0.14  100  100  80.05  0  100  100  80.00 
StackRNN+GumbelSoftmax  0.18  100  99.98  77.71  0  100  99.98  77.83 
StackLSTM+Softmax  2.02  100  100  80.60  0  100  100  79.99 
StackLSTM+SoftmaxTemp  0.06  100  100  70.49  0  100  100  70.00 
StackLSTM+GumbelSoftmax  2.18  100  100  80.48  0  100  100  80.00 
BabyNTM+Softmax  0.08  100  100  86.65  0  100  100  86.67 
BabyNTM+SoftmaxTemp  0  100  100  70.07  0  100  100  70.00 
BabyNTM+GumbelSoftmax  0.20  100  100  90.01  0  100  99.97  89.96 
6.2 Simple Palindrome Language
Taking the homomorphism $\phi $ to be the identity map in the previous language, it is reasonable to expect the models to learn the $w\mathrm{\#}{w}^{R}$ palindrome language. We evaluated recognition of this language once again as a possiblenextsymbol prediction task, which can be viewed as the following sequence transduction task:
$w\mathrm{\#}{w}^{R}$  $\Rightarrow {(a/b/c/\mathrm{\#})}^{w}{w}^{R}\u22a3$ 
Surprisingly, all of our MARNN models had difficulty learning this language. Only in three of $90$ trials were our MARNNs able to learn the language; other times, the models typically obtained $0\%$ accuracy during testing. When we increased the dimensionality of the memory to five, however, our MARNNs immediately learned the task with almost full accuracy again. Given that the only difference between the previous task and this task is the second half of the strings, we conjectured that our models were getting confused in the second half: Because of vocabulary overlap in the two halves, the models might be using information from the second half when predicting the set of possible symbols in the second half, thereby getting distracted and finding themselves stuck at bad local minima. To verify our hypothesis, we thus performed one more task, string reversal, which we describe in the following section.
7 Learning the StringReversal Task
In the previous section, we witnessed a strange phenomenon: Our MARNN models with $8$ hidden units and onedimensional external memory could learn the deterministic homomorphic palindrome language, but not the simple palindrome language. Since the only difference between the two tasks is the existence of a nontrivial isomorphism $\phi $ (and the vocabulary overlap in the two halves), we wanted to perform the string reversal task under a sequence transduction setting in which the reversal appears only in the output:
$w{\mathrm{\#}}^{w}$  $\mapsto {\mathrm{\#}}^{w}{w}^{R}$ 
The training and test sets were similar to the previous cases: $5000$ samples each, with lengths bounded by $[2,50]$ and $[52,100]$, respectively.
Table 6 illustrates that most of our MARNNs achieved perfect accuracy on the test sets in this version of the string reversal task. The results corroborated our conjecture by showing that when the second half of the input samples contained symbols from an alphabet other than the one used in the first half (in this case, the $\mathrm{\#}$ symbol), the memoryaugmented models do not get confused and act in the desired way (pushing elements into the stack in the first half and popping them one by one in the second half after seeing the marker $\mathrm{\#}$). When we visualized the hidden states and the memory entries of the models for this task, we observed that our MARNNs learned to emulate simple pushdownautomata.
8 Conclusion
In this paper, we introduced three memoryaugmented neural architectures and provided the first demonstration of neural networks learning to recognize the generalized Dyck languages, which represent the “core” of the contextfree language class. We further evaluated the learning capabilities of our models on recognizing the deterministic homomorphic palindrome language and simple palindrome language under the sequence prediction framework and performing the stringreversal task under a sequence transduction setting. In all the experiments, our MARNNs outperformed the vanilla RNN and LSTM models and often attained perfect accuracy on both the training and test sets.
Since we limited the dimensionality of the external memory in our memoryaugmented architectures to one, we were also able to visualize the changes in the external memory of the BabyNTMs trained to learn the ${\mathcal{D}}_{2}$ and ${\mathcal{D}}_{3}$ languages. Our simple analysis revealed that our MARNNs learned to emulate pushdownautomata to recognize these Dyck languages. Hao et al. (2018) mention that their NeuralStack models could not perfectly employ stackbased strategies to learn an appropriate representation to recognize the ${\mathcal{D}}_{2}$ language, and further address the difficulty of training stackaugmented recurrent networks. Although we agree that it is challenging to train MARNNs due to various optimization issues, one can still train these models with as few as eight or twelve hidden units to learn the Dyck languages, and our empirical findings support this claim.
9 Acknowledgment
The authors appreciate the helpful comments of Michael Hahn, Yoav Goldberg, Drew Pendergrass, Dan Stefan Eniceicu, and Filippos Sytilidis. M.S. gratefully acknowledges the support of the Harvard College Research Program (HCRP) and the Harvard Center for Research on Computation and Society Research Fellowship for Undergraduate Students. S.G. was supported by a Siebel Fellowship. Y.B. was supported by the Harvard Mind, Brain, and Behavior Initiative. The computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.
References
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
 Bernardy (2018) JeanPhilippe Bernardy. 2018. Can Recurrent Neural Networks Learn Nested Recursion? LiLT (Linguistic Issues in Language Technology), 16(1).
 Bodén and Wiles (2000) Mikael Bodén and Janet Wiles. 2000. ContextFree and ContextSensitive Dynamics in Recurrent Neural Networks. Connection Science, 12(34):197–210.
 Bodén et al. (1999) Mikael Bodén, Janet Wiles, Bradley Tonkes, and Alan Blair. 1999. Learning to Predict a ContextFree Language: Analysis of Dynamics in Recurrent Hidden Units.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 Chomsky (1957) Noam Chomsky. 1957. Syntactic Structures. Mouton, The Hague.
 Chomsky (1962) Noam Chomsky. 1962. ContextFree Grammars and Pushdown Storage. MIT Res. Lab. Electron. Quart. Prog. Report., 65:187–194.
 Chomsky and Schützenberger (1963) Noam Chomsky and Marcel P Schützenberger. 1963. The Algebraic Theory of ContextFree Languages. In Studies in Logic and the Foundations of Mathematics, volume 35, pages 118–161. Elsevier.
 Das et al. (1992) Sreerupa Das, C Lee Giles, and GuoZheng Sun. 1992. Learning Contextfree Grammars: Capabilities and Limitations of a Recurrent Neural Network with an External Stack Memory. In Proceedings of The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, page 14.
 Das et al. (1993) Sreerupa Das, C Lee Giles, and GuoZheng Sun. 1993. Using Prior Knowledge in a NNPDA to Learn ContextFree Languages. In Advances in neural information processing systems, pages 65–72.
 Deleu and Dureau (2016) Tristan Deleu and Joseph Dureau. 2016. Learning Operations on a Stack with Neural Turing Machines. arXiv preprint arXiv:1612.00827.
 Elman (1990) Jeffrey L Elman. 1990. Finding Structure in Time. Cognitive science, 14(2):179–211.
 Elman (1991) Jeffrey L Elman. 1991. Distributed Representations, Simple Recurrent Networks, and Grammatical Structure. Machine learning, 7(23):195–225.
 Gers et al. (2002) Felix A Gers, Juan Antonio PérezOrtiz, Douglas Eck, and Jürgen Schmidhuber. 2002. Learning Context Sensitive Languages with LSTM Trained with Kalman Filters. In International Conference on Artificial Neural Networks, pages 655–660. Springer.
 Gers and Schmidhuber (2001) Felix A Gers and E Schmidhuber. 2001. LSTM Recurrent Networks Learn Simple ContextFree and ContextSensitive Languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.
 Graves et al. (2013) Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE.
 Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. arXiv preprint arXiv:1410.5401.
 Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka GrabskaBarwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid Computing Using a Neural Network with Dynamic External Memory. Nature, 538(7626):471.
 Grefenstette et al. (2015) Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to Transduce with Unbounded Memory. In Advances in Neural Information Processing Systems, pages 1828–1836.
 Greibach (1973) Sheila A Greibach. 1973. The hardest contextfree language. SIAM Journal on Computing, 2(4):304–310.
 Gulcehre et al. (2018) Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2018. Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes. Neural computation, 30(4):857–884.
 Hahn (2019) Michael Hahn. 2019. Theoretical limitations of selfattention in neural sequence models. arXiv preprint arXiv:1906.06755.
 Hao et al. (2018) Yiding Hao, William Merrill, Dana Angluin, Robert Frank, Noah Amsel, Andrew Benz, and Simon Mendelsohn. 2018. ContextFree Transductions with Neural Stacks. arXiv preprint arXiv:1809.02836.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long ShortTerm Memory. Neural computation, 9(8):1735–1780.
 Hölldobler et al. (1997) Steffen Hölldobler, Yvonne Kalinke, and Helko Lehmann. 1997. Designing a Counter: Another Case Study of Dynamics and Activation Landscapes in Recurrent Networks. In Annual Conference on Artificial Intelligence, pages 313–324. Springer.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical Reparameterization with GumbelSoftmax. arXiv preprint arXiv:1611.01144.
 Joulin and Mikolov (2015) Armand Joulin and Tomas Mikolov. 2015. Inferring Algorithmic Patterns with Stackaugmented Recurrent Nets. In Advances in neural information processing systems, pages 190–198.
 Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kurach et al. (2015) Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. 2015. Neural RandomAccess Machines. arXiv preprint arXiv:1511.06392.
 Magniez et al. (2014) Frédéric Magniez, Claire Mathieu, and Ashwin Nayak. 2014. Recognizing wellparenthesized expressions in the streaming model. SIAM Journal on Computing, 43(6):1880–1905.
 Merrill (2019) William Merrill. 2019. Sequential neural networks as automata. arXiv preprint arXiv:1906.01615.
 Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent Neural Network Based Language Model. In Eleventh annual conference of the international speech communication association.
 Pollack (1991) Jordan B Pollack. 1991. The Induction of Dynamical Recognizers. In Connectionist Approaches to Language Learning, pages 123–148. Springer.
 Rodriguez (2001) Paul Rodriguez. 2001. Simple Recurrent Networks Learn ContextFree and ContextSensitive Languages by Counting. Neural computation, 13(9):2093–2118.
 Rodriguez and Wiles (1998) Paul Rodriguez and Janet Wiles. 1998. Recurrent Neural Networks can Learn to Implement SymbolSensitive Counting. In Advances in Neural Information Processing Systems, pages 87–93.
 Schmidhuber et al. (2002) Jürgen Schmidhuber, F Gers, and Douglas Eck. 2002. Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM. Neural Computation, 14(9):2039–2041.
 Sennhauser and Berwick (2018) Luzi Sennhauser and Robert Berwick. 2018. Evaluating the Ability of LSTMs to Learn ContextFree Grammars. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 115–124.
 Siegelmann and Sontag (1994) Hava T Siegelmann and Eduardo D Sontag. 1994. Analog Computation via Neural Networks. Theoretical Computer Science, 131(2):331–360.
 Siegelmann and Sontag (1995) Hava T Siegelmann and Eduardo D Sontag. 1995. On the Computational Power of Neural Nets. Journal of computer and system sciences, 50(1):132–150.
 Skachkova et al. (2018) Natalia Skachkova, Thomas Trost, and Dietrich Klakow. 2018. Closing Brackets with Recurrent Neural Networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 232–239.
 Steijvers (1996) Mark Steijvers. 1996. A Recurrent Network that Performs a ContextSensitive Prediction Task.
 Sudborough (1976) Ivan Hal Sudborough. 1976. On deterministic contextfree languages, multihead automata, and the power of an auxiliary pushdown store. In Proceedings of the eighth annual ACM symposium on Theory of computing, pages 141–148. ACM.
 Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In INTERSPEECH.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in neural information processing systems, pages 3104–3112.
 Suzgun et al. (2019a) Mirac Suzgun, Yonatan Belinkov, and Stuart M Shieber. 2019a. On Evaluating the Generalization of LSTM Models in Formal Languages. Proceedings of the Society for Computation in Linguistics (SCiL), pages 277–286.
 Suzgun et al. (2019b) Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, and Stuart Shieber. 2019b. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.
 Tonkes and Wiles (1997) Bradley Tonkes and Janet Wiles. 1997. Learning a ContextFree Task with a Recurrent Neural Network: An Analysis of Stability. In In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive Science Society. Citeseer.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
 Weiss et al. (2018) Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision rnns for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745.
 Yang (2016) Greg Yang. 2016. Lie Access Neural Turing Machine. arXiv preprint arXiv:1602.08671.
 Yu et al. (2019) Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. 2019. Learning the Dyck language with attentionbased Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 138–146, Florence, Italy. Association for Computational Linguistics.
 Zaremba et al. (2016) Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. 2016. Learning Simple Algorithms from Examples. In International Conference on Machine Learning, pages 421–429.
 Zaremba and Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement Learning Neural Turing Machines–Revised. arXiv preprint arXiv:1505.00521.
 Zeng et al. (1994) Zheng Zeng, Rodney M Goodman, and Padhraic Smyth. 1994. Discrete Recurrent Neural Networks for Grammatical Inference. IEEE Transactions on Neural Networks, 5(2):320–330.
Appendix A Comparison of StackRNN architectures
Recall the formulation of our StackRNN architecture in Section 3.1. We update ${\mathbf{h}}_{t}$, the hidden state at time $t$, as follows:
${\mathbf{h}}_{t}$  $=\text{tanh}({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{b}}_{ih}+{\mathbf{W}}_{hh}{\stackrel{~}{\mathbf{h}}}_{(t1)}+{\mathbf{b}}_{hh})$ 
where ${\stackrel{~}{\mathbf{h}}}_{(t1)}$ is defined to be:
${\stackrel{~}{\mathbf{h}}}_{(t1)}$  $={\mathbf{h}}_{(t1)}+{\mathbf{W}}_{sh}{s}_{(t1)}^{(0)}$ 
Rewriting the equation for ${\mathbf{h}}_{t}$, we realize that our formulation of StackRNN is almost equivalent to the StackRNN model by Joulin and Mikolov (2015):
${\mathbf{h}}_{t}$  $=\text{tanh}({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{b}}_{ih}+{\mathbf{W}}_{hh}{\stackrel{~}{\mathbf{h}}}_{(t1)}+{\mathbf{b}}_{hh})$  
$=\text{tanh}({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{b}}_{ih}+{\mathbf{W}}_{hh}({\mathbf{h}}_{(t1)}+{\mathbf{W}}_{sh}{s}_{(t1)}^{(0)})+{\mathbf{b}}_{hh})$  
$=\text{tanh}({\mathbf{W}}_{ih}{x}_{t}+{\mathbf{b}}_{ih}+{\mathbf{W}}_{hh}{\mathbf{h}}_{(t1)}+\underset{(*)}{\underset{\u23df}{{\mathbf{W}}_{hh}{\mathbf{W}}_{sh}}}{s}_{(t1)}^{(0)}+{\mathbf{b}}_{hh})$ 
In our StackRNN architecture, ${s}_{(t1)}^{(0)}$ depends on $(*)$, namely ${\mathbf{W}}_{hh}{\mathbf{W}}_{sh}$, whereas in Joulin and Mikolov’s StackRNN model, it only depends on ${\mathbf{W}}_{sh}$. Furthermore, we make use of $\mathrm{tanh}(\cdot )$, instead of $\sigma (\cdot )$, to achieve nonlinearity and include bias terms, namely ${\mathbf{b}}_{ih}$ and ${\mathbf{b}}_{hh}$, in our definition of ${\mathbf{h}}_{t}$.