Memory-Augmented Recurrent Neural Networks Can Learn Generalized Dyck Languages

  • 2019-11-08 15:33:51
  • Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, Stuart M. Shieber
  • 2

Abstract

We introduce three memory-augmented Recurrent Neural Networks (MARNNs) andexplore their capabilities on a series of simple language modeling tasks whosesolutions require stack-based mechanisms. We provide the first demonstration ofneural networks recognizing the generalized Dyck languages, which express thecore of what it means to be a language with hierarchical structure. Ourmemory-augmented architectures are easy to train in an end-to-end fashion andcan learn the Dyck languages over as many as six parenthesis-pairs, in additionto two deterministic palindrome languages and the string-reversal transductiontask, by emulating pushdown automata. Our experiments highlight the increasedmodeling capacity of memory-augmented models over simple RNNs, while inflectingour understanding of the limitations of these models.

 

Quick Read (beta)

Memory-Augmented Recurrent Neural Networks
Can Learn Generalized Dyck Languages

Mirac Suzgun1 Sebastian Gehrmann1 Yonatan Belinkov12 Stuart M. Shieber1

1 Harvard John A. Paulson School of Engineering and Applied Sciences
2 MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, MA, USA
{[email protected],{gehrmann,belinkov,shieber}@seas}.harvard.edu
Abstract

We introduce three memory-augmented Recurrent Neural Networks (MARNNs) and explore their capabilities on a series of simple language modeling tasks whose solutions require stack-based mechanisms. We provide the first demonstration of neural networks recognizing the generalized Dyck languages, which express the core of what it means to be a language with hierarchical structure. Our memory-augmented architectures are easy to train in an end-to-end fashion and can learn the Dyck languages over as many as six parenthesis-pairs, in addition to two deterministic palindrome languages and the string-reversal transduction task, by emulating pushdown automata. Our experiments highlight the increased modeling capacity of memory-augmented models over simple RNNs, while inflecting our understanding of the limitations of these models.

Memory-Augmented Recurrent Neural Networks
Can Learn Generalized Dyck Languages


Mirac Suzgun1 Sebastian Gehrmann1 Yonatan Belinkov12 Stuart M. Shieber1 1 Harvard John A. Paulson School of Engineering and Applied Sciences 2 MIT Computer Science and Artificial Intelligence Laboratory Cambridge, MA, USA {[email protected],{gehrmann,belinkov,shieber}@seas}.harvard.edu

1 Introduction

Recurrent Neural Networks (RNNs) have proven to be an effective and powerful model choice for capturing long-distance dependencies and complex representations in sequential tasks, such as language modeling (Mikolov et al., 2010; Sundermeyer et al., 2012), machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014), and speech recognition (Graves et al., 2013). In theory, RNNs with rational state weights and infinite numeric precision are known to be computationally universal models (Siegelmann and Sontag, 1994, 1995). Yet, in practice, the computational power of RNNs with finite numeric precision is still unknown. Hence, the classes of languages that can be learned, empirically or theoretically, by RNNs with finite numeric precision are still to be discovered.

A natural question arises, then, as to what extent RNN models can learn languages that exemplify important formal properties found in natural languages, such properties as long-distance dependencies, counting, hierarchy, and repetition.

Along these lines, Gers and Schmidhuber (2001); Weiss et al. (2018); Suzgun et al. (2019a) have demonstrated that Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber (1997)), a popular variant of RNNs, can develop counting mechanisms to recognize simple strictly context-free and context-sensitive languages, such as anbn and anbncn, as evidenced by analysis of the hidden state values.11 1 From an automata-theoretic perspective, such languages are expressible with simple one-turn counter machines. By contrast, Weiss et al. have shown that Gated Recurrent Units (GRUs; Cho et al. (2014)), another popular variant of RNNs, cannot perform this type of counting and provided an explanation for some of the difference in performance between LSTMs and GRUs.

Merrill (2019) studied the theoretical expressiveness of various real-time neural networks with finite precision under asymptotic conditions, showing that RNNs and GRUs can capture regular languages whereas LSTMs can further recognize a subset of real-time counter languages. And empirically, Suzgun et al. (2019b) demonstrated that LSTM networks can learn to perform dynamic counting, as exemplified by the well-balanced parenthesis (Dyck) language 𝒟1 as well as the shuffles of multiple 𝒟1 languages.

Counting in real-time is an important property, differentiating some of the language classes in the Chomsky hierarchy, and echoes of it appear in natural languages as well, for instance, in the requirement that the number of arguments of a set of verbs match their subcategorization requirements. But counting does not exhaust the kinds of structural properties that may be apposite for natural language. Chomsky (1957) emphasizes the hierarchical structure found in natural languages, for instance, in the nested matching of “both …and” and “either …or”. Indeed, this kind of nested-matching phenomenon forms the essence of the strictly context-free languages (CFLs). Here simple (real-time) counters are not sufficient; a stack is required. The formal-language-theory reflex of this phenomenon is found most sparely in the Dyck languages 𝒟n of well-nested strings over n pairs of brackets, where n>1. (We refer to these as the 𝒟>1 languages.)

The centrality of this nested stack structure in characterizing the class of context-free languages can be seen in various ways. (i) The automata-theoretic analog of context-free grammars, the pushdown automaton, is defined by its use of a stack (Chomsky, 1962). (ii) Chomsky and Schützenberger (1963) famously showed that all context-free languages are homomorphic images of regular-intersected 𝒟n languages.22 2 In particular, 𝒟2 is sufficient for this purpose (Magniez et al., 2014; Suzgun et al., 2019b). (iii) The hardest CFL of Greibach (1973) and the hardest deterministic CFL of Sudborough (1976) are built using Dyck-language-style matching. For these reasons, we think of the 𝒟>1 languages as expressing the core of what it means to be a context-free language with hierarchical structure, even if it is not itself a universal CFL. This property of the Dyck languages accounts for the heavy focus on the them in prior work (Deleu and Dureau, 2016; Bernardy, 2018; Sennhauser and Berwick, 2018; Skachkova et al., 2018; Hao et al., 2018; Zaremba et al., 2016; Suzgun et al., 2019a; Yu et al., 2019; Hahn, 2019) as well as in this work. It would thus be notable for finite precision neural networks to learn languages, like the 𝒟>1 languages and other languages requiring a stack, if we want these neural architectures to be able to manifest hierarchical structures.

In this paper, we introduce three enhanced RNN models that consist of recurrent layers and external memory structures, namely stack-augmented RNNs (Stack-RNNs), stack-augmented LSTMs (Stack-LSTMs), and Baby Neural Turing Machines (Baby-NTMs), and show that they can effectively learn to recognize some 𝒟>1 languages from limited data by emulating deterministic pushdown automata. Previous studies used simple RNN models (Bernardy, 2018; Sennhauser and Berwick, 2018; Suzgun et al., 2019b; Yu et al., 2019) and memory-augmented architectures (Hao et al., 2018) to attempt to learn 𝒟2 under different training platforms; however, none of them were able to obtain good performance on this task. We thus present the first demonstration that a memory-augmented neural network (MARNN) can learn 𝒟>1 languages. Moreover, we evaluate the learning capabilities of our architectures on six tasks whose solutions require the employment of stack-based approaches, namely learning the 𝒟2, 𝒟3 and 𝒟6 languages, recognizing the deterministic palindrome language (w#wR) and the deterministic homomorphic palindrome language (w#φ(wR)), and performing the string-reversal transduction task (w#|w|#|w|wR). Our results reflect the better modeling capacity of our MARNNs over the standard RNN and LSTM models in capturing hierarchical representations, in addition to providing an insightful glimpse of the promise of these models for real-world natural-language processing tasks.33 3 Our code is available at https://github.com/suzgunmirac/marnns.

2 Related Work

2.1 Learning Formal Languages Using Neural Networks

Using neural network architectures to recognize formal languages has been a central computational task in gaining an understanding of their expressive ability for application to natural-language-processing tasks. Elman (1991) marked the beginning of such methodological investigations and devised an artificial language learning platform where Simple Recurrent Networks (SRNs) (Elman, 1990), were trained to learn the hierarchical and recursive relationships between clauses. An analysis of the hidden state dynamics revealed that the models learned internal representations that encoded information about the grammatical structure and dependencies of the synthetic language. Later, Das et al. (1992) introduced the first RNN model with an external stack, the Recurrent Neural Network Pushdown Automaton (NNPDA), to learn simple deterministic context-free grammars.

Following Elman’s work, many studies used SRNs (Steijvers, 1996; Tonkes and Wiles, 1997; Hölldobler et al., 1997; Rodriguez and Wiles, 1998; Bodén et al., 1999; Bodén and Wiles, 2000; Rodriguez, 2001) and stack-based RNNs (Das et al., 1993; Zeng et al., 1994) to recognize simple context-free and context-sensitive counter languages, including anbn, anbncbmam, anbmBmAn, an+mbncm, anbncn, (ban)m, and 𝒟1. Nonetheless, none of the SRN-based models were able to generalize far beyond their training set. Some of these studies also focused on understanding and visualizing the internal representations learned by the hidden units of the networks, as well as the computational capabilities and limitations of the models.

In contrast, Gers and Schmidhuber (2001), Schmidhuber et al. (2002), and Gers et al. (2002), showed that their (LSTM) networks could not only competently learn two strictly context-free languages, anbn and anbmBmAn, and one strictly context-sensitive language anbncn, but also generalize far beyond the training datasets.

2.2 Memory-Augmented Neural Networks

Recently, memory-augmented architectures have been considered for language modeling tasks: Joulin and Mikolov (2015) proposed a differentiable stack structure controlled by an RNN to infer algorithmic patterns that require some combination of counting and memorization. Though their model could learn anbn, anbncn, anbncndn, anb2n, anbmcn+m, it did not exceed the performance of a standard LSTM on a language modeling task. Inspired by the early architecture design of NNPDA, Grefenstette et al. (2015) introduced LSTM models equipped with unbounded differentiable memory structures, such as stacks, queues, and double-linked lists, and explored their computational power on synthetic transduction tasks. In their experiments, their Neural-Stack and Neural-Queue architectures outperformed the standard LSTM architectures. Neither of these studies on stack-augmented neural architectures inspected the internal representations learned by the recurrent hidden layers or investigated the performance of their models on the Dyck language.

Graves et al. (2014) introduced the Neural Turing Machine (NTM), which consists of a neural network (which can be either feed-forward or recurrent) together with a differentiable external memory, and demonstrated its successful performance on a series of simple algorithmic tasks, such as copying, repeated copying, and sorting. At each time step, an NTM can interact with the external memory via its differentiable attention mechanisms and determine its output using the information from the current hidden state together with the filtered context from the external memory. It is evident from the design differences that the degree of freedom of NTMs is much greater than that of stack-augmented recurrent networks. However, this freedom comes at a price: The different ways in which we can attend to the memory to read and write at each time step make the training of the neural models challenging. Since the publication of the original NTM paper, there have been a number of studies addressing instability issues of the NTM architecture, or more broadly memory-augmented recurrent network models, and proposing new architecture designs. We refer to Zaremba and Sutskever (2015); Kurach et al. (2015); Yang (2016); Graves et al. (2016); Gulcehre et al. (2018) for such proposals.

2.3 Investigations of the Dyck Languages

Deleu and Dureau (2016) used NTMs to capture long-distance dependencies in 𝒟1. Their examination showed that NTMs indeed learn to emulate stack representations and generalize to longer sequences. However, a model need not be equipped with a stack to recognize this simplest Dyck language in a standard learning environment; counting is sufficient for an automaton to capture 𝒟1 (Suzgun et al., 2019b; Yu et al., 2019).

In assessing the ability of recurrent neural networks to process deep and long-distance dependencies, Skachkova et al. (2018) and Sennhauser and Berwick (2018) conducted experiments on the Dyck languages to see whether LSTMs could learn nested structures. The former sought to predict the single correct closing parenthesis, given a Dyck word without its final closing symbol. Although LSTMs performed almost perfectly in this completion task, one cannot draw any definitive conclusion about whether these models really learn the Dyck languages, since even counter automata can achieve perfect accuracy on this task.44 4 For instance, a model can learn to separately count the number of left and right parentheses for each of the n distinct pairs and predict the closing parenthesis for the pair for which the counter is non-zero. Suzgun et al. (2019b) and Yu et al. (2019) also discuss the drawbacks of Skachkova et al.’s learning task and argue that the task is insufficient for illustrating that a network can learn 𝒟2.

Similarly, Bernardy (2018) used three different recurrent networks, namely LSTM, GRU, and RUSS, and combinations thereof, to predict the next possible parenthesis at each time step, assuming that it is a closing parenthesis. His RUSS model is a purpose-designed model containing recurrent units with stack-like states and appears to generalize well to deeper and longer sequences. However, as the author mentions, the specificity of the RUSS architecture disqualifies it as a practical model choice for real-world language modeling tasks.

Hao et al. (2018) studied the interpretability of Neural Stack models (Grefenstette et al., 2015) in a number of simple language modeling tasks, including parenthesis prediction, string reversal, and XOR evaluation. Though their Neural Stacks exhibited intuitive stack behaviors on their context-free transduction tasks and performed almost as well as the standard LSTM models, the authors noted that their stack-augmented models were more difficult to train than the traditional LSTMs.

More recently, Suzgun et al. (2019b) corroborated the theoretical findings of Weiss et al. (2018) by showing that RNN, GRU, and LSTM models could perform dynamic counting by recognizing 𝒟1 as well as shuffles of multiple 𝒟1 languages by emulating simple k-counter machines, while being incapable of recognizing 𝒟2.

Adopting the experimental framework of Sennhauser and Berwick (2018) and the data generation procedure of Skachkova et al. (2018), Yu et al. (2019) conducted experiments on 𝒟2 under different training schemes and objectives using relatively large bi-directional LSTM models. Their recurrent networks failed to generalize well beyond the scope of their training data to learn 𝒟2 under the closing-parenthesis completion and sequence-to-sequence settings.55 5 We note that the authors attempted to generate the shortest proper sequence of closing parentheses given a prefix of a Dyck word under the sequence-to-sequence framework. This task is different from the previous tasks and requires a generative model.

Finally, Hahn (2019) used 𝒟2 to explore the theoretical limitations of self-attention architectures (Vaswani et al., 2017). He demonstrated that self-attention models, even when equipped with infinite precision, cannot capture 𝒟2, unless the number of layers or attention heads increases with the length of the input sequence.

In summary, recognizing the Dyck languages has been an important probing task for understanding the ability of neural networks to capture hierarchical information. Thus far, none of the recurrent neural networks have been shown to capture 𝒟>1. This present work, therefore, provides the first demonstration of RNN-based models learning 𝒟>1, in particular, 𝒟2, 𝒟3, and 𝒟6, in addition to other difficult context-free languages.

3 Models

In this section, we describe the mathematical formulations of our memory-augmented RNNs.The inspiration for our stack-augmented neural architectures came from the pushdown automaton, an abstract machine capable of recognizing context-free languages. Similar stack-based neural networks, however, have also been proposed by others (Pollack, 1991; Das et al., 1992; Joulin and Mikolov, 2015; Grefenstette et al., 2015). Our models differ from them in their theoretical simplicity and empirical success. Our Baby-NTM, on the other hand, can be considered as a simplification of the original NTM architecture (Graves et al., 2014): As opposed to using soft-attention mechanisms to read and write to the external memory, we make deterministic decisions and always read content from and write to the first entry of the memory, thereby making the learning process easier while retaining universal expressivity.

Notation

We will assume the following notation:

  • x=x1,,xT: The input sequence of one-hot vectors, with the i-th token xi.

  • yi: The output associated with xi.

  • 𝐖: The learnable weights of the model.

  • 𝐛: The learnable bias terms of the model.

  • 𝐡i: The i-th hidden state representation.

  • D: The dim. of the input and output samples.

  • H: The dim. of the hidden state of the model.

  • M: The dim. of the external stack/memory.

3.1 Stack-RNN

Before we begin describing our Stack-RNN, recall the formulation of a standard RNN:

𝐡t =tanh(𝐖ihxt+𝐛ih+𝐖hh𝐡(t-1)+𝐛hh)
yt =f(𝐖y𝐡t)

where xtD is the input, 𝐡tH the hidden state, ytD the output at time t, 𝐖yD×H the linear output layer, and f a transformation.

\includegraphics

[width=0.85]architecture.png

Figure 1: An abstract representation of our Stack-RNN architecture.

While designing our Stack-RNN, we come across two important questions: (i) Where and how should we place the stack in the neural network, and (ii) how should we design the stack so that we can backpropagate errors through the stack at the time of training? Regarding (i), we place the stack in such a way that it interacts with the hidden layers at each time step. The benefit of this approach is that errors made in future stages of the model affect and backpropagate through not only the hidden states but also the stack states. Regarding (ii), we construct a differentiable stack structure. Figure 1 provides a visualization of the Stack-RNN. Its formulation is:

𝐡~(t-1) =𝐡(t-1)+𝐖shs(t-1)(0)
𝐡t =tanh(𝐖ihxt+𝐛ih+𝐖hh𝐡~(t-1)+𝐛hh)
yt =σ(𝐖y𝐡t)
at =softmax(𝐖a𝐡t)
nt =σ(𝐖n𝐡t)
st(0) =at(0)nt+at(1)s(t-1)(1)
st(i) =at(0)s(t-1)(i-1)+at(1)s(t-1)(i+1)

where st=st(0)st(1)st(k) is the stack configuration at time step t, with st(0) the topmost stack element; 𝐖shH×M, 𝐖yD×H, 𝐖a2×H, and 𝐖nM×H are all learnable linear weights of the model.

At each time step, we combine the topmost stack element s(t-1)(0) with the previous hidden state 𝐡(t-1) via a linear mapping to produce an intermediate hidden state 𝐡~(t-1). We then use 𝐡~(t-1), together with the input, to generate the current hidden state 𝐡t, from which both the output at that time step yt and the weights of the PUSH (at(0)) and POP (at(1)) operations by the stack controller are determined simultaneously. Here at2 is a probability distribution over the two operations. Finally, we update the stack elements in such a way that the elements become the weighted linear interpolation of both possible stack operations. We can, therefore, consider the elements in the stack as variables in superposition states.

We highlight the following differences between our Stack-RNN and the Stack-RNN by Joulin and Mikolov (2015), as further explicated in the appendix. First, their model does not contain the term 𝐡~(t-1) and it updates 𝐡t as follows:

𝐡t =σ(𝐖ihxt+𝐖hh𝐡(t-1)+𝐖shs(t-1)(0:k))

where 𝐖shH×k and s(t-1)(0:k) the k-topmost elements of the stack at time t-1. But a simple analysis of our Stack-RNN formulation divulges that s(t-1)(0) depends on both 𝐖hh and 𝐖sh in our formulation, whereas it only depends on 𝐖sh in Joulin and Mikolov’s formulation. Furthermore, their architecture takes the sigmoid of the linear combination of xt, 𝐡(t-1), and s(t-1)(0:k), in addition to excluding the bias terms, to update 𝐡t.

3.2 Stack-LSTM

The Stack-LSTM is similar to the Stack-RNN but contains additional components of the standard LSTM architecture by Hochreiter and Schmidhuber (1997). In this model, we update the hidden state of the model according to the standard LSTM equations, that is 𝐡t=LSTM(xt,𝐡~(t-1)).

3.3 Baby-NTM

The Baby-NTM is both an extension of the Stack-RNN and a simplification of the original NTM. While the Stack-RNN contains an unbounded stack mechanism, it can perform only two basic operations on the stack, namely the PUSH and POP actions. In the Baby-NTM architecture, we fix the size of the external memory but provide more freedom to the model: While the interaction between the controller and the memory in the design of the Baby-NTM is mostly similar to that of the Stack-RNN, we allow five operations on the memory at each time step to update its contents: ROTATE-RIGHT, ROTATE-LEFT, NO-OP, POP-LEFT, and POP-RIGHT. Suppose that the current memory configuration 𝐌 is [a,b,c,d,e], where 𝐌(i). Then the operations produce the following configurations at the next time step:

ROTATE-RIGHT :[e,a,b,c,d].
ROTATE-LEFT :[b,c,d,e,a].
NO-OP :[a,b,c,d,e].
POP-RIGHT :[0,a,b,c,d].
POP-LEFT :[b,c,d,e,0].

If we think of the memory as a set 𝐌 sitting on an n-dimensional Euclidean space n, we can then think of these operations as n×n matrices. From an algebraic point of view, we can realize these actions on the memory as left-monoid actions on a set, since matrix multiplication is associative and the matrix corresponding to the operation NO-OP serves the role of an identity element in our computations. Below is the formulation of the Baby-NTM architecture:

𝐡~(t-1) =𝐡(t-1)+𝐖m𝐌(t-1)(0)
𝐡t =tanh(𝐖ihxt+𝐛ih+𝐖hh𝐡~(t-1)+𝐛hh)
yt =σ(𝐖y𝐡t)
at =softmax(𝐖a𝐡t)
nt =σ(𝐖n𝐡t)
𝐌t =i=1Nat(i)[𝐎𝐏(i)]𝐌t-1
𝐌t(0) =𝐌t(0)+nt

where 𝐌t denotes the memory configuration at time step t, nt the value of the element to be inserted to the first entry of the memory at time step t, 𝐎𝐏(i) the matrix corresponding to the i-th action on the memory and at(i) the weight of that action at time t, and all 𝐖’s learnable matrices of the model.66 6 As before, the memory here can be considered as the linear superposition of the results of all the memory operations.

3.4 Softmax Functions

The softmax function in the calculation of at in all these models enables us to map the values of the vector 𝐖a𝐡t to a categorical probability distribution. We investigate the effect of more deterministic decisions about the stack/memory operations on the robustness of our model. A natural approach is to employ a softmax function with varying temperature τ:

softmax-temp(xi,τ)=exp(xi/τ)j=1Nexp(xj/τ)

The softmax-temp function behaves exactly like the standard softmax function when the temperature value τ equals 1. As the temperature increases, softmax-temp produces more uniform categorical class probabilities, whereas as the temperature decreases, the function outputs more discrete probabilities, like a one-hot encoding.

Sample ([]) abc#zyx
Input ( [ ] ) a b c # z y x
Output (/[/) (/[/] (/[/) (/[ a/b/c/# a/b/c/# a/b/c/# z y x
Table 1: Example input-output pairs for 𝒟2 (left) and the deterministic homomorphic palindrome language (right) under the sequence prediction paradigm.

Furthermore, Jang et al. (2016) proposed an efficient and differentiable approximation to sampling from a discrete categorical distribution using a reparameterization trick:

Gumbel-softmax-temp(xi,{g1,,gN},τ)
=exp((logxi+gi)/τ)j=1Nexp((logxj+gj)/τ)

where gii.i.d.Gumbel(0,1). As an alternative to the softmax function with varying temperature, one might want to use the Gumbel-softmax sampling method. In cases where we have more than two operations on the stack/memory, it might be tempting to prefer the Gumbel-softmax sampling approach for the calculation of at values. We experiment with these alternatives below.

4 Experimental Setup

To evaluate the performance of the MARNNs, we conducted experiments on six computational tasks whose solutions require the formation of stack structures. In all the experiments, we used both standard and memory-augmented RNNs to explore the differences in their performances, in addition to the Stack-RNN model by Joulin and Mikolov (2015), and repeated each experiment 10 times. Furthermore, we aimed to investigate softmax functions with varying temperature in our MARNNs and thus employed 12 models with different configurations—two vanilla recurrent models, three MARNNs with three different softmax functions, and one Stack-RNN by Joulin and Mikolov (2015)—for the six tasks.

4.1 The Sequence Prediction Task

Following Gers and Schmidhuber (2001), we trained the networks as follows: At each time step, we presented one input symbol to the network and then asked the model to predict the set of next possible symbols in the language, based on the current symbol, the prior hidden states, and the stack. We used a one-hot representation to encode the inputs and a k-hot representation to encode the outputs. Table 1 provides example input-output pairs for two of the experiments.

In all the experiments, the objective was to minimize the mean-squared error of the sequence predictions. We used an output threshold criterion of 0.5 for the sigmoid layer (yt=σ()) to indicate which symbols were predicted by the model. Finally, we turned this sequence prediction task into a sequence classification task by accepting a sequence if the model predicted all of its output values correctly and rejecting it otherwise.

4.2 Training Details

In contrast to the models in Joulin and Mikolov (2015); Grefenstette et al. (2015); Hao et al. (2018); Yu et al. (2019), our architectures are economical: Unless otherwise stated, the models are all single-layer networks with 8 hidden units. In all the experiments, the entries of the memory were set to be one-dimensional, while the size of the memory in the Baby-NTMs was fixed to 104 (since the length of the longest sequence in all the tasks was 100). We used the Adam optimizer (Kingma and Ba, 2014) and trained our models for three epochs.

5 Learning the Dyck Languages

As described in the introduction, the Dyck languages 𝒟>1 provide an ideal test-bed for exploring the ability of recurrent neural networks to capture the core properties of the context-free languages, their hierarchical modeling ability. None of the previous studies were able to learn the Dyck languages, with the exception of 𝒟1 (which can be captured using a simple one-counter machine). The main motivation of this paper was to introduce new neural architectures that could recognize 𝒟2 and other difficult context-free languages.

Training Set Test Set
Models Min Max Med Mean Min Max Med Mean
Vanilla RNN 3.32 12.78 6.41 7.11 0 0 0 0
Vanilla LSTM 36.16 62.80 53.24 52.38 0.28 4.10 1.02 1.39
Stack-RNN by J&M (2015) 0 100 100 70.50 0 100 100 70.00
Stack-RNN+Softmax 100 100 100 100 99.96 100 100 99.99
Stack-RNN+Softmax-Temp 100 100 100 100 99.92 100 100 99.99
Stack-RNN+Gumbel-Softmax 3.44 100 99.98 90.32 0 100 99.96 89.96
Stack-LSTM+Softmax 62.52 100 100 95.69 2.78 100 98.25 87.51
Stack-LSTM+Softmax-Temp 46.70 100 100 94.67 0.80 100 99.73 89.84
Stack-LSTM+Gumbel-Softmax 50.26 100 99.94 94.97 0.70 99.94 99.33 88.68
Baby-NTM+Softmax 2.56 100 100 75.80 0 100 99.91 68.73
Baby-NTM+Softmax-Temp 1.16 100 99.88 72.43 0 100 96.97 68.23
Baby-NTM+Gumbel-Softmax 5.66 100 99.88 89.39 0 99.90 99.54 86.85
Table 2: The performances of the vanilla and memory-augmented recurrent models on 𝒟2. Min/Max/Median/Mean results were obtained from 10 different runs of each model with the same random seed across each run. We note that both Stack-RNN+Softmax and Stack-RNN+Softmax-Temp achieved full accuracy on the test sets in 8 out of 10 times.

5.1 The 𝒟2 Language

We trained the Stack-RNN, Stack-LSTM, and Baby-NTM architectures with slightly different memory-controller configurations, in addition to standard RNNs, to learn 𝒟2.

A probabilistic context-free grammar for 𝒟2 can be written as follows:

S{(S)with probability p2with probability p2SSwith probability qεwith probability 1-(p+q)

where 0<p,q<1 and p+q<1.

Setting p=12 and q=14, we generated 5000 distinct Dyck words, whose lengths were bounded to [2,50], for the training sets. Similarly, we generated 5000 distinct words whose lengths were bounded to [52,100] for the test sets. Hence, there was no overlap between the training and test sets. Test set performance requires generalization well past the training set lengths. As it can be seen in the length and maximum depth distributions of the training and test sets for one of the 𝒟2 experiments in Figure 2, the test samples contained longer dependencies than the training sample.

\includegraphics

[width=0.50]newdyck_horiz.png

Figure 2: Length and maximum depth distributions of training/test sets for an example 𝒟2 experiment.
\includegraphics

[width=.98]ntm_viz_dyck2.png

Figure 3: Visualizations of the strength of the memory operations (left) and the values of the memory entries (right) of a Baby-NTM+Softmax model trained to learn 𝒟2. We highlight that the Baby-NTM appears to have learned to emulate a simple but effective differentiable pushdown automaton to recognize 𝒟2.

Table 2 lists the performances of the vanilla and memory-augmented recurrent models on the training and test sets for the Dyck language. Our empirical results highlight the dramatic performance difference between the memory-augmented recurrent networks and vanilla recurrent networks: We note that almost all our stack/memory-augmented architectures achieved full accuracy on the test set, which contained longer and deeper sequences than the training set, while the vanilla RNNs and LSTMs failed to generalize with below 5% accuracy. We further observe that the Stack-RNN proposed by Joulin and Mikolov (2015) performed nearly as well as our models, though ours performed better than theirs on average.

When evaluated based on their empirical median and mean percent-wise performances, the Stack-RNNs appear to be slightly more successful than the Stack-LSTMs and the Baby-NTMs. Both the Stack-RNN+Softmax and Stack-RNN+Softmax-Temp obtained perfect accuracy on the test sets 8 out of 10 times, whereas the best Stack-LSTM variant, Stack-LSTM+Softmax-Temp, achieved perfect accuracy only 3 out of 10 times. Nevertheless, we acknowledge that most of our stack-augmented models were able to successfully generalize well beyond the training data. 77 7 We additionally note that, contrary to our initial expectation, using a softmax activation function with varying temperature did not improve the performance of our memory-augmented neural models in general. However, the networks might actually benefit from temperature-based softmax functions in the presence of more categorical choices, because currently the models have only a very limited number of memory operations.

Training Set Test Set
Models Min Max Med Mean Min Max Med Mean
Vanilla RNN 0.82 14.88 11.19 9.52 0 0 0 0
Vanilla LSTM 24.16 39.76 31.55 32.58 0 0.16 0.02 0.04
Stack-RNN by J&M (2015) 9.02 100 98.17 79.32 0 100 91.29 66.72
Stack-RNN+Softmax 7.80 100 100 81.75 0 100 100 80.00
Stack-RNN+Softmax-Temp 37.64 99.98 95.74 81.95 0.06 98.18 67.32 52.49
Stack-RNN+Gumbel-Softmax 1.78 100 44.55 50.71 0 99.98 21.94 43.65
Stack-LSTM+Softmax 33.98 100 92.25 77.97 0.04 99.94 61.54 55.49
Stack-LSTM+Softmax-Temp 37.64 99.98 95.74 81.95 0.06 98.18 67.32 52.49
Stack-LSTM+Gumbel-Softmax 25.74 99.98 78.21 72.01 0 99.2 27.08 42.17
Baby-NTM+Softmax 4.60 100 84.29 60.63 0 100 23.44 44.51
Baby-NTM+Softmax-Temp 6.40 100 16.44 39.97 0 100 0.51 27.46
Baby-NTM+Gumbel-Softmax 0.76 100 11.70 43.42 0 99.9 0 38.76
Table 3: The performances of the vanilla and memory-augmented recurrent models on 𝒟3. In 32 out of 100 trials, the MARNNs with 8 hidden units and one-dimensional memory achieved over 99% accuracy on the test sets. However, increasing the dimensional of the memory for our MARNNs further improved our results.
\includegraphics

[width=.98]ntm_viz_dyck3_updated.png

Figure 4: Visualization of the values of the memory entries of a Baby-NTM+Softmax model trained to learn 𝒟3.

Figure 3 provides a visualization of the strengths of the memory operations and the change in the values of the entries of the memory component of one of our memory-augmented models (a Baby-NTM+Softmax with 8 hidden units) at each time step when the model was presented a sample in 𝒟2. The Baby-NTM appears to be using its ROTATE-RIGHT and POP-RIGHT operations for the open parentheses ‘(’ and ‘[’, respectively, and POP-LEFT operation for both of the closing parentheses ‘)’ and ‘]’, thereby emulating a simple PDA-like mechanism. A careful inspection of the memory entries of the Baby-NTM indicates that the model utilizes a special marker with a distinct value (0.45 in our example) to distinguish an empty stack configuration from a processed stack configuration. On the other hand, the memory alone does not dictate the output values: The hidden states of the model govern the overall behavior and embody a finite-state control, as shown in the formulation of the Baby-NTM.88 8 The visualizations for the other memory-augmented models were qualitatively similar, though some networks learned more complex representations. We further emphasize that the dimensions of the external stack and memory entries in our MARNNs were set up to be one-dimensional for visualization purposes, but we additionally experimented with higher dimensional memory structures and observed that such additions often increased the overall performances of the models, especially the performance of the Baby-NTMs.

Training Set Test Set
Models Min Max Med Mean Min Max Med Mean
Vanilla RNN 21.19 25.71 23.53 23.39 0 0.02 0 0
Vanilla LSTM 32.47 41.62 37.35 37.05 0 0.06 0 0.01
Stack-RNN by J&M (2015) 99.47 100 100 99.94 97.60 100 99.99 99.70
Stack-RNN+Softmax 99.92 100 100 99.99 99.32 100 99.99 99.85
Stack-RNN+Softmax-Temp 36.83 100 98.54 80.09 0 100 78.44 60.88
Stack-RNN+Gumbel-Softmax 20.90 99.98 99.93 91.62 0 99.92 99.50 87.69
Stack-LSTM+Softmax 98.48 100 99.99 99.79 91.12 100 99.18 98.23
Stack-LSTM+Softmax-Temp 36.83 100 98.54 80.09 0 100 78.44 60.88
Stack-LSTM+Gumbel-Softmax 36.20 99.94 67.50 68.62 0 99.90 24.61 44.47
Baby-NTM+Softmax 99.94 100 100 99.99 99.00 100 99.97 99.87
Baby-NTM+Softmax-Temp 86.12 100 100 98.15 8.56 100 99.91 88.40
Baby-NTM+Gumbel-Softmax 22.56 99.98 99.86 75.46 0 99.86 99.23 63.49
Table 4: The performances of the vanilla and memory-augmented recurrent models on the 𝒟6. We note that the MARNNs in this example contain 12 hidden units and 5-dimensional external stack/memory. In 70 out of 100 trials, the MARNNs performed over 99% accuracy. Overall, the Baby-NTM+Softmax had the best performance.

5.2 The 𝒟3 and 𝒟6 Languages

We further conducted experiments on the 𝒟3 and 𝒟6 languages to evaluate the ability of our memory-augmented architectures to encode more complex hierarchical representations. The training and test corpora were generated in the same style as the previous task; however, we included 15,000 samples in the training set for 𝒟6, due to its complexity.

As shown in Table 3, the Stack-RNN+Softmax model had the best performance among all the neural networks on the 𝒟3 learning task, obtaining perfect accuracy in eight out of ten trials. Following the Stack-RNNs, the Stack-LSTMs and Baby-NTMs, on average, achieved around 50% and 37% accuracy on the test set, respectively. On the other hand, the Stack-RNN by Joulin and Mikolov (2015) could generalize better than most of our models in terms of its median and mean scores, albeit still not better than our Stack-RNN.

Figure 4 illustrates the behavior of one of our Baby-NTMs as the model is presented a long sequence in 𝒟3. It is remarkable to witness how the memory-augmented model makes use of its external memory to learn a sequence of actions to recognize a sample in 𝒟3. Similar to the behavior of the previous model in Figure 3, the RNN controller of the Baby-NTM model in this instance appears to be using the differentiable memory as a stack-like structure and inserting distinct values to the memory at different time steps. Furthermore, we note the presence of special markers (0.22 in the first half and 0.21 in the second half – both colored blue) in the memory: These idiosyncratic memory elements marking the bottom of the used portion of the stack enable the model to know when to predict only the set of open parentheses.

In contrast, the overall performance of the MARNNs for 𝒟6 was much lower than for 𝒟2 and 𝒟3. For instance, none of our models could obtain full accuracy on the training or test sets; the maximum score our models could achieve was 60.38%. We wondered whether increasing the dimension of the memory would remedy the problem. Table 4 summarizes our new results with the same architectures containing 12 hidden units and 5-dimensional augmented stack/memory. We saw a significant increase in the performance of our models: In 60 out of 90 trials, our enhanced MARNNs achieved almost perfect (99%) accuracy on the test set.

6 Learning Palindrome Languages

Our previous results established that the MARNNs can learn Dyck languages, which represent the core of the CFL class. We note that the Dyck languages incorporate a notion of palindrome: The intersection of 𝒟n with p*p¯* leads to a definition of a specific type of a homomorphic palindrome language wφ(wR), where * is the Kleene star, φ a homomorphism given by pip¯i, and wR the reversal of w. Therefore, we would expect our models to be able to learn various deterministic versions of palindrome languages.

Training Set Test Set
Models Min Max Med Mean Min Max Med Mean
Vanilla RNN 0 0 0 0 0 0 0 0
Vanilla LSTM 0 5.22 2.23 2.39 0 0 0 0
Stack-RNN by J&M (2015) 0 100 46.04 49.13 0 100 50.42 50.00
Stack-RNN+Softmax 0 100 99.99 60.00 0 100 100 60.00
Stack-RNN+Softmax-Temp 0 100 100 70.00 0 100 100 70.00
Stack-RNN+Gumbel-Softmax 0 100 17.10 43.42 0 100 16.98 43.39
Stack-LSTM+Softmax 0 100 100 61.36 0 100 100 60.00
Stack-LSTM+Softmax-Temp 0 100 100 70.07 0 100 100 70.00
Stack-LSTM+Gumbel-Softmax 0 100 100 80.20 0 100 99.98 79.99
Baby-NTM+Softmax 0 100 67.16 53.43 0 100 66.55 53.31
Baby-NTM+Softmax-Temp 0 100 99.99 60.00 0 100 100 60.00
Baby-NTM+Gumbel-Softmax 0 100 60.43 52.09 0 100 61.30 52.26
Table 5: The performances of the vanilla and memory-augmented recurrent models on the deterministic homomorphic palindrome language. Most of our MARNNs achieved almost full accuracy on the test sets.

6.1 Homomorphic Palindrome Language

Our first target of exploration is the deterministic homomorphic palindrome language, the language of words w#φ(wR) where w{a,b,c}*, # is a symbol serving to mark the center of the palindrome, and φ maps a to x, b to y, and c to z. We use the notion of recognition from the previous section, predicting at each symbol the set of all possible following symbols. Viewed as a transduction, this amounts to the following task:

w#φ(wR) (a/b/c/#)|w|φ(wR)

The training set for this task contained 5000 unique samples of length varying from 2 to 50, and the test set contained 5000 unique samples of length varying from 52 to 100. We remark that there was no overlap between the training and test sets, just as in the case of the Dyck language tasks.

Table 5 lists the performances of the vanilla and memory-augmented models on the deterministic homomorphic palindrome language. We highlight the success of our models once again: While our MARNNs often performed with perfect accuracy on the training and test sets, the standard recurrent models could not predict even one sample in the test set correctly. Overall, most of the variants of the Stack-RNN/LSTM and Baby-NTM models seem to have learned how to emulate pushdown automata: They learned to push certain values into their memory or stack whenever they read a character from the {a,b,c} alphabet and then started popping them one by one after they would see #, and at the last step, the model predicted the end of the sequence token . The models did not perform equally well though: When evaluated on their mean and median percent-wise performances, for instance, the Stack-LSTMs were found to generalize better than the Stack-RNNs and the Baby-NTMs in this task. Further, it is hard to make a conclusive statement about whether employing a softmax function with varying temperature in our MARNNs had any benefit. Nevertheless, the Stack-LSTMs+Gumbel-Softmax performed slightly better than the other models in terms of their mean percentages on the test sets.

Training Set Test Set
Models Min Max Med Mean Min Max Med Mean
Vanilla RNN 0.06 0.90 0.46 0.46 0 0 0 0
Vanilla LSTM 0.68 5.08 3.62 3.50 0 0 0 0
Stack-RNN by J&M (2015) 0.16 100 50.29 50.19 0 100 50.00 50.00
Stack-RNN+Softmax 0.38 100 100 77.39 0 100 100 76.65
Stack-RNN+Softmax-Temp 0.14 100 100 80.05 0 100 100 80.00
Stack-RNN+Gumbel-Softmax 0.18 100 99.98 77.71 0 100 99.98 77.83
Stack-LSTM+Softmax 2.02 100 100 80.60 0 100 100 79.99
Stack-LSTM+Softmax-Temp 0.06 100 100 70.49 0 100 100 70.00
Stack-LSTM+Gumbel-Softmax 2.18 100 100 80.48 0 100 100 80.00
Baby-NTM+Softmax 0.08 100 100 86.65 0 100 100 86.67
Baby-NTM+Softmax-Temp 0 100 100 70.07 0 100 100 70.00
Baby-NTM+Gumbel-Softmax 0.20 100 100 90.01 0 100 99.97 89.96
Table 6: The performances of the vanilla and memory-augmented recurrent models on the string reversal task under the transduction setting. In 32 out of 90 trials, our MARNNs obtained perfect accuracy.

6.2 Simple Palindrome Language

Taking the homomorphism φ to be the identity map in the previous language, it is reasonable to expect the models to learn the w#wR palindrome language. We evaluated recognition of this language once again as a possible-next-symbol prediction task, which can be viewed as the following sequence transduction task:

w#wR (a/b/c/#)|w|wR

Surprisingly, all of our MARNN models had difficulty learning this language. Only in three of 90 trials were our MARNNs able to learn the language; other times, the models typically obtained 0% accuracy during testing. When we increased the dimensionality of the memory to five, however, our MARNNs immediately learned the task with almost full accuracy again. Given that the only difference between the previous task and this task is the second half of the strings, we conjectured that our models were getting confused in the second half: Because of vocabulary overlap in the two halves, the models might be using information from the second half when predicting the set of possible symbols in the second half, thereby getting distracted and finding themselves stuck at bad local minima. To verify our hypothesis, we thus performed one more task, string reversal, which we describe in the following section.

7 Learning the String-Reversal Task

In the previous section, we witnessed a strange phenomenon: Our MARNN models with 8 hidden units and one-dimensional external memory could learn the deterministic homomorphic palindrome language, but not the simple palindrome language. Since the only difference between the two tasks is the existence of a non-trivial isomorphism φ (and the vocabulary overlap in the two halves), we wanted to perform the string reversal task under a sequence transduction setting in which the reversal appears only in the output:

w#|w| #|w|wR

The training and test sets were similar to the previous cases: 5000 samples each, with lengths bounded by [2,50] and [52,100], respectively.

Table 6 illustrates that most of our MARNNs achieved perfect accuracy on the test sets in this version of the string reversal task. The results corroborated our conjecture by showing that when the second half of the input samples contained symbols from an alphabet other than the one used in the first half (in this case, the # symbol), the memory-augmented models do not get confused and act in the desired way (pushing elements into the stack in the first half and popping them one by one in the second half after seeing the marker #). When we visualized the hidden states and the memory entries of the models for this task, we observed that our MARNNs learned to emulate simple pushdown-automata.

8 Conclusion

In this paper, we introduced three memory-augmented neural architectures and provided the first demonstration of neural networks learning to recognize the generalized Dyck languages, which represent the “core” of the context-free language class. We further evaluated the learning capabilities of our models on recognizing the deterministic homomorphic palindrome language and simple palindrome language under the sequence prediction framework and performing the string-reversal task under a sequence transduction setting. In all the experiments, our MARNNs outperformed the vanilla RNN and LSTM models and often attained perfect accuracy on both the training and test sets.

Since we limited the dimensionality of the external memory in our memory-augmented architectures to one, we were also able to visualize the changes in the external memory of the Baby-NTMs trained to learn the 𝒟2 and 𝒟3 languages. Our simple analysis revealed that our MARNNs learned to emulate pushdown-automata to recognize these Dyck languages. Hao et al. (2018) mention that their Neural-Stack models could not perfectly employ stack-based strategies to learn an appropriate representation to recognize the 𝒟2 language, and further address the difficulty of training stack-augmented recurrent networks. Although we agree that it is challenging to train MARNNs due to various optimization issues, one can still train these models with as few as eight or twelve hidden units to learn the Dyck languages, and our empirical findings support this claim.

9 Acknowledgment

The authors appreciate the helpful comments of Michael Hahn, Yoav Goldberg, Drew Pendergrass, Dan Stefan Eniceicu, and Filippos Sytilidis. M.S. gratefully acknowledges the support of the Harvard College Research Program (HCRP) and the Harvard Center for Research on Computation and Society Research Fellowship for Undergraduate Students. S.G. was supported by a Siebel Fellowship. Y.B. was supported by the Harvard Mind, Brain, and Behavior Initiative. The computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.

References

  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
  • Bernardy (2018) Jean-Philippe Bernardy. 2018. Can Recurrent Neural Networks Learn Nested Recursion? LiLT (Linguistic Issues in Language Technology), 16(1).
  • Bodén and Wiles (2000) Mikael Bodén and Janet Wiles. 2000. Context-Free and Context-Sensitive Dynamics in Recurrent Neural Networks. Connection Science, 12(3-4):197–210.
  • Bodén et al. (1999) Mikael Bodén, Janet Wiles, Bradley Tonkes, and Alan Blair. 1999. Learning to Predict a Context-Free Language: Analysis of Dynamics in Recurrent Hidden Units.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Chomsky (1957) Noam Chomsky. 1957. Syntactic Structures. Mouton, The Hague.
  • Chomsky (1962) Noam Chomsky. 1962. Context-Free Grammars and Pushdown Storage. MIT Res. Lab. Electron. Quart. Prog. Report., 65:187–194.
  • Chomsky and Schützenberger (1963) Noam Chomsky and Marcel P Schützenberger. 1963. The Algebraic Theory of Context-Free Languages. In Studies in Logic and the Foundations of Mathematics, volume 35, pages 118–161. Elsevier.
  • Das et al. (1992) Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. 1992. Learning Context-free Grammars: Capabilities and Limitations of a Recurrent Neural Network with an External Stack Memory. In Proceedings of The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, page 14.
  • Das et al. (1993) Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. 1993. Using Prior Knowledge in a NNPDA to Learn Context-Free Languages. In Advances in neural information processing systems, pages 65–72.
  • Deleu and Dureau (2016) Tristan Deleu and Joseph Dureau. 2016. Learning Operations on a Stack with Neural Turing Machines. arXiv preprint arXiv:1612.00827.
  • Elman (1990) Jeffrey L Elman. 1990. Finding Structure in Time. Cognitive science, 14(2):179–211.
  • Elman (1991) Jeffrey L Elman. 1991. Distributed Representations, Simple Recurrent Networks, and Grammatical Structure. Machine learning, 7(2-3):195–225.
  • Gers et al. (2002) Felix A Gers, Juan Antonio Pérez-Ortiz, Douglas Eck, and Jürgen Schmidhuber. 2002. Learning Context Sensitive Languages with LSTM Trained with Kalman Filters. In International Conference on Artificial Neural Networks, pages 655–660. Springer.
  • Gers and Schmidhuber (2001) Felix A Gers and E Schmidhuber. 2001. LSTM Recurrent Networks Learn Simple Context-Free and Context-Sensitive Languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.
  • Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE.
  • Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. arXiv preprint arXiv:1410.5401.
  • Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid Computing Using a Neural Network with Dynamic External Memory. Nature, 538(7626):471.
  • Grefenstette et al. (2015) Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to Transduce with Unbounded Memory. In Advances in Neural Information Processing Systems, pages 1828–1836.
  • Greibach (1973) Sheila A Greibach. 1973. The hardest context-free language. SIAM Journal on Computing, 2(4):304–310.
  • Gulcehre et al. (2018) Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2018. Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes. Neural computation, 30(4):857–884.
  • Hahn (2019) Michael Hahn. 2019. Theoretical limitations of self-attention in neural sequence models. arXiv preprint arXiv:1906.06755.
  • Hao et al. (2018) Yiding Hao, William Merrill, Dana Angluin, Robert Frank, Noah Amsel, Andrew Benz, and Simon Mendelsohn. 2018. Context-Free Transductions with Neural Stacks. arXiv preprint arXiv:1809.02836.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural computation, 9(8):1735–1780.
  • Hölldobler et al. (1997) Steffen Hölldobler, Yvonne Kalinke, and Helko Lehmann. 1997. Designing a Counter: Another Case Study of Dynamics and Activation Landscapes in Recurrent Networks. In Annual Conference on Artificial Intelligence, pages 313–324. Springer.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical Reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
  • Joulin and Mikolov (2015) Armand Joulin and Tomas Mikolov. 2015. Inferring Algorithmic Patterns with Stack-augmented Recurrent Nets. In Advances in neural information processing systems, pages 190–198.
  • Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kurach et al. (2015) Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. 2015. Neural Random-Access Machines. arXiv preprint arXiv:1511.06392.
  • Magniez et al. (2014) Frédéric Magniez, Claire Mathieu, and Ashwin Nayak. 2014. Recognizing well-parenthesized expressions in the streaming model. SIAM Journal on Computing, 43(6):1880–1905.
  • Merrill (2019) William Merrill. 2019. Sequential neural networks as automata. arXiv preprint arXiv:1906.01615.
  • Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent Neural Network Based Language Model. In Eleventh annual conference of the international speech communication association.
  • Pollack (1991) Jordan B Pollack. 1991. The Induction of Dynamical Recognizers. In Connectionist Approaches to Language Learning, pages 123–148. Springer.
  • Rodriguez (2001) Paul Rodriguez. 2001. Simple Recurrent Networks Learn Context-Free and Context-Sensitive Languages by Counting. Neural computation, 13(9):2093–2118.
  • Rodriguez and Wiles (1998) Paul Rodriguez and Janet Wiles. 1998. Recurrent Neural Networks can Learn to Implement Symbol-Sensitive Counting. In Advances in Neural Information Processing Systems, pages 87–93.
  • Schmidhuber et al. (2002) Jürgen Schmidhuber, F Gers, and Douglas Eck. 2002. Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM. Neural Computation, 14(9):2039–2041.
  • Sennhauser and Berwick (2018) Luzi Sennhauser and Robert Berwick. 2018. Evaluating the Ability of LSTMs to Learn Context-Free Grammars. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 115–124.
  • Siegelmann and Sontag (1994) Hava T Siegelmann and Eduardo D Sontag. 1994. Analog Computation via Neural Networks. Theoretical Computer Science, 131(2):331–360.
  • Siegelmann and Sontag (1995) Hava T Siegelmann and Eduardo D Sontag. 1995. On the Computational Power of Neural Nets. Journal of computer and system sciences, 50(1):132–150.
  • Skachkova et al. (2018) Natalia Skachkova, Thomas Trost, and Dietrich Klakow. 2018. Closing Brackets with Recurrent Neural Networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 232–239.
  • Steijvers (1996) Mark Steijvers. 1996. A Recurrent Network that Performs a Context-Sensitive Prediction Task.
  • Sudborough (1976) Ivan Hal Sudborough. 1976. On deterministic context-free languages, multihead automata, and the power of an auxiliary pushdown store. In Proceedings of the eighth annual ACM symposium on Theory of computing, pages 141–148. ACM.
  • Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In INTERSPEECH.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in neural information processing systems, pages 3104–3112.
  • Suzgun et al. (2019a) Mirac Suzgun, Yonatan Belinkov, and Stuart M Shieber. 2019a. On Evaluating the Generalization of LSTM Models in Formal Languages. Proceedings of the Society for Computation in Linguistics (SCiL), pages 277–286.
  • Suzgun et al. (2019b) Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, and Stuart Shieber. 2019b. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.
  • Tonkes and Wiles (1997) Bradley Tonkes and Janet Wiles. 1997. Learning a Context-Free Task with a Recurrent Neural Network: An Analysis of Stability. In In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive Science Society. Citeseer.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Weiss et al. (2018) Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision rnns for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745.
  • Yang (2016) Greg Yang. 2016. Lie Access Neural Turing Machine. arXiv preprint arXiv:1602.08671.
  • Yu et al. (2019) Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. 2019. Learning the Dyck language with attention-based Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 138–146, Florence, Italy. Association for Computational Linguistics.
  • Zaremba et al. (2016) Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. 2016. Learning Simple Algorithms from Examples. In International Conference on Machine Learning, pages 421–429.
  • Zaremba and Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement Learning Neural Turing Machines–Revised. arXiv preprint arXiv:1505.00521.
  • Zeng et al. (1994) Zheng Zeng, Rodney M Goodman, and Padhraic Smyth. 1994. Discrete Recurrent Neural Networks for Grammatical Inference. IEEE Transactions on Neural Networks, 5(2):320–330.

Appendix A Comparison of Stack-RNN architectures

Recall the formulation of our Stack-RNN architecture in Section 3.1. We update 𝐡t, the hidden state at time t, as follows:

𝐡t =tanh(𝐖ihxt+𝐛ih+𝐖hh𝐡~(t-1)+𝐛hh)

where 𝐡~(t-1) is defined to be:

𝐡~(t-1) =𝐡(t-1)+𝐖shs(t-1)(0)

Rewriting the equation for 𝐡t, we realize that our formulation of Stack-RNN is almost equivalent to the Stack-RNN model by Joulin and Mikolov (2015):

𝐡t =tanh(𝐖ihxt+𝐛ih+𝐖hh𝐡~(t-1)+𝐛hh)
=tanh(𝐖ihxt+𝐛ih+𝐖hh(𝐡(t-1)+𝐖shs(t-1)(0))+𝐛hh)
=tanh(𝐖ihxt+𝐛ih+𝐖hh𝐡(t-1)+𝐖hh𝐖sh(*)s(t-1)(0)+𝐛hh)

In our Stack-RNN architecture, s(t-1)(0) depends on (*), namely 𝐖hh𝐖sh, whereas in Joulin and Mikolov’s Stack-RNN model, it only depends on 𝐖sh. Furthermore, we make use of tanh(), instead of σ(), to achieve non-linearity and include bias terms, namely 𝐛ih and 𝐛hh, in our definition of 𝐡t.