Multiplicative Models for Recurrent Language Modeling

  • 2019-06-30 20:51:43
  • Diego Maupomé, Marie-Jean Meurs
  • 0

Abstract

Recently, there has been interest in multiplicative recurrent neural networksfor language modeling. Indeed, simple Recurrent Neural Networks (RNNs)encounter difficulties recovering from past mistakes when generating sequencesdue to high correlation between hidden states. These challenges can bemitigated by integrating second-order terms in the hidden-state update. Onesuch model, multiplicative Long Short-Term Memory (mLSTM) is particularlyinteresting in its original formulation because of the sharing of itssecond-order term, referred to as the intermediate state. We explore thesearchitectural improvements by introducing new models and testing them oncharacter-level language modeling tasks. This allows us to establish therelevance of shared parametrization in recurrent language modeling.

 

Quick Read (beta)

Multiplicative Models for Recurrent
Language Modeling

Diego Maupomé Université du Québec à Montréal, Montréal, QC, Canada    Marie-Jean Meurs
[email protected]
[email protected]
Université du Québec à Montréal, Montréal, QC, Canada
Abstract

Recently, there has been interest in multiplicative recurrent neural networks for language modeling. Indeed, simple Recurrent Neural Networks (RNNs) encounter difficulties recovering from past mistakes when generating sequences due to high correlation between hidden states. These challenges can be mitigated by integrating second-order terms in the hidden-state update. One such model, multiplicative Long Short-Term Memory (mLSTM) is particularly interesting in its original formulation because of the sharing of its second-order term, referred to as the intermediate state. We explore these architectural improvements by introducing new models and testing them on character-level language modeling tasks. This allows us to establish the relevance of shared parametrization in recurrent language modeling.

\glsdisablehyper\newacronym

nlpNLPNatural Language Processing \newacronymrnnRNNRecurrent Neural Network \newacronymlstmLSTMLong Short-Term Memory network \newacronymgruGRUGated Recurrent Unit \newacronymmrnnmRNN multiplicative \glsrnn \newacronymmlstmmLSTM multiplicative \glslstm \newacronymmgrumGRUmultiplicative \glsgru \newacronymtrnntRNNtensor \glsrnn \newacronymtmlstmtmLSTMtrue \glsmlstm \newacronymtmgrutmGRUtrue multiplicative \glsgru \newacronymmirnnMI-RNNMultiplicative Integration \glsrnn \newacronymmilstmMI-LSTMMultiplicative Integration \glslstm \newacronymrlmRLMRecurrent Language Model \newacronymbpcBPCbits per character

1 Introduction

One of the principal challenges in computational linguistics is to account for the word order of the document or utterance being processed [6]. Of course, the numbers of possible phrases grows exponentially with respect to a given phrase length, requiring an approximate approach to summarizing its content. \glsplrnn are such an approach, and they are used in various tasks in \glsnlp, such as machine translation [16], abstractive summarization [20] and question answering [11]. However, \glsplrnn, as approximations, suffer from numerical troubles that have been identified, such as that of recovering from past errors when generating phrases. We take interest in a model that mitigates this problem, \glsplmrnn, and how it has been and can be combined for new models. To evaluate these models, we use the task of recurrent language modeling, which consists in predicting the next token (character or word) in a document. This paper is organized as follows: \glsplrnn and \glsplmrnn are introduced respectively in Sections 2 and 3. Section 4 presents new and existing multiplicative models. Section 5 describes the datasets and experiments performed, as well as results obtained. Sections  6 discusses and concludes our findings.

2 Recurrent neural networks

\glspl

rnn are powerful tools of sequence modeling that can preserve the order of words or characters in a document. A document is therefore a sequence of words, x1,,xT. Given the exponential growth of possible histories with respect to the sequence length, the probability of observing a given sequence needs to be approximated. \glsplrnn will make this approximation using the product rule,

P(x1,,xT)=P(x1)P(x2|x1)P(xT|x1,,xT-1),

and updating a hidden state at every time step. This state is first null,

h0=𝟎.

Thereafter, it is computed as a function of the past hidden state as well as the input at the current time step,

ht=f(ht-1,xt),

known as the transition function. f is a learned function, often taking the form

ht=tanh(Uxt+Wht-1).11 1 Additive biases are omitted throughout the paper for concision

This allows, in theory, for straightforward modeling of sequences of arbitrary length.

In practice, \glsplrnn encounter some difficulties that need some clever engineering to be mitigated. For example, learning long-term dependencies such as those found in language is not without its share of woes arising from numerical considerations, such as the well-known vanishing gradient problem [2]. This can be addressed with gating mechanisms, such as \glslstm [9] and \glsgru [3].

A problem that is more specific to generative \glsplrnn is their difficulty recovering from past errors [7], which [15] argue arises from having hidden-state transitions that are highly correlated across possible inputs. One approach to adapting \glsplrnn to have more input-dependent transition functions is to use the multiplicative ”trick” [23]. This approximates the idea of having the input at each time synthesize a dedicated kernel of parameters dictating the transition from the previous hidden state to the next. These two approaches can be combined, as in the \glsmlstm [15].

We begin by contending that, in making \glsplrnn multiplicative, sharing what is known as the intermediate state does not significantly hinder performance when parameter counts are equal. We verify this with existing as well as new gated models on several well-known language modeling tasks.

3 Multiplicative RNNs

Most recurrent neural network architectures, including \glslstm and \glsgru share the following building block:

h~t=Uxt+Wht-1. (1)

h~t is the candidate hidden state, computed from the previous hidden state, ht-1, and the current input, xt, weighted by the parameter matrices W and U, respectively. This candidate hidden state may then be passed through gating mechanisms and non-linearities depending on the specific recurrent model.

Let us assume for simplicity that the input is a one-hot vector (one component is 1, the rest are 0 [22] [see p.45]), as it is often the case in \glsnlp. Then, the term Uxt is reduced to a single column of U and can therefore be thought of as an input-dependent bias in the hidden state transition. As the dependencies we wish to establish between the elements of the sequences under consideration become more distant, the term Wht will have to be significantly larger than this input-dependent bias, Uxt, in order to remain unchanged across time-steps. This will mean that from one time-step to the next, the hidden-to-hidden transition will be highly correlated across possible inputs. This can be addressed by having more input-dependent hidden state transitions, making \glsplrnn more expressive.

In order to remedy the aforementioned problem, each possible input i can be given its own matrix W(i) parameterizing the contribution of ht to h~t.

h~t=Uxt+(iW(i)xt(i))𝐖(xt)ht-1. (2)

This is known as a \glstrnn [23], because all the matrices can be stacked to form a rank 3 tensor, 𝐖. The input xt selects the relevant slice of the tensor in the one-hot case and a weighted sum over all slices in the dense case. The resulting matrix then acts as the appropriate W.

However, such an approach is impractical because of the high parameter count such a tensor would entail. The tensor can nonetheless be approximated by factorizing it [24] as follows:

𝐖(xt)=Vdiag(Wxxt)Wh, (3)

where Wx and Wh are weight matrices, and diag is the operator turning a vector v into a diagonal matrix where the elements of v form the main diagonal of said matrix. Replacing 𝐖(xt) in Equation (2) by this tensor factorization, we obtain

h~t=Uxt+Vmt, (4)

where mt is known as the intermediate state, given by

mt=(Wxxt)*(Whht-1). (5)

Here, * refers to the Hadamard or element-wise product of vectors. The intermediate state is the result of having the input apply a learned filter via the new parameter kernel W to the factors of the hidden state. It should be noted that the dimensionality of mt is free and, should it become sufficiently large, the factorization becomes as expressive as the tensor. The ensuing model is known as a \glsmrnn [23].

4 Sharing intermediate states

While \glsplmrnn outperform simple \glsplrnn in character-level language modeling, they have been found wanting with respect to the popular \glslstm [9]. This prompted [15] to apply the multiplicative ”trick” to \glslstm resulting in the \glsmlstm, which achieved promising results in several language modeling tasks [15].

4.1 mLSTM

Gated \glsplrnn, such as \glslstm and \glsgru, use gates to help signals move through the network. The value of these gates is computed in much the same way as the candidate hidden state, albeit with different parameters. For example, \glslstm uses two different gates, i and f in updating its memory cell, ct,

ct=ft*ct-1+it*tanh(h~t). (6)

It uses another gate, o, in mapping ct to the new hidden state, ht,

ht=ot*σ(ct), (7)

where σ is the sigmoid function, squashing its input between 0 and 1. f and i are known as forget and input gates, respectively. The forget gates allows the network to ignore components of the value of the memory cell at the past state. The input gate filters out certain components of the new hidden state. Finally, the output gates separates the memory cell from the actual hidden state. The values of these gates are computed at each time step as follows:

it=σ(Uixt+Wiht-1) (8)
ft=σ(Ufxt+Wfht-1) (9)
ot=σ(Uoxt+Woht-1). (10)

Each gate has its own set of parameters to infer. If we were to replace each W by a tensor factorization as in \glsmrnn, we would obtain a \glsmlstm model. However, in the original formulation of \glsmlstm, there is no factorization of each would-be 𝐖 individually. There is no separate intermediate state for each gate, as one would expect. Instead, a single intermediate state, mt, is computed to replace ht-1 in all equations in the system, by Eq.5. Furthermore, each gate has its own V weighting mt. Their values are computed as follows:

it=σ(Wiht-1+Vimt) (11)
ft=σ(Wfht-1+Vfmt) (12)
ot=σ(Woht-1+Vomt). (13)

The model can therefore no longer be understood as as an approximation of the \glstrnn. Nonetheless, it has achieved empirical success in \glsnlp. We therefore try to explore the empirical merits of this shared parametrization and apply them to other \glsrnn architectures.

4.2 True mLSTM

We have presented the original \glsmlstm model with its shared intermediate state. If we wish to remain true to the original multiplicative model, however, we have to factorize every would-be W tensor separately. We have:

it=σ(Uixt+Vimi,t) (14)
ft=σ(Ufxt+Vfmf,t) (15)
ot=σ(Uoxt+Vomo,t), (16)

with each m,t being given by a separate set of parameters:

m,t=(W,xxt)*(W,hht-1). (17)

We henceforth refer to this model as \glstmlstm. We sought to apply the same modifications to the \glsgru model, as \glslstm and \glsgru are known to perform similarly [8, 4, 12]. That is, we build a \glstmgru model, as well as a \glsmgru with a shared intermediate state.

4.3 GRU

The \glsgru was first proposed by [3] as a lighter, simpler variant of \glslstm. \Glsgru relies on two gates, called, respectively, the update and reset gates, and no additional memory cell. These gates intervene in the computation of the hidden state as follows:

ht=(1-zt)ht-1+zttanh(h~t), (18)

where the candidate hidden state, h~t, is given by:

h~t=Uhxt+Wh(rt*ht-1). (19)

The update gate deletes specific components of the hidden state and replaces them with those of the candidate hidden state, thus updating its content. On the other hand, the reset gate allows the unit to start anew, as if it were reading the first symbol of the input sequence. They are computed much in the same way as the gates of \glslstm:

zt=σ(Uzxt+Wzht-1), (20)
rt=σ(Urxt+Wrht-1). (21)

4.4 True mGRU

We can now make \glsgru multiplicative by using the tensor factorization for z and r:

zt=σ(Uzxt+Vzmz,t), (22)
rt=σ(Urxt)+Vrmr,t, (23)

with each m,t given by Eq. 17. There is a subtlety to computing h~t, as we need to apply the reset gate to ht-1. While ht itself is given by Eq. 4, mh,t is not computed the same way as in \glsmlstm and \glsmrnn. Instead, it is given by:

mh,t=(Wxxt)*(Wh(rt*ht-1)). (24)

4.5 mGRU with shared intermediate state

Sharing an intermediate state is not as immediate for \glsgru. This is due to the application of rt, which we need in computing the intermediate state that we want to share. That is, rt and mt would both depend on each other. We modify the role of rt to act as a filter on mt, rather than a reset on individual components of ht-1. Note that, when all components of rt go to zero, it amounts to having all components of ht-1 at zero. We have

zt=σ(Uzxt+Vzmt) (25)

and

rt=σ(Urxt+Vrmt). (26)

h~t is given by

h~t=Uhxt+Vh(rt*mt), (27)

with mt the same as in \glsmrnn and \glsmlstm this time, i.e. Eq.5. The final hidden state is computed the same way as in the original \glsgru (Eq.18).

5 Experiments in character-level language modeling

Character-level language modeling (or character prediction) consists in predicting the next character while reading a document one character at a time. It is a common benchmark for \glsplrnn because of the heightened need for shared parametrization when compared to word-level models. We test \glsmgru on two well-known datasets, the Penn Treebank and Text8.

5.1 Penn Treebank

The Penn Treebank dataset [17] comes from a series of Wall Street Journal articles written in English. Following [18], sections 0-20 were used for training, 21-22 for validation and 23-24 for testing, respectively, which amounts to 5.1M, 400K and 450K characters, respectively.

The vocabulary consists of 10K lowercase words. All punctuation is removed and numbers were substituted for a single capital N. All words out of vocabulary are replaced by the token <unk>.

The training sequences were passed to the model in batches of 32 sequences. Following [15], we built an initial \glsmlstm model of 700 units. However, we set the dimensionality of the intermediate state to that of the input in order to keep the model small. We do the same for our \glsmgru, \glstmlstm and \glstmgru, changing only the size of the hidden state so that all four models have roughly the same parameter count. We trained it using the Adam optimizer [13], selecting the best model on validation over 10 epochs. We apply no regularization other than a checkpoint which keeps the best model over all epochs. The performance of the model is evaluated using cross entropy in \glsbpc, which is log2 of perplexity.

All models outperform previously reported results for \glsmlstm [15] despite lower parameter counts. This is likely due to our relatively small batch size. However, they perform fairly similarly. Encouraged by these results, we built an \glsmgru with both hidden and intermediate state sizes set to that of the original \glsmlstm (700). This version highly surpasses the previous state of the art while still having fewer parameters than previous work.

For the sake of comparison, results as well as parameter counts (where available) of our models (bold) and related approaches are presented in Table 1. \glsmgru and larger \glsmgru, our best models, achieved respectively an error of 1.07 and 0.98 \glsbpc on the test data, setting a new state of the art for this task.

Model Parameter count Error(BPC)
\glsgru [1] 3M 1.53
\glsmrnn [18] - 1.41
\glslstm [5] - 1.38
batch-normalized \glslstm [5] - 1.32
\glsmlstm [15] - 1.27
fast-slow \glslstm [19] 7.2M 1.19
mLSTM 292K 1.11
tmLSTM 292K 1.09
tmGRU 292K 1.08
mGRU 292K 1.07
larger mGRU 2.1M 0.98
Table 1: Test set error on Penn Treebank and parameter counts in character-level language modeling

5.2 Text8

The Text8 corpus [10] comprises the first 100M plain text characters in English from Wikipedia in 2006. As such, the alphabet consists of the 26 letters of the English alphabet as well as the space character. No vocabulary restrictions were put in place. As per [18], the first 90M and 5M characters were used for training and validation, respectively, with the last 5M used for testing.

Model Parameter count Error (BPC)
\glsgru [1] 5M 1.53
\glsmrnn [18] - 1.54
\glslstm [5] - 1.43
\glsmlstm [15] 20M 1.42
mLSTM 133K 1.37
batch-normalized \glslstm[5] - 1.36
tmGRU 133K 1.35
tmLSTM 133K 1.35
mGRU 133K 1.35
large \glsmlstm [15] 46M 1.27
larger \glsmgru 877K 1.21
\glslstm [14]* 45M 1.19
Table 2: Test set error on Text8 and parameter counts in character-level language modeling

Encouraged by our results on the Penn Treebank dataset, we opted to use similar configurations. However, as the data is one long sequence of characters, we divide it into sequences of 200 characters. We pass these sequences to the model in slightly larger batches of 50 to speed up computation. Again, the dimensionality of the hidden state for \glsmlstm is set at 450 after the original model, and that of the intermediate state is set to the size of the alphabet. The size of the hidden state is adjusted for the other three models as it was for the PTB experiments. The model is also trained using the Adam optimizer over 10 epochs.

The best model as per validation data over 10 epochs achieves 1.40 \glsbpc on the test data, slightly surpassing an \glsmlstm of smaller hidden-state dimensionality (450) but larger parameter count. Our results are more modest, as are those of the original \glsmlstm. Once again, results do not vary greatly between models.

As with the Penn Treebank, we proceed with building an \glsmgru with both hidden and intermediate state sizes set to 450. This improves performance to 1.21 \glsbpc, setting a new state of the art for this task and surpassing a large \glsmlstm of 1900 units from [15] despite having far fewer parameters (45M to 5M).

For the sake of comparison, results as well as parameter counts of our models and related approaches are presented in Table 2. It should be noted that some of these models employ dynamic evaluation [7], which fits the model further during evaluation. We refer the reader to [14]. These models are indicated by a star.

6 Conclusion

We have found that competitive results can be achieved with \glsplmrnn using small models. We have not found significant differences in the approaches presented, despite added non-intuitive parameter-sharing constraints when controlling for model size. Our results are restricted to character-level language modeling. Along this line of thought, previous work on \glsplmrnn demonstrated their increased potential when compared to their regular variants [23, 15, 21]. We therefore offer other variants as well as a first investigation into their differences. We hope to have evinced the impact of increased flexibility in hidden-state transitions on \glsplrnn sequence-modeling capabilities. Further work in this area is required to transpose these findings into applied tasks in \glsnlp.

References

  • [1] Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR abs/1803.01271 (2018), http://arxiv.org/abs/1803.01271
  • [2] Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2), 157–166 (1994)
  • [3] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078 (2014)
  • [4] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555 (2014)
  • [5] Cooijmans, T., Ballas, N., Laurent, C., Courville, A.C.: Recurrent Batch Normalization. CoRR abs/1603.09025 (2016), http://arxiv.org/abs/1603.09025
  • [6] Ghodsi, A., DeNero, J.: An analysis of the ability of statistical language models to capture the structural properties of language. In: Proceedings of the 9th International Natural Language Generation conference. pp. 227–231 (2016)
  • [7] Graves, A.: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013), http://arxiv.org/abs/1308.0850
  • [8] Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems (2016)
  • [9] Hochreiter, S., Schmidhuber, J.: Long Short-term Memory. Neural computation 9(8), 1735–1780 (1997)
  • [10] Hutter, M.: Human Knowledge Compression Contest (2006), http://prize.hutter1.net/
  • [11] Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., Daumé III, H.: A neural network for factoid question answering over paragraphs. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 633–644 (2014)
  • [12] Jozefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15). pp. 2342–2350 (2015)
  • [13] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980
  • [14] Krause, B., Kahembwe, E., Murray, I., Renals, S.: Dynamic evaluation of neural sequence models. CoRR abs/1709.07432 (2017), http://arxiv.org/abs/1709.07432
  • [15] Krause, B., Lu, L., Murray, I., Renals, S.: Multiplicative LSTM for Sequence Modelling. arXiv preprint arXiv:1609.07959 (2016)
  • [16] Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 1412–1421 (2015)
  • [17] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: The Penn Treebank. Computational linguistics 19(2), 313–330 (1993)
  • [18] Mikolov, T., Sutskever, I., Deoras, A., Le, H.S., Kombrink, S., Cernocky, J.: Subword language modeling with neural networks. preprint (http://www. fit. vutbr. cz/imikolov/rnnlm/char. pdf) (2012)
  • [19] Mujika, A., Meier, F., Steger, A.: Fast-slow recurrent neural networks. In: Advances in Neural Information Processing Systems. pp. 5917–5926 (2017)
  • [20] Paulus, R., Xiong, C., Socher, R.: A deep reinforced model for abstractive summarization. CoRR abs/1705.04304 (2017), http://arxiv.org/abs/1705.04304
  • [21] Radford, A., Józefowicz, R., Sutskever, I.: Learning to Generate Reviews and Discovering Sentiment. CoRR abs/1704.01444 (2017)
  • [22] Socher, R., Bengio, Y., Manning, C.: Deep Learning for NLP. Tutorial at Association of Computational Logistics (ACL), 2012, and North American Chapter of the Association of Computational Linguistics (NAACL) (2013)
  • [23] Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 1017–1024 (2011)
  • [24] Taylor, G.W., Hinton, G.E.: Factored conditional restricted Boltzmann machines for modeling motion style. In: Proceedings of the 26th annual international conference on machine learning. pp. 1025–1032. ACM (2009)