Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings
Since Bahdanau et al. [1] first introduced attention for neural machine translation, most sequence-to-sequence models have made use of attention mechanisms [2, 3, 4]. While they produce soft-alignment matrices that could be interpreted as alignments between target and source languages, we lack metrics to quantify their quality, and it is unclear which approach produces the best alignments. This paper presents an empirical evaluation of three of the main sequence-to-sequence models for word discovery from unsegmented phoneme sequences: CNN, RNN and Transformer-based. This task consists of aligning word sequences in a source language with phoneme sequences in a target language, and inferring from this alignment a word segmentation on the target side [5]. Evaluating word segmentation quality can be seen as an extrinsic evaluation of the soft-alignment matrices produced during training. Our experiments in a low-resource scenario on Mboshi and English (both aligned to French) show that RNNs surprisingly outperform CNNs and Transformers for this task. Our results are confirmed by an intrinsic evaluation of alignment quality through the use of Average Normalized Entropy (ANE). Lastly, we improve our best word discovery model by using an alignment entropy confidence measure that accumulates ANE over all the occurrences of a given alignment pair in the collection.
Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
Laboratoire d’Informatique de Grenoble, Univ. Grenoble Alpes (UGA), France
School of Computer Science and Electronic Engineering, University of Essex, UK
Institute of Informatics, Federal University of Rio Grande do Sul, Brazil
contact: [email protected]
Index Terms: sequence-to-sequence models, soft-alignment matrices, word discovery, low-resource languages, computational language documentation
1 Introduction
Sequence-to-Sequence (S2S) models can solve many tasks where source and target sequences have different lengths. To learn to focus on specific parts of the input at decoding time, most of these models are equipped with attention mechanisms [1, 2, 3, 4, 6]. By-products of the attention are soft-alignment probability matrices, which can be interpreted as alignments between target and source. However, we lack metrics to quantify their quality. Moreover, while these models perform very well in typical use cases, it is not clear how they are affected by low-resource scenarios.
This paper proposes an empirical evaluation of well-known S2S models for a particular S2S modeling task. This task consists of aligning word sequences in a source language with phoneme sequences in a target language, and inferring from this alignment a word segmentation on the target side [5]. We concentrate on three models: Convolutional Neural Networks (CNN) [2], Recurrent Neural Networks (RNN) [1] and Transformer-based models [3]. While this word segmentation task can be used for the extrinsic evaluation of the soft-alignment probability matrices produced during S2S learning, we also introduce Average Normalized Entropy (ANE), a task-agnostic confidence metric to quantify the quality of the source-to-target alignments obtained. Our experiments, performed in a low-resource scenario for two languages (Mboshi and English) using equivalently sized corpora aligned to French, are, to our knowledge, the first empirical evaluation of these well-known S2S models for a word segmentation task. We also illustrate how our entropy-based metric can be used in a language documentation scenario, helping a linguist to efficiently discover types, in an unknown language, from an unsegmented sequence of phonemes. This work is thus also a contribution to the emerging computational language documentation domain [7, 8, 9, 10, 11], whose main goal is the creation of automatic approaches able to help the documentation of the many languages soon to be extinct [12].
Lastly, studies focused on comprehensive attention mechanisms for NMT [13, 14, 15] lack evaluation of the resulting alignments, and the exceptions [16] do so for the task of word-to-word alignment in well-resourced languages. In contrast, our work is not only an empirical evaluation of NMT models focused on alignment quality, but it also tackles the data scarcity of low-resource scenarios.
2 Experimental Settings
2.1 Unsupervised Word Segmentation from Speech
In language documentation scenarios, available corpora usually contain speech in the language to document, aligned with translations in a well-resourced language. For this setting, Godard et al. [5] introduced a pipeline for performing Unsupervised Word Segmentation (UWS) from speech. The system outputs time-stamps delimiting stretches of speech, associated with class labels, corresponding to real words in the language. The pipeline first transforms speech into a sequence of phonemes, either through Automatic Unit Discovery (e.g. [17]) or manual transcription. The phoneme sequences, together with their translations, are then fed to an attention-based S2S system that produces soft-alignment probability matrices between target and source languages. The alignment probability distributions between the phonemes and the translation words (as in Figure 1) are used to cluster (segment) together neighboring phonemes whose alignment distribution peaks at the same word. The final speech segmentation is evaluated using the Zero Resource Challenge (ZRC) 2017 evaluation suite (track 2), available at http://zerospeech.com/2017. We increment over [5] by removing silence labels before training and using them for segmentation, which results in slightly better scores.
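The clustering step described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the function name `segment_from_alignment` and the toy probabilities are assumptions, and ties between equally probable words are resolved by taking the first one.

```python
def segment_from_alignment(phonemes, matrix):
    """Group neighbouring phonemes whose alignment distribution peaks at the same word.

    phonemes: list of phoneme symbols (target side).
    matrix:   one row per phoneme; each row is a probability distribution
              over the words of the translation (source side).
    Returns a list of (segment, word_index) pairs.
    """
    if len(phonemes) != len(matrix):
        raise ValueError("one alignment row is needed per phoneme")
    segments = []
    previous_peak = None
    for phoneme, row in zip(phonemes, matrix):
        # Index of the translation word with the highest alignment probability.
        peak = max(range(len(row)), key=row.__getitem__)
        if peak == previous_peak:
            segments[-1][0].append(phoneme)      # same word: extend the current segment
        else:
            segments.append(([phoneme], peak))   # different word: open a new segment
        previous_peak = peak
    return [("".join(units), word) for units, word in segments]


# Toy example (invented probabilities) with two translation words.
toy_phonemes = ["HH", "AH1", "SH", "S", "ER1"]
toy_matrix = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]
toy_segments = segment_from_alignment(toy_phonemes, toy_matrix)
```

With these toy values, the first three phonemes peak at word 0 and the last two at word 1, so `toy_segments` is `[("HHAH1SH", 0), ("SER1", 1)]`.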
2.2 Parallel Speech Corpora
The parallel speech corpora used in this work are the English-French (EN-FR) [18] and the Mboshi-French (MB-FR) [19] parallel corpora. The EN-FR corpus is a 33,192-sentence multilingual extension of LibriSpeech [20], with English audio books automatically aligned to French translations. The MB-FR corpus contains 5,130 sentences from the language documentation process of Mboshi (Bantu C25), an endangered language spoken in Congo-Brazzaville. Thus, while the former corpus presents a larger vocabulary and longer sentences, the latter presents a more tailored environment, with short sentences and simpler vocabulary. To provide a fair comparison, and to study the impact of corpus size, the EN-FR corpus was also down-sampled to 5K utterances (the exact same size as the MB-FR corpus). Sub-sampling was conducted preserving the average number of tokens per sentence, shown in Table 1.
| #types | #tokens | average token length | average #tokens / sentence |
2.3 Introducing Average Normalized Entropy (ANE)
In this paper, we focus on studying the soft-alignment probability matrices resulting from the learning of S2S models for the UWS task. To assess the overall quality of these matrices without having gold alignment information, we introduce Average Normalized Entropy (ANE).
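Definition: Given a source and target pair $(s, t)$ of lengths $|s|$ and $|t|$ respectively, for every phone $t_i$, the normalized entropy (NE) is computed considering all possible words in $s$ (Equation 1), where $P(t_i, s_j)$ is the alignment probability between the phone $t_i$ and the word $s_j$ (a cell in the matrix). The ANE for a sentence is then defined as the arithmetic mean of the resulting NE over every phone of the sequence $t$ (Equation 2).

$$\mathrm{NE}(t_i, s) = -\sum_{j=1}^{|s|} P(t_i, s_j) \log_{|s|} P(t_i, s_j) \qquad (1)$$

$$\mathrm{ANE}(t, s) = \frac{1}{|t|} \sum_{i=1}^{|t|} \mathrm{NE}(t_i, s) \qquad (2)$$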
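The definition above can be sketched in plain Python as follows. The function names are hypothetical, and two conventions are assumptions of this sketch rather than statements from the paper: cells with zero probability contribute nothing to the entropy, and a sentence with a single source word has an NE of 0.

```python
import math


def normalized_entropy(row):
    """Entropy of one phone's alignment distribution, using a logarithm of base |s|."""
    n_words = len(row)
    if n_words < 2:
        return 0.0  # assumption: a single candidate word leaves no uncertainty
    # Assumption: zero-probability cells are skipped (0 * log 0 is taken as 0).
    return -sum(p * math.log(p, n_words) for p in row if p > 0.0)


def sentence_ane(matrix):
    """Arithmetic mean of the normalized entropy over all phones of a sentence."""
    return sum(normalized_entropy(row) for row in matrix) / len(matrix)


# A uniform row is maximally uncertain (NE close to 1); a one-hot row is certain (NE of 0).
uniform_ne = normalized_entropy([0.5, 0.5])
one_hot_ne = normalized_entropy([1.0, 0.0])
mixed_ane = sentence_ane([[0.5, 0.5], [1.0, 0.0]])
```

Because the logarithm base is the number of source words, NE stays between 0 and 1 regardless of the translation length, which is what makes the values comparable across sentences.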
From this definition, we can derive ANE for different granularities (sub or supra-sentential) by accumulating its value for the full corpus, for a single type or for a single token. Corpus ANE will be used to summarize the overall performance of an S2S model on a specific corpus. Token ANE extends ANE to tokens by averaging NE for all phonemes from a single (discovered) token. Type ANE results from averaging the ANE for every token instance of a discovered type. Finally, Alignment ANE is the result of averaging the ANE for every discovered (type, translation word) alignment pair. The intuition that lower ANEs correspond to better alignments is exemplified in Figure 1.
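As a rough illustration of these granularities, the sketch below derives Token, Type and Alignment ANE from per-phoneme NE values. The input layout (one tuple per discovered token), the function name `aggregate_ane` and all numeric values are assumptions made for the example; the word "baiser" is an invented second translation candidate.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def aggregate_ane(tokens):
    """tokens: iterable of (discovered_type, translation_word, per_phoneme_ne).

    Returns (type_ane, alignment_ane) dictionaries.
    """
    by_type = defaultdict(list)
    by_pair = defaultdict(list)
    for discovered_type, word, per_phoneme_ne in tokens:
        token_ane = mean(per_phoneme_ne)              # Token ANE: mean NE of its phonemes
        by_type[discovered_type].append(token_ane)
        by_pair[(discovered_type, word)].append(token_ane)
    type_ane = {t: mean(v) for t, v in by_type.items()}            # mean over instances of a type
    alignment_ane = {pair: mean(v) for pair, v in by_pair.items()}  # mean over (type, word) pairs
    return type_ane, alignment_ane


# Invented NE values for three occurrences of one discovered type.
toy_tokens = [
    ("KIH1S", "embrasse", [0.1, 0.3]),
    ("KIH1S", "embrasse", [0.2, 0.2]),
    ("KIH1S", "baiser", [0.6, 0.8]),
]
toy_type_ane, toy_alignment_ane = aggregate_ane(toy_tokens)
```

In this toy case the type mixes a confident alignment pair with an uncertain one, so Alignment ANE separates them while Type ANE averages over both.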
3 Empirical Comparison of S2S Models
We compare three NMT models (Sections 3.1, 3.2 and 3.3) for UWS, focusing on their ability to align words (French) with phonemes (English or Mboshi) in medium-low resource settings. The results, an analysis of the impact of data size and quality, and the correlation between intrinsic (ANE) and extrinsic (boundary F-score) metrics are presented in Section 3.4. The application of ANE for type discovery in low-resource settings is presented in Section 3.5.
3.1 RNN: Attention-based Encoder-Decoder
The classic RNN encoder-decoder model [1] connects a bidirectional encoder with a unidirectional decoder by the use of an alignment module. The RNN encoder learns annotations for every source token, and these are weighted by the alignment module for the generation of every target token. Weights are defined as context vectors, since they capture the importance of every source token for the generation of each target token.
Attention mechanism: a context vector $c_t$ for a decoder step $t$ is computed using the set of source annotations $H$ and the last state of the decoder network $s_{t-1}$ (translation context). The attention is the result of the weighted sum of the source annotations $H_i$ (with $i = 1, \dots, A$) and their probabilities $\alpha_i^t$ (3), obtained through a feed-forward network $\mathrm{align}$ (4).

$$c_t = \mathrm{Att}(H, s_{t-1}) = \sum_{i=1}^{A} \alpha_i^t H_i \qquad (3)$$

$$\alpha_i^t = \mathrm{softmax}(\mathrm{align}(H_i, s_{t-1})) \qquad (4)$$
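A minimal sketch of this attention step for a single decoder step is given below. The form of the feed-forward scoring network (a `tanh` layer followed by a dot product with a vector `v`) is an assumption borrowed from the usual additive-attention formulation, and the parameter names `w_h`, `w_s` and `v` are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(annotations, prev_state, w_h, w_s, v):
    """Return (context vector, alignment probabilities) for one decoder step."""
    projected_state = matvec(w_s, prev_state)
    scores = []
    for h_i in annotations:
        # Assumed scoring network: v . tanh(W_h h_i + W_s s_prev)
        hidden = [math.tanh(a + b) for a, b in zip(matvec(w_h, h_i), projected_state)]
        scores.append(sum(v_k * x for v_k, x in zip(v, hidden)))
    alphas = softmax(scores)  # one probability per source annotation
    dim = len(annotations[0])
    # Context vector: probability-weighted sum of the source annotations.
    context = [sum(a * h_i[d] for a, h_i in zip(alphas, annotations)) for d in range(dim)]
    return context, alphas


# Toy parameters: identity projection for annotations, decoder state ignored.
toy_context, toy_alphas = additive_attention(
    annotations=[[1.0, 0.0], [0.0, 1.0]],
    prev_state=[0.0],
    w_h=[[1.0, 0.0], [0.0, 1.0]],
    w_s=[[0.0], [0.0]],
    v=[1.0, 0.0],
)
```

Collecting `alphas` for every decoder step yields one row of the soft-alignment probability matrix per generated target token.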
3.2 Transformer
Transformer [3] is a fully attentional S2S architecture, which has obtained state-of-the-art results for several NMT shared tasks. It replaces the use of sequential cell units (such as LSTM) by Multi-Head Attention (MHA) operations, which make the architecture considerably faster. Both encoder and decoder networks are stacks of layers that receive source and target sequences, embedded and concatenated with positional encoding. An encoder layer is made of two sub-layers: a Self-Attention MHA and a feed-forward sub-layer. A decoder layer is made of three sub-layers: a masked Self-Attention MHA (no access to subsequent positions); an Encoder-Decoder MHA (operation over the encoder stack’s final output and the decoder self-attention output); and a feed-forward sub-layer. Dropout and residual connections are applied between all sub-layers. Final output probabilities are generated by applying a linear projection over the decoder stack’s output, followed by a softmax operation.
Multi-head attention mechanism: attention is seen as a mapping problem: given a pair of key-value vectors and a query vector, the task is the computation of the weighted sum of the given values (output). In this setup, weights are learned by compatibility functions between key-query pairs (of dimension $d_k$). For a given query ($Q$), keys ($K$) and values ($V$) set, the Scaled Dot-Product (SDP) Attention function is computed as:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (5)$$

In practice, several attentions are computed for a given QKV set. The QKV set is first projected into different spaces (multiple heads), where the SDP attention is computed in parallel. Resulting values for all heads are then concatenated and once again projected, yielding the layer’s output. Equations (6) and (7) illustrate the process, in which $H$ is the set of heads ($H = \{h_1, \dots, h_n\}$) and $f$ denotes a linear projection (with separate learned weights for each use). Self-Attention defines the case where queries and values come from the same source (learning compatibility functions within the same sequence of elements).

$$\mathrm{MHA}(Q, K, V) = f(\mathrm{Concat}(H)) \qquad (6)$$

$$h_i = \mathrm{Att}(f_i^{Q}(Q), f_i^{K}(K), f_i^{V}(V)) \qquad (7)$$
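The following plain-Python sketch illustrates scaled dot-product attention and the multi-head wrapper described above. It is a didactic assumption-laden version: matrices are nested lists, masking, dropout and biases are omitted, and the names `heads` and `w_o` (per-head projection matrices and output projection) are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sdp_attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(q, k, v, heads, w_o):
    """heads: list of (w_q, w_k, w_v) projection matrices, one triple per head."""
    outputs = [
        sdp_attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))[0]
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs position by position, then project once more.
    concatenated = [sum((head[i] for head in outputs), []) for i in range(len(q))]
    return matmul(concatenated, w_o)


identity = [[1.0, 0.0], [0.0, 1.0]]
toy_q = [[1.0, 0.0]]
toy_k = [[1.0, 0.0], [0.0, 1.0]]
toy_v = [[1.0, 2.0], [3.0, 4.0]]
single_out, single_weights = sdp_attention(toy_q, toy_k, toy_v)
# Two identical heads averaged by the output projection reproduce the single-head output.
toy_w_o = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
multi_out = multi_head_attention(toy_q, toy_k, toy_v, [(identity,) * 3, (identity,) * 3], toy_w_o)
```

Each head produces its own weight matrix (`single_weights` here), which is why a choice has to be made about which head, or which combination of heads, to read as an alignment.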
3.3 CNN: Pervasive Attention
Different from the previous models, which are based on encoder-decoder structures interfaced by attention mechanisms, this approach relies on a single 2D CNN across both sequences (no separate coding stages) [2]. Using masked convolutions, an auto-regressive model predicts the next output symbol based on a joint representation of both input and partial output sequences. Given a source-target pair $(s, t)$ of lengths $|s|$ and $|t|$ respectively, tokens are first embedded in $d_s$ and $d_t$ dimensional spaces via look-up tables. Token embeddings $\{x_1, \dots, x_{|s|}\}$ and $\{y_1, \dots, y_{|t|}\}$ are then concatenated to form a 3D tensor $X \in \mathbb{R}^{|t| \times |s| \times f_0}$, with $f_0 = d_t + d_s$, where:

$$X_{ij} = [y_i \; x_j]$$

Each convolutional layer $l$ of the model produces a tensor $H^{l}$ of size $|t| \times |s| \times f_l$, where $f_l$ is the number of output channels for that layer. To compute a distribution over the tokens in the output vocabulary, the second dimension of the tensor is used. This dimension is of variable length (given by the input sequence) and it is collapsed by max or average pooling to obtain the tensor $H^{\mathrm{Pool}}$ of size $|t| \times f_l$. Finally, a $1 \times 1$ convolution followed by a softmax operation is applied, resulting in the distribution over the target vocabulary for the next output token.
Attention mechanism: joint encoding acts as an attention-like mechanism, since individual source elements are re-encoded as the output is generated. The self-attention approach of [21] is applied. It computes the attention weight tensor $\alpha$, of size $|t| \times |s|$, from the last activation tensor $H^{L}$, to pool the elements of the same tensor along the source dimension, as follows:

$$\alpha = \mathrm{softmax}(W_1 \tanh(H^{L} W_2))$$

$$H^{\mathrm{Att}} = \alpha H^{L}$$

where $W_1 \in \mathbb{R}^{f_a}$ and $W_2 \in \mathbb{R}^{f_L \times f_a}$ are weight tensors that map the $f_L$ dimensional features in $H^{L}$ to the attention weights via an $f_a$ dimensional intermediate representation.
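To make the joint representation and the attention pooling more concrete, here is a small plain-Python sketch. It is an assumption-heavy illustration: the masked convolutional layers are skipped entirely (pooling is applied directly to the joint tensor), the softmax is taken over the source dimension, and the names `joint_tensor`, `attention_pool`, `w1` and `w2` are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_tensor(source_emb, target_emb):
    """Cell (i, j) holds the target embedding i concatenated with the source embedding j."""
    return [[list(y) + list(x) for x in source_emb] for y in target_emb]


def attention_pool(h, w1, w2):
    """h: |t| x |s| x f features; w2: f x f_a matrix; w1: vector of length f_a.

    Returns (pooled features of size |t| x f, attention weights of size |t| x |s|).
    """
    pooled, weights = [], []
    for row in h:                       # one target position
        scores = []
        for features in row:            # one source position
            hidden = [
                math.tanh(sum(f * w2[k][a] for k, f in enumerate(features)))
                for a in range(len(w1))
            ]
            scores.append(sum(w * x for w, x in zip(w1, hidden)))
        alpha = softmax(scores)         # normalise over the source dimension
        weights.append(alpha)
        pooled.append(
            [sum(a * features[k] for a, features in zip(alpha, row)) for k in range(len(row[0]))]
        )
    return pooled, weights


# Toy embeddings of size 1: three source tokens, two target tokens.
toy_x = joint_tensor(source_emb=[[1.0], [2.0], [3.0]], target_emb=[[10.0], [20.0]])
toy_pooled, toy_alpha = attention_pool(toy_x, w1=[1.0], w2=[[0.0], [1.0]])
```

The `toy_alpha` tensor plays the role of the soft-alignment matrix for this architecture: one distribution over source positions per target position.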
3.4 Comparing S2S Architectures
For each S2S architecture and each of the three corpora, we train five models (runs) with different initialization seeds, using the RNN, CNN and Transformer implementations from [22, 2, 23] respectively. Before segmenting, we average the matrices produced by the five runs, as in previous work. Evaluation is done in a bilingual segmentation condition that corresponds to the real UWS task. In addition, we perform segmentation in a monolingual condition, where a phoneme sequence is segmented with regard to the corresponding word sequence (transcription) in the same language (hence monolingual). This task can be seen as an automatic extraction of a pronunciation lexicon from parallel word/phoneme sequences. Our networks are optimized for the monolingual task. Across all architectures, we use embeddings of size 64 and a batch size of 32 (5K data set), or embeddings of size 128 and a batch size of 64 (33K data set). Dropout of 0.5 and an early-stopping procedure are applied in all cases. RNN models have only one layer, a bi-directional encoder, and cell size equal to the embedding size, as in previous work. CNN models use the hyper-parameters from [2] with only 3 layers (5K data set) or 6 layers (33K data set), and a kernel size of 3. Transformer models were optimized starting from the original hyper-parameters of [3]. Best results (among 50 setups) were achieved using 2 heads, 3 layers (encoder and decoder), a warm-up of 5K steps, and cross-entropy loss without label-smoothing. Finally, for selecting which head to use for UWS, we experimented with averaging the last layer’s heads and with selecting the head with minimum corpus ANE. While the results were not significantly different, we kept the ANE selection.
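The two matrix-level operations mentioned in this setup, averaging the soft-alignment matrices of several runs and picking the head with minimum corpus ANE, can be sketched as below. The function names are hypothetical, and computing corpus ANE as the mean of sentence ANEs is an assumption of this sketch.

```python
import math


def sentence_ane(matrix):
    """Mean normalized entropy of the rows of one soft-alignment matrix."""
    total = 0.0
    for row in matrix:
        n_words = len(row)
        if n_words > 1:
            total += -sum(p * math.log(p, n_words) for p in row if p > 0.0)
    return total / len(matrix)


def average_matrices(runs):
    """Element-wise mean of same-shaped matrices, one matrix per training run."""
    n_runs = len(runs)
    return [[sum(cells) / n_runs for cells in zip(*rows)] for rows in zip(*runs)]


def corpus_ane(matrices):
    """Assumed corpus-level ANE: mean of the sentence ANEs."""
    return sum(sentence_ane(m) for m in matrices) / len(matrices)


def select_head(heads):
    """heads: dict mapping a head name to its list of sentence matrices."""
    return min(heads, key=lambda name: corpus_ane(heads[name]))


averaged = average_matrices([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
])
best_head = select_head({
    "head_0": [[[0.5, 0.5]]],   # uniform alignment: high ANE
    "head_1": [[[0.9, 0.1]]],   # peaked alignment: lower ANE
})
```

Since every row of every run sums to one, the averaged rows remain valid probability distributions and can be segmented like any single-run matrix.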
3.4.1 Unsupervised Word Segmentation Results
The word boundary F-scores for the task of UWS from phoneme sequences (in Mboshi or English) are presented in Table 2, with monolingual results shown for information only (topline). For CNN and RNN, the average standard deviation for the bilingual task is less than 0.8%; for Transformer, it is almost 4%. Surprisingly, RNN models outperform the more recent (CNN and Transformer) approaches. One possible explanation is the lower number of parameters: for a 5K setup, the RNN trains 700K parameters on average, while CNN needs an additional 30.79% and Transformer an additional 5.31%. However, for 33K setups, CNNs actually need 30% fewer parameters than RNNs, but still perform worse. Transformer’s low performance could be due to the use of several heads “distributing” alignment information across different matrices. Nonetheless, we evaluated averaged heads and single-head models, and these resulted in significant decreases in performance. This suggests that this architecture may not need to learn explicit alignment to translate, but instead could be capturing different kinds of linguistic information, as discussed in the original paper and in its examples [3]. Also, on the decoder side, the behavior of the self-attention mechanism on phoneme units is unclear and under-studied so far. For the encoder, Voita et al. [15] performed after-training encoder head removal based on head confidence, showing that after initial training, most heads were not necessary for maintaining translation performance. Hence, we find the interpretation of the multi-head mechanism challenging, and maybe not suitable for a direct word segmentation application such as our method.
As in previous work, our best UWS method (RNN) for the bilingual task does not reach the performance level of a strong Bayesian baseline with F-scores of 89.80 (EN33K), 87.93 (EN5K), and 77.00 (MB5K). However, even if we only evaluate word segmentation performance, our neural approaches learn to segment and align, whereas this baseline only learns to segment. Section 3.5 will leverage those alignments for a type discovery task useful in language documentation.
The Pearson’s correlation coefficients between ANE and boundary F-scores, computed over all mono and bilingual runs of all corpora, are strongly negative for all three architectures (RNN, CNN, and Transformer), with low p-values. These strong negative correlations confirm our hypothesis that lower ANEs correspond to sharper and better alignments.
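For reference, a Pearson correlation between per-run ANE values and boundary F-scores can be computed as in the sketch below. The numbers are invented purely to show the direction of the relationship and are not the paper's results; significance testing (p-values) is not covered.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Invented values: runs with lower corpus ANE get higher boundary F-scores.
toy_ane = [0.1, 0.2, 0.3]
toy_f_scores = [90.0, 80.0, 70.0]
toy_correlation = pearson(toy_ane, toy_f_scores)
```

A coefficient close to -1, as in this toy case, is the pattern the paragraph above describes: the lower the ANE, the better the segmentation.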
3.4.2 Impact of Data Size and Quality
EN33K and EN5K results of Table 2 allow us to analyze the impact of data size on the S2S models. For the bilingual task, RNN performance drops by 7% on average, whereas performance drop is bigger for CNN (14-15%). Transformer performs poorly in both cases, and increasing data size from 5K to 33K seems to help only for a trivial task (see monolingual results).
The EN5K and MB5K results of Table 2 reflect the impact of language pairs on the S2S models. We know from [26, 27] that English should be easier to segment than Mboshi, and this was confirmed by both dpseg and monolingual results. However, this trend is not confirmed in the bilingual task, where the quality of the (sentence aligned) parallel corpus seems to have more impact (higher boundary F-scores for MB5K than for EN5K for all S2S models). As shown in Table 1, the MB-FR corpus has shorter sentences and smaller lexical diversity, while EN-FR is made of automatically aligned books (noisy alignments), which may explain our experimental results.
3.5 Type Discovery in Low-Resource Settings
| EN 5K | MB 5K |
We investigate the use of Alignment ANE as a confidence measure. From the RNN models, we extract and rank the discovered types by their ANE, and examine whether it can be used to separate true words in the discovered vocabulary from the rest. The results for low-resource scenarios (only 5K) in Table 3 suggest that low ANE corresponds to the portion of the discovered vocabulary the network is confident about, and these are, in most cases, true discovered lexical items (first row). Type ANE was also investigated for the retrieval task, and results were positive, but slightly worse than the ones from Alignment ANE. As we keep higher Alignment ANE values, we increase recall but lose precision. This suggests that, in a documentation scenario, ANE could be used as a confidence measure by a linguist to extract a list of types with higher precision, without having to go through the whole discovered vocabulary. Moreover, as exemplified for EN5K in Table 4, we also retrieve aligned information (translation candidates) for the generated lexicon.
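The thresholding idea described above can be sketched as follows: keep only the discovered types whose Alignment ANE is below a threshold, then score the resulting list against a gold vocabulary. The function name, the ANE values and the gold set are invented for illustration; only the type strings are taken from the examples in Table 4.

```python
def type_retrieval_scores(alignment_ane, gold_vocabulary, threshold):
    """Precision, recall and F-score of the types kept below an ANE threshold.

    alignment_ane:   dict mapping (discovered_type, translation_word) to its ANE.
    gold_vocabulary: set of true word types (phoneme strings).
    """
    selected = {t for (t, _word), ane in alignment_ane.items() if ane <= threshold}
    hits = len(selected & gold_vocabulary)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold_vocabulary) if gold_vocabulary else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented ANE values attached to (type, translation) pairs.
toy_alignment_ane = {
    ("KIH1S", "embrasse"): 0.05,
    ("KLER1K", "clerc"): 0.10,
    ("YUW1", "diable"): 0.50,
    ("D", "riant"): 0.90,
    ("N", "obéit"): 0.95,
}
toy_gold = {"KIH1S", "KLER1K", "YUW1"}
strict = type_retrieval_scores(toy_alignment_ane, toy_gold, threshold=0.2)
loose = type_retrieval_scores(toy_alignment_ane, toy_gold, threshold=1.0)
```

In this toy setting the strict threshold gives perfect precision with partial recall, while the loose one recovers every gold type at the cost of precision, mirroring the trade-off discussed in the text.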
| Rank | Top Low ANE | Top High ANE |
| 1 | SER1 (sir, ) | AH0 (a, convenable) |
| 2 | HHAH1SH (hush, chut) | IH1 (INV, ah) |
| 3 | FIH1SHER0 (fisher, fisher) | D (INV, riant) |
| 4 | KLER1K (clerk, clerc) | N (INV, obéit) |
| 5 | KIH1S (kiss, embrasse) | YUW1 (you, diable) |
4 Conclusion
We presented an empirical evaluation of different architectures (RNN, CNN and Transformer) with respect to their capacity to align word sequences in a source language with phoneme sequences in a target language, inferring from this alignment a word segmentation on the target side (UWS task). Pointers for corpora, parameters and implementations are available at https://gitlab.com/mzboito/attention_study. Although RNNs have been outperformed by CNN and Transformer-based models for machine translation, for UWS this architecture is still more robust in low-resource scenarios, and presents the best segmentation results. We also introduced ANE, an intrinsic measure of alignment quality of S2S models. Accumulating it over the discovered alignments, we showed it can be used as a confidence measure to select true words, increasing Type F-scores.
References
- [1] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
- [2] M. Elbayad, L. Besacier, and J. Verbeek, “Pervasive attention: 2d convolutional neural networks for sequence-to-sequence prediction,” arXiv preprint arXiv:1808.03867, 2018.
- [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
- [4] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 1243–1252.
- [5] P. Godard, M. Zanon Boito, L. Ondel, A. Berard, F. Yvon, A. Villavicencio, and L. Besacier, “Unsupervised word segmentation from speech with attention,” in Interspeech, 2018.
- [6] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3104–3112. [Online]. Available: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
- [7] G. Adda, S. Stüker, M. Adda-Decker, O. Ambouroue, L. Besacier, D. Blachon, H. Bonneau-Maynard, P. Godard, F. Hamlaoui, D. Idiatov, G.-N. Kouarata, L. Lamel, E.-M. Makasso, A. Rialland, M. V. de Velde, F. Yvon, and S. Zerbian, “Breaking the unwritten language barrier: The BULB project,” Procedia Computer Science, vol. 81, pp. 8–14, 2016.
- [8] A. Anastasopoulos and D. Chiang, “A case study on using speech-to-translation alignments for language documentation,” arXiv preprint arXiv:1702.04372, 2017.
- [9] L. Besacier, B. Zhou, and Y. Gao, “Towards speech translation of non written languages,” in Spoken Language Technology Workshop, 2006. IEEE. IEEE, 2006, pp. 222–225.
- [10] C. Lignos and C. Yang, “Recession segmentation: simpler online word segmentation using limited resources,” in Proceedings of the fourteenth conference on computational natural language learning. Association for Computational Linguistics, 2010, pp. 88–97.
- [11] C. Bartels, W. Wang, V. Mitra, C. Richey, A. Kathol, D. Vergyri, H. Bratt, and C. Hung, “Toward human-assisted lexical unit discovery without text resources,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 64–70.
- [12] P. K. Austin and J. Sallabank, The Cambridge handbook of endangered languages. Cambridge University Press, 2011.
- [13] K. Song, T. Xu, F. Peng, and J. Lu, “Hybrid self-attention network for machine translation,” arXiv preprint arXiv:1811.00253, 2018.
- [14] J. Li, Z. Tu, B. Yang, M. R. Lyu, and T. Zhang, “Multi-head attention with disagreement regularization,” arXiv preprint arXiv:1810.10183, 2018.
- [15] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned,” arXiv preprint arXiv:1905.09418, 2019.
- [16] T. Zenkel, J. Wuebker, and J. DeNero, “Adding interpretable attention to neural translation models improves word alignment,” arXiv preprint arXiv:1901.11359, 2019.
- [17] L. Ondel, L. Burget, and J. Černockỳ, “Variational inference for acoustic unit discovery,” Procedia Computer Science, vol. 81, pp. 80–86, 2016.
- [18] A. C. Kocabiyikoglu, L. Besacier, and O. Kraif, “Augmenting librispeech with french translations: A multimodal corpus for direct speech translation evaluation,” arXiv preprint arXiv:1802.03142, 2018.
- [19] P. Godard, G. Adda, M. Adda-Decker, J. Benjumea, L. Besacier, J. Cooper-Leavitt, G.-N. Kouarata, L. Lamel, H. Maynard, M. Müller et al., “A very low resource language speech corpus for computational language documentation experiments,” arXiv preprint arXiv:1710.03501, 2017.
- [20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, 2015, pp. 5206–5210. [Online]. Available: https://doi.org/10.1109/ICASSP.2015.7178964
- [21] Z. Lin, M. Feng, C. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” in ICLR, 2017.
- [22] A. Bérard, O. Pietquin, C. Servan, and L. Besacier, “Listen and translate: A proof of concept for end-to-end speech-to-text translation,” arXiv preprint arXiv:1612.01744, 2016.
- [23] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, “fairseq: A fast, extensible toolkit for sequence modeling,” arXiv preprint arXiv:1904.01038, 2019.
- [24] P. Godard, “Unsupervised word discovery for computational language documentation,” Ph.D. dissertation, Université Paris-Saclay, 2019.
- [25] S. Goldwater, T. L. Griffiths, and M. Johnson, “A Bayesian framework for word segmentation: Exploring the effects of context,” Cognition, vol. 112, no. 1, pp. 21–54, 2009.
- [26] A. Fourtassi, B. Börschinger, M. Johnson, and E. Dupoux, “Why is english so easy to segment?” in Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL), 2013, pp. 1–10.
- [27] A. Rialland, M. E. Aborobongui, M. Adda-Decker, and L. Lamel, “Dropping of the class-prefix consonant, vowel elision and automatic phonological mining in embosi (bantu c 25),” in Selected Proceedings of the 44th Annual Conference on African Linguistics, 2015, pp. 7–10.