MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

  • 2019-08-01 12:36:33
  • Marcely Zanon Boito, William N. Havard, Mahault Garnerin, Éric Le Ferrand, Laurent Besacier


The CMU Wilderness Multilingual Speech Dataset is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages is not exploited to date. Therefore, this article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for syntactically divergent language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs). Lastly, we showcase the usefulness of the final product on a bilingual speech retrieval task.


Marcely Zanon Boito1, William N. Havard1,2, Mahault Garnerin1,2, Éric Le Ferrand1, Laurent Besacier1

1LIG, Univ. Grenoble Alpes, CNRS, Grenoble INP, F-38000 Grenoble, France

2LIDILEM, Univ. Grenoble Alpes, F-38000 Grenoble, France


Both authors have contributed equally to this paper.

Index Terms: parallel speech corpus, multilingual alignment, speech-to-speech alignment, speech-to-speech translation.

1 Introduction

Recently, a remarkable work introduced the CMU Wilderness Multilingual Speech Dataset [1]. Based on readings of the New Testament from The Faith Comes By Hearing website, it provides data to build ASR and Text-to-Speech (TTS) models for potentially 700 languages. Such a resource allows the community to experiment with and develop speech technologies for an unprecedented number of languages. However, the initial language material of these monolingual corpora (the Bible) is the same for all languages, thus constituting a multilingual and comparable spoken corpus, and this fact has not been exploited to date. (In this work, a comparable corpus is a non-sentence-aligned corpus that is parallel at a broader granularity, e.g. chapter or document.)

Therefore, this article proposes an automatic pipeline for adding multilingual links between small speech segments in different languages. We apply our method to 8 languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish), resulting in 56 language pairs for which we obtain speech-to-speech, speech-to-text and text-to-text alignments. In order to ensure the quality of the pipeline, a human evaluation was performed on a corpus subset (8 language pairs, 100 sentences) by bilingual native speakers. The current version of our dataset (named MaSS for Multilingual corpus of Sentence-aligned Spoken utterances) is made freely available to the community, together with instructions and scripts for extending the pipeline to new languages.

We believe the obtained corpus can be useful in several applications, such as: speech-to-speech retrieval [2], multilingual speech representation learning [3] and direct speech-to-speech translation (so far, mostly direct speech-to-text translation has been investigated [4, 5, 6, 7]). Moreover, typological and dialectal fields could use this kind of corpus for solving some of the following novel tasks using parallel speech: word alignment, bilingual lexicon extraction and semantic retrieval.

This paper is organized as follows: after briefly presenting related works in Section 2, we review the dataset source and extraction pipeline in Section 3. Section 4 describes the human verification performed and comments on some of the linguistic features present in the covered languages. Section 5 presents a possible application of the dataset: speech-to-speech retrieval. Section 6 then concludes this work.

2 Related Work

2.1 End-to-end Speech Translation

Traditional automatic speech-to-text translation (AST) systems operate in two steps: source language speech recognition (ASR) and source-to-target text translation (MT). However, recent works have attempted to build end-to-end AST without using source language transcription during learning or decoding [4, 5], or using it at training time only [7]. Very recently, several extensions of these pioneering works were introduced: low-resource AST [8], unsupervised AST [9], and end-to-end speech-to-speech translation (Translatotron) [10]. Improvements of end-to-end AST were also proposed using weakly supervised data [11], or by adding a second attention mechanism [12].

2.2 Multilingual Approaches

In the meantime, multilingual approaches for speech and language processing are growing ever more popular. They are made possible by the availability of massively parallel language resources covering an increasing number of languages of the world. These resources feed truly multilingual approaches, such as machine translation [13], syntax parsing [14], automatic speech recognition [15, 16], lexical disambiguation [17, 18] and computational dialectology [19].

2.3 Corpora for End-to-end Speech Translation

To date, no dataset is available for multilingual automatic speech translation (no multi-source dataset is available, and only a few parallel corpora are publicly available). For instance, the Fisher and Callhome Spanish-English corpora [20] provide 38 hours of speech transcriptions of telephone conversations aligned with their translations. However, these corpora are only medium-sized and contain low-bandwidth recordings. The Microsoft Speech Language Translation (MSLT) corpus [21] also provides speech aligned to translated text, but this corpus is rather small (less than 8 hours per language). A 236-hour extension of Librispeech with French translations was proposed by [22]. They exploited automatic alignment procedures, first at the text level (between transcriptions and translations), and then between the text and the corresponding audio segments.

Inspired by this work, Di Gangi et al. [23] created MuST-C, a multilingual speech translation corpus for training end-to-end AST systems from English into 8 languages. Similar in size, the English-Portuguese dataset How2 [24] was created by translating English short tutorials into Portuguese using a crowd-sourcing platform. A remark that applies to all these corpora is that English is most often the only source language, and no (sentence-aligned) multilingual speech corpora including several source languages are available to date.

3 A Large and Clean Subset of Sentence Aligned Spoken Utterances (MaSS)

In this section we present the source material for our multilingual corpus (Section 3.1), we briefly explain the CMU speech-to-text pipeline (Section 3.2), and we detail our speech-to-speech pipeline (Section 3.3).

3.1 The Source Material

The Faith Comes By Hearing website is an online platform that provides audio-books of the Bible with transcriptions in 1,294 languages. These recordings are a collection of field, virtual and partner recordings. In all cases, only native speakers participate in the recordings, and the number of different voices can go from one up to twenty-five. Moreover, the recordings can be performed in drama and non-drama fashion, the former being an acted version of the text, corresponding to less tailored realizations. Finally, based on exchanges with the target users (the native community), background music can be added to the recordings. In summary, while the written content is always the same across different languages, the corresponding speech can be quite different in terms of realization (drama and non-drama), number of speakers, acoustic quality (field, virtual or partner recordings), and can sometimes contain background noise (music).

3.2 The CMU Wilderness Multilingual Speech Corpus

The CMU Wilderness corpus [1] is a speech dataset containing over 700 different languages for which it provides audio excerpts aligned with their transcription. Each language accounts for around 20 hours of data extracted from readings of the New Testament, and available at the website. Segmentation was made at the sentence level, and alignment between speech and corresponding text can be obtained with the pipeline provided along with the dataset. This pipeline, notably, can process a large number of languages without using any extra resources such as acoustic models or pronunciation dictionaries.

However, for most of the languages on the website, several recording versions are available, each of them having significant differences in speech content, as explained in Section 3.1. As this pipeline extracted the soundtracks from the default links, audio excerpts often contain music, and it is unknown whether drama or non-drama versions were selected. Thus, although the quality of the alignment is good for many languages, it could be inaccurate (or noisy) for an unknown subset.

Lastly, the final segmentation from chapters was obtained through the use of punctuation marks. While efficient for a speech-to-text monolingual scenario, this strategy does not allow accurate multilingual alignment, since different languages and translations may result in different sentence segmentation and ordering.

3.3 Our Pipeline: from Speech-to-text to Speech-to-speech Alignment

As far as multilingual alignment is concerned, Bible chapters are inherently aligned at the chapter level. But Bible chapters are very long excerpts, with an average duration of 5 minutes. Alignments at this broad granularity are not relevant for research in speech-to-speech translation or speech-to-speech alignment. Thus, we propose a new extraction methodology that allows us to obtain fully aligned speech segments at a much smaller granularity (segments between 8 and 10 seconds). Our pipeline is summarized in Figure 1 and described below.

Figure 1: The pipeline for a given language in the website.

1. Extracting clean spoken chapters. Starting from the pipeline described in the last section, which provides scripts for downloading audio data and transcriptions from the website, we downloaded all the 260 chapters from the New Testament in several languages. We selected (after having manually sampled the website) non-dramatized versions (as opposed to dramatized) that contain standard speech and pronunciation, and mostly, no background music. The audio files are also converted from stereo to mono for the purpose of the following steps.

2. Aligning speech and text for each chapter. For each chapter, we extracted speech-to-text alignments through the Maus forced aligner [25] online platform. During this step, we kept languages with good audio quality and for which an acoustic model was available in the off-the-shelf forced aligner tool. Our final set was reduced to the following eight languages: Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish.

3. Segmenting chapters into verses. Any written chapter of the Bible is inherently segmented into verses. In order to segment our audio files into such smaller units, we aligned our TextGrid files (from step 2) with a written version of the Bible containing verse information. This alignment is rather trivial, since, after removing punctuation, both texts have the same content. After this step, all audio chapters are segmented into verses and receive IDs based on their English chapter name and their verse number (e.g. “Matthew_chapter1_verse3”).
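Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the released scripts: it assumes the forced-aligner output has already been read into a list of (word, start, end) tuples with lowercased, punctuation-free words, and that the verse-annotated text yields the same word sequence once punctuation is removed. The names `normalize` and `segment_into_verses` and the toy data are hypothetical.

```python
import string

def normalize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def segment_into_verses(aligned_words, verses, chapter_id):
    """Map word-level timestamps onto verse boundaries.

    aligned_words: list of (word, start_sec, end_sec) for one chapter.
    verses: list of (verse_number, verse_text) in reading order.
    Returns {verse_id: (start_sec, end_sec)}.
    """
    segments = {}
    cursor = 0
    for number, text in verses:
        tokens = normalize(text)
        chunk = aligned_words[cursor:cursor + len(tokens)]
        # Sanity check: both sources should carry the same words.
        if [word for word, _, _ in chunk] != tokens:
            raise ValueError(f"text mismatch in verse {number}")
        segments[f"{chapter_id}_verse{number}"] = (chunk[0][1], chunk[-1][2])
        cursor += len(tokens)
    return segments

# Toy data (not taken from the corpus).
words = [("one", 0.0, 0.2), ("two", 0.2, 0.5), ("three", 0.5, 0.9),
         ("four", 1.1, 1.3), ("five", 1.3, 1.9)]
verses = [(1, "One, two, three."), (2, "Four; five!")]
segments = segment_into_verses(words, verses, "Matthew_chapter1")
```

The mismatch check is a design choice for this sketch: it fails loudly when the two texts diverge, which is where a different Bible version would show up.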

3.3.1 Result and Comparison

Considering that all chapters consist of the same set of verses, the verse numbers give us a multilingual alignment between all verses for all the language pairs. (This is mostly true, but for a small subset of chapters, due to different Bible versions and different translation approaches, the number of aligned speech verses will differ slightly.) Thus, the output of our pipeline is a set of 8,160 audio segments, aligned at verse level, in eight different languages, with an average of 20 hours of speech for each language. Finally, corpus statistics are presented in Table 1.
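The verse-ID-based pairing described above can be illustrated with a short sketch. It assumes each language is represented by a set of verse IDs and pairs two languages on the IDs they share; counting ordered (source, target) pairs over 8 languages gives 8 × 7 = 56, which is consistent with the 56 language pairs mentioned in this paper. The function name `build_pairs` and the toy IDs are hypothetical.

```python
from itertools import permutations

def build_pairs(verses_by_lang):
    """Return {(src, tgt): sorted verse IDs available in both languages}."""
    pairs = {}
    for src, tgt in permutations(sorted(verses_by_lang), 2):
        pairs[(src, tgt)] = sorted(verses_by_lang[src] & verses_by_lang[tgt])
    return pairs

LANGS = ["EN", "ES", "EU", "FI", "FR", "HU", "RO", "RU"]
toy_ids = {"Matthew_chapter8_verse1", "Matthew_chapter8_verse2"}
verses_by_lang = {lang: set(toy_ids) for lang in LANGS}
# Simulate a verse missing from one language's Bible version.
verses_by_lang["FI"].discard("Matthew_chapter8_verse2")

pairs = build_pairs(verses_by_lang)
```

Taking the intersection per pair, rather than over all eight languages at once, keeps as many verses as possible for each individual pair.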

Table 1: Statistics of the MaSS corpus.
Language | # types | # tokens | Types per verse | Tokens per verse | Avg. token length | Audio length (h) | Avg. verse length (s)
(EN) English | 6,471 | 176,461 | 18.03 | 21.52 | 3.82 | 18.50 | 8.27
(ES) Spanish | 11,903 | 168,255 | 17.90 | 20.52 | 4.17 | 21.49 | 9.58
(EU) Basque | 14,514 | 128,946 | 14.88 | 15.78 | 5.55 | 22.76 | 9.75
(FI) Finnish | 18,824 | 134,827 | 15.04 | 16.44 | 5.66 | 23.16 | 10.21
(FR) French | 10,080 | 183,786 | 19.25 | 22.36 | 4.02 | 19.41 | 8.62
(HU) Hungarian | 20,457 | 135,254 | 15.01 | 16.46 | 5.07 | 21.12 | 9.29
(RO) Romanian | 9,581 | 169,328 | 18.19 | 20.61 | 4.14 | 23.11 | 10.16
(RU) Russian | 16,758 | 129,973 | 14.50 | 15.82 | 4.44 | 22.90 | 9.70
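The text-side statistics of Table 1 could be computed along the following lines. This is only a sketch: the paper does not state the tokenization used, so lowercasing and whitespace splitting on punctuation-free verses are assumptions, and `corpus_stats` is a hypothetical helper.

```python
def corpus_stats(verses):
    """verses: list of strings, one per verse, assumed punctuation-free."""
    tokenized = [verse.lower().split() for verse in verses]
    tokens = [tok for verse in tokenized for tok in verse]
    n_verses = len(tokenized)
    return {
        "types": len(set(tokens)),    # distinct word forms in the corpus
        "tokens": len(tokens),        # running words in the corpus
        "types_per_verse": sum(len(set(v)) for v in tokenized) / n_verses,
        "tokens_per_verse": len(tokens) / n_verses,
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
    }

# Toy input (not taken from the corpus).
stats = corpus_stats(["the cat saw the dog", "a dog ran"])
```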

To justify the need to extend the approach presented in Section 3.2, Table 2 presents a comparison between our corpus output (bottom) and theirs (top). This comparison takes the speech file numbering in their pipeline as the multilingual alignment clue, since no other information is made available. We can observe that by segmenting based on punctuation, the multilingual alignment quickly becomes incorrect: the third file is segmented at a punctuation mark that is not present in the English text, which shifts the alignment for the rest of the chapter.

Table 2: A comparison between CMU’s multilingual alignment and ours. Text in italic shows alignment mismatches between English and French. We used a slightly different (non dramatized) version of the Bible, hence the small differences in the displayed texts.
Alignment from [1]
Files | French | English
00001 | Matthieu | Matthew
00002 | Jésus descend de la montagne et des foules nombreuses le suivent. | When he came down from the mountainside, large crowds followed him.
00003 | Un lépreux s’approche, il se met à genoux devant Jésus et lui dit : | A man with leprosy came and knelt before him and said, “Lord, if you are willing, you can make me clean.”
00004 | Seigneur, si tu le veux, tu peux me guérir ! | Jesus reached out his hand and touched the man. “I am willing,” he said. “Be clean!” Immediately he was cured of his leprosy.
Our alignment
Verses | French | English
00 | Matthieu 8 | Matthew 8
01 | Lorsque Jésus fut descendu de la montagne une grande foule le suivit | When he came down from the mountain great crowds followed him
02 | Et voici un lépreux s’étant approché se prosterna devant lui et dit : Seigneur si tu le veux tu peux me rendre pur | And behold a leper came to him and knelt before him saying Lord if you will you can make me clean
03 | Jésus étendit la main le toucha et dit : Je le veux sois pur Aussitôt il fut purifié de sa lèpre | And Jesus stretched out his hand and touched him saying I will be clean And immediately his leprosy was cleansed

3.3.2 Reproducibility

The presented pipeline performs automatic verse-level alignment using Bible chapters. All the scripts used in this work are available, together with the resulting dataset. For extending it to a new language, here are some recommendations:

  • Bible version: as discussed in Section 3.1, a language can have several versions available on the website. For ensuring the best quality possible, manual inspection in one chapter can be quickly performed to identify a non-dramatized Bible version, but it is not mandatory.

  • Alignment Tool: for generating verse-level alignment, a chapter-level alignment between speech and text is needed. While we use the Maus forced aligner for this task, any aligner able to provide a textgrid file as output can be used at this stage.

4 Resource Evaluation and Analysis

4.1 Human Evaluation: Speech Alignment Quality

Having obtained multilingual alignments between spoken utterances, we attest their quality by performing a human evaluation on a corpus subset, covering the eight language pairs for which we were able to find bilingual judges. We implemented an online evaluation platform with 100 randomly selected verses in these 8 different language pairs. Judges were asked to evaluate the spoken alignments using a scale from 1 to 5 (1 meaning the two audio excerpts do not have any information in common, and 5 meaning they are perfectly aligned). Aiming at the most uniform evaluation possible, we provided guidelines and examples to our evaluators. Transcriptions were also displayed as a cognitive support in evaluation.

The eight language pairs are the following: French-English (FR-EN), French-Spanish (FR-ES), French-Romanian (FR-RO), English-Spanish (EN-ES), English-Finnish (EN-FI), English-Hungarian (EN-HU), English-Romanian (EN-RO) and English-Russian (EN-RU). This selection is a trade-off between the difficulty of finding judges and the desire to provide a good typological variety in our evaluation data. Basque was also chosen because it is a language isolate, that is, a language that has no known connection to any other language. However, we were unable to find judges to perform the evaluation on any language pair including it.

Table 3 summarizes the results of the human evaluation. Evaluation scores are good, with a mean value of 4.41 when averaging the eight per-pair means (4.36 over all individual ratings). Moreover, for every language pair evaluated (except for FR-ES), the median score is the maximum score, hence confirming the quality of the alignment. However, when trying to quantify inter-rater agreement, we obtained mixed results. Percentage of agreement with tolerance 1 (meaning raters differing by one scale degree are interpreted as agreeing) varies from 59.6% (EN-RO) to 95.96% (EN-HU).
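Percentage of agreement with tolerance 1 can be illustrated as below. This sketch covers the two-rater case only; how agreement was aggregated for pairs with more than two evaluators is not described here, so that part is left out. The function name and the toy ratings are hypothetical.

```python
def agreement_with_tolerance(ratings_a, ratings_b, tolerance=1):
    """Percentage of items on which two raters differ by at most `tolerance`."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(ratings_a, ratings_b))
    return 100.0 * hits / len(ratings_a)

# Toy ratings on the 1-to-5 scale (not taken from the evaluation).
rater_1 = [5, 4, 5, 2, 3]
rater_2 = [5, 5, 3, 1, 3]
score = agreement_with_tolerance(rater_1, rater_2)
```

In the toy example the raters differ by more than one degree on a single item out of five.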

Table 3: Result of the manual inspection of the speech alignment quality performed on 8 language pairs (100 sentences). Scale is from 1 to 5 (higher is better). Last column refers to the number of evaluators for a given language pair.
Pair | x̄ | σ | med | min | max | # Eval.
EN - ES | 4.56 | 0.62 | 5 | 3 | 5 | 2
EN - FI | 4.37 | 0.92 | 5 | 1 | 5 | 1
EN - HU | 4.44 | 0.88 | 5 | 1 | 5 | 2
EN - RO | 4.24 | 0.97 | 5 | 1 | 5 | 6
EN - RU | 4.56 | 0.83 | 5 | 1 | 5 | 3
FR - EN | 4.38 | 0.79 | 5 | 1 | 5 | 5
FR - ES | 4.22 | 0.89 | 4 | 2 | 5 | 2
FR - RO | 4.51 | 0.90 | 5 | 1 | 5 | 1
All | 4.36 | 0.88 | 5 | 1 | 5 | 22

4.2 Corpus Linguistic Analysis

Regarding content, the corpus features languages belonging to different families. These are listed below:

  • Indo-European:

    • Romance: French, Romanian, Spanish

    • Germanic: English

    • Slavic: Russian

  • Uralic:

    • Ugric: Hungarian

    • Finnic: Finnish

  • Language Isolate: Basque

It should be noted that these languages are very different from a typological point of view. First of all, Basque, Finnish, Hungarian, Romanian and Russian mainly use case marking to indicate the function of a word in a sentence, while English, French and Spanish rely on word position and prepositions for the same purposes. (Case markers are small grammatical morphemes added to a word base to indicate its grammatical function, e.g. subject, object or manner, within a clause or sentence.) Basque, Finnish and Hungarian are agglutinative languages, while English, French, Romanian, Russian and Spanish are synthetic languages. Thus, for the former group, grammatical markers will bear only one meaning, while in the latter, grammatical markers will bear several meanings at the same time: compare Hungarian “ház-ak-nak” (house-Pl-Dat) and Russian “дом-ам” (house-Pl.Dat). Words in agglutinative languages are comparatively longer than their equivalents in synthetic languages.

Basque is even more special as this language features ergative-absolutive marking while the other languages use nominative-accusative marking. In languages using ergative-absolutive marking, the subject of an intransitive verb and the object of a transitive verb are treated alike and receive the same case marker, while the subject of a transitive verb is treated differently than the subject of an intransitive verb. Romanian also presents an interesting morphological characteristic regarding determiners: the definite article is suffixed to the word whereas indefinite articles are usually prefixed, for instance: “un-băiat" (INDEF-boy: “a boy") and “băiat-ul" (boy-DEF: “the boy"). Finnish and Russian on the other hand do not have any article, neither definite nor indefinite.

Another interesting linguistic phenomenon to observe is the existence of grammatical genders. Russian features three genders (feminine, masculine and neuter) whereas French features only two (feminine and masculine), and Basque and Finnish present no grammatical genders at all. From a syntactic point of view, English, French and Spanish have a relatively fixed word order (and mainly follow the Subject-Verb-Object (SVO) pattern), while word order is more flexible in Basque, Finnish, Hungarian, Romanian and Russian.

Due to all the diverse linguistic features described in this section, we believe this dataset could be used for a wide variety of tasks, such as natural language grammar induction from raw speech, automatic typological feature retrieval, speech-to-speech translation, and speech-to-speech retrieval. The latter is illustrated in Section 5. Moreover, this dataset could also serve as a benchmark for evaluating computational language documentation techniques that work on speech inputs.

5 Use Case: Multilingual Speech Retrieval Task Baseline

In this section we showcase the usefulness of our corpus in a multilingual setting. We perform speech-to-speech retrieval by adapting a model for visually grounded speech [26], and we discuss the results for our baseline model.

5.1 Task and Model Definition

For performing multilingual speech retrieval, we adapted the model proposed by [26]. This model was primarily designed to retrieve images from speech utterances, and it is made of two networks: a speech encoder and an image encoder. By projecting both representations to the same shared space, the model is thus able to learn the relationship between speech segments and the image contents. For our speech-to-speech task, we replaced the image encoder by a (second) speech encoder.

Both speech encoders consist of a convolution bank [27] followed by two layers of bidirectional LSTM [28] and an attention mechanism [29] which computes a weighted average of the LSTM’s activations. The convolution bank consists of a set of K=16 1D-convolution filters, where the kth convolution has a kernel of width k. Each convolution filter consists of 40 units with ReLU activation and stride of 1. The batch-normed output of each convolution is then stacked and the resulting matrix is linearly projected to fit the LSTM’s input dimension of size 256.
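The final pooling step, a weighted average of the recurrent activations, can be sketched in plain Python. How the attention scores are produced in [29] is not reproduced here: the sketch assumes one scalar score per time step is already available and normalizes the scores with a softmax, which is an assumption. `attention_pool` is a hypothetical name.

```python
import math

def attention_pool(activations, scores):
    """Softmax-weighted average over time.

    activations: list of T vectors (lists of floats), one per time step.
    scores: list of T unnormalized attention scores.
    Returns (pooled_vector, weights).
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(activations[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, activations))
              for i in range(dim)]
    return pooled, weights

# Three time steps with equal scores reduce to a plain mean.
pooled, weights = attention_pool(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
)
```

Whatever the utterance length T, the pooled vector has a fixed size, which is what allows verses of different durations to be compared in one embedding space.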

Our model takes as input Mel filterbank spectrograms (40 mel coefficients with a Hamming window size of 25ms and stride of 10ms), extracted from raw speech. The network is trained to minimize the contrastive loss function in Equation 1, which reduces the cosine distance d between a verse in a given language A and its corresponding verse in a given language B, while pushing mismatching verse pairs further apart by at least a given margin α (code borrowed from G. Chrupała). Thus, verses corresponding to direct translations should lie close in the embedding space. Finally, contrary to [3], who only sample one negative example for each caption, we adopted the same method as [30] and considered every other verse in the batch as a negative example.

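$$L(v_A, v_B, \alpha) = \sum_{v_A, v_B} \Big( \sum_{v_A'} \max\big[0, \alpha + d(v_A, v_B) - d(v_A', v_B)\big] + \sum_{v_B'} \max\big[0, \alpha + d(v_A, v_B) - d(v_A, v_B')\big] \Big) \qquad (1)$$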
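A plain-Python reading of Equation 1 is given below as a sketch, not as the training code used in the experiments. It assumes a batch in which `emb_a[i]` and `emb_b[i]` embed the same verse, and it treats every other verse in the batch as a negative example. The margin value 0.2 is an arbitrary placeholder, since the value of α is not given here.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def contrastive_loss(emb_a, emb_b, margin=0.2):
    """Margin-based loss over a batch of aligned verse embeddings."""
    n = len(emb_a)
    loss = 0.0
    for i in range(n):
        d_pos = cosine_distance(emb_a[i], emb_b[i])
        for j in range(n):
            if j == i:
                continue
            # Mismatching verse in language A against the true verse in B.
            loss += max(0.0, margin + d_pos - cosine_distance(emb_a[j], emb_b[i]))
            # True verse in language A against a mismatching verse in B.
            loss += max(0.0, margin + d_pos - cosine_distance(emb_a[i], emb_b[j]))
    return loss

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In the toy batch, correctly aligned embeddings incur no loss because every mismatching pair is already further away than the margin, whereas swapping the targets is penalized on all four terms.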

5.2 Results

We trained an instance of this model for seven language pairs, always keeping English as source language. The 8,160 common verses were randomly split between train (80%), validation (10%) and test (10%) sets. Batches were of size 16, and models were all trained for 100 epochs. Table 4 presents our results for the retrieval task.
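The 80/10/10 split can be sketched as follows. The seed and the use of integer verse indices are assumptions made for the illustration; the actual split procedure is not detailed beyond the proportions.

```python
import random

def split_verses(verse_ids, seed=0):
    """Shuffle verse IDs and split them 80% / 10% / 10%."""
    ids = sorted(verse_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10   # integer arithmetic avoids float rounding
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    held_out = ids[n_train + n_val:]
    return train, val, held_out

train, val, held_out = split_verses(range(8160))
```

With 8,160 verses this yields 6,528 training verses and 816 verses each for validation and test, matching the test set size reported for the retrieval results.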

Table 4: Recall at top 1, 5, and 10 retrieval. Median rank r~ on a verse-to-verse retrieval task is also provided. Results are reported on the test set (816 verses). Chance recalls are 0.001 (R@1), 0.006 (R@5) and 0.012 (R@10). Chance median r~ is 408.5.
Query | R@1 | R@5 | R@10 | r~
EN-EU | 0.173 | 0.395 | 0.523 | 9
EN-ES | 0.130 | 0.341 | 0.469 | 12
EN-HU | 0.116 | 0.319 | 0.455 | 13
EN-RU | 0.102 | 0.308 | 0.414 | 16
EN-RO | 0.085 | 0.289 | 0.396 | 17
EN-FR | 0.092 | 0.259 | 0.364 | 22
EN-FI | 0.076 | 0.202 | 0.293 | 26
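The metrics of Table 4 and their chance levels follow from simple definitions, sketched below. Recall at k is the fraction of queries whose correct verse appears among the top k retrieved candidates, and with 816 candidates the chance values are k/816 for recall and (816 + 1)/2 for the median rank. The helper names and the toy ranks are hypothetical.

```python
from statistics import median

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """ranks: 1-based rank of the correct target verse for each query."""
    recalls = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recalls, median(ranks)

def chance_levels(n_candidates, ks=(1, 5, 10)):
    """Expected recall and median rank for a uniformly random ranking."""
    return {k: k / n_candidates for k in ks}, (n_candidates + 1) / 2

chance_recall, chance_median = chance_levels(816)
recalls, med = retrieval_metrics([1, 3, 7, 12, 40])  # toy ranks
```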

Results show that, while such a speech-to-speech task is challenging, it is possible to obtain bilingual speech embeddings that perform reasonably well on a multilingual retrieval task. The recall and rank results are far above the chance values. We also scored a simple baseline that uses utterance length to retrieve spoken verses (in other words, it uses only the distance between spoken utterances’ lengths to solve the retrieval task). With this baseline, median ranks are better than chance level (r~=408) but vary from r~=136 (EN-FR) to r~=219 (EN-FI), which is very poor compared to our baseline model. Interestingly, our best results, obtained for EN-EU (r~=9) and EN-ES (r~=12), illustrate that the speech-to-speech retrieval task is feasible even for pairs of typologically different languages.
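The length-only baseline can be sketched as below, assuming it ranks candidate verses by the absolute difference between their duration and the query's duration; the exact distance and tie-breaking rule are not specified, so both are assumptions. The function name and the toy durations are hypothetical.

```python
def length_baseline_ranks(query_durations, target_durations):
    """Rank targets by duration difference; return the rank of each true match.

    query_durations[i] and target_durations[i] belong to the same verse.
    """
    ranks = []
    for i, query in enumerate(query_durations):
        order = sorted(range(len(target_durations)),
                       key=lambda j: abs(target_durations[j] - query))
        ranks.append(order.index(i) + 1)  # 1-based rank of the correct verse
    return ranks

# Toy durations in seconds (not taken from the corpus).
en = [8.0, 3.0, 12.0]
fr = [8.5, 9.0, 11.0]
ranks = length_baseline_ranks(en, fr)
```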

Following this experiment, we investigated the correlation between the median rank and two variables: the quality of the alignment (human evaluation) and the syntactic distance between the language pairs (using the lang2vec library [31]). While there is no correlation between the rank and the syntactic distance, there is a strong negative correlation with respect to the human evaluation (significant for p<0.1). One possible explanation for this result may be that higher quality alignments (measured by the human evaluation mean x̄) lead to a slightly easier corpus for the speech-to-speech retrieval task (difficulty being measured by the rank r~). If confirmed, this result would suggest that speech-to-speech retrieval scores are a good proxy for rating alignment corpus quality, as performed for text by [32] through the use of NMT.

Table 5: Correlation between median rank and 1) alignment quality (from manual evaluation) 2) syntactic distance between languages (measured with lang2vec).
Languages | r~ | Quality (x̄) | Syntactic dist.
EN - EU | 9 | NA | 0.61
EN - ES | 12 | 4.56 | 0.40
EN - HU | 13 | 4.44 | 0.57
EN - RU | 16 | 4.56 | 0.49
EN - RO | 17 | 4.51 | 0.53
EN - FR | 22 | 4.38 | 0.46
EN - FI | 26 | 4.37 | 0.53
Correlation | | -0.76 | -0.21
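The Correlation row of Table 5 can be approximately recomputed from the values in the table. The sketch below assumes Pearson's coefficient, which is not named explicitly here, and skips EN-EU for the quality column because it has no human score. Under that assumption both results land within 0.01 of the reported values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Columns of Table 5, in row order (EN-EU first).
median_rank = [9, 12, 13, 16, 17, 22, 26]
quality = [None, 4.56, 4.44, 4.56, 4.51, 4.38, 4.37]
syntactic = [0.61, 0.40, 0.57, 0.49, 0.53, 0.46, 0.53]

rated = [(r, q) for r, q in zip(median_rank, quality) if q is not None]
r_quality = pearson([r for r, _ in rated], [q for _, q in rated])
r_syntax = pearson(median_rank, syntactic)
```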

6 Conclusion

In this paper, we presented the creation of an automatically generated clean and controlled parallel corpus based on the CMU Wilderness Multilingual Speech Dataset. Our resource, called MaSS, contains about 20 hours of speech for each of eight languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) and presents both speech-to-text and speech-to-speech alignments. The quality of the corpus was verified on a subset of 100 sentences in 8 language pairs by native speakers. The pipeline used for creating this dataset, as well as the computed forced alignments for each of the chosen languages, are openly accessible. Only eight languages are currently covered, but we believe the same methodology could easily be applied for extending it to new languages.


  • [1] A. W. Black, “Cmu wilderness multilingual speech dataset,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 5971–5975.
  • [2] L.-s. Lee, J. Glass, H.-y. Lee, and C.-a. Chan, “Spoken content retrieval—beyond cascading speech recognition with text retrieval,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1389–1420, 2015.
  • [3] D. Harwath, G. Chuang, and J. R. Glass, “Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, 2018, pp. 4969–4973. [Online]. Available:
  • [4] A. Berard, O. Pietquin, C. Servan, and L. Besacier, “Listen and translate: A proof of concept for end-to-end speech-to-text translation,” in NIPS Workshop on End-to-end Learning for Speech and Audio Processing, Barcelona, Spain, December 2016.
  • [5] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, “Sequence-to-sequence models can directly transcribe foreign speech,” arXiv preprint arXiv:1703.08581, 2017.
  • [6] S. Bansal, H. Kamper, A. Lopez, and S. Goldwater, “Towards speech-to-text translation without speech recognition,” in EACL (short papers), Valence (Spain), 2017.
  • [7] A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, “End-to-End Automatic Speech Translation of Audiobooks,” in ICASSP 2018 - IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Alberta, Canada, Apr. 2018. [Online]. Available:
  • [8] S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater, “Pre-training on high-resource speech recognition improves low-resource speech-to-text translation,” CoRR, vol. abs/1809.01431, 2018. [Online]. Available:
  • [9] Y. Chung, W. Weng, S. Tong, and J. Glass, “Towards unsupervised speech-to-text translation,” CoRR, vol. abs/1811.01307, 2018. [Online]. Available:
  • [10] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu, “Direct speech-to-speech translation with a sequence-to-sequence model,” CoRR, vol. abs/1904.06037, 2019. [Online]. Available:
  • [11] Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C. Chiu, N. Ari, S. Laurenzo, and Y. Wu, “Leveraging weakly supervised data to improve end-to-end speech-to-text translation,” CoRR, vol. abs/1811.02050, 2018. [Online]. Available:
  • [12] M. Sperber, G. Neubig, J. Niehues, and A. Waibel, “Attention-passing models for robust and data-efficient end-to-end speech translation,” CoRR, vol. abs/1904.07209, 2019. [Online]. Available:
  • [13] R. Aharoni, M. Johnson, and O. Firat, “Massively multilingual neural machine translation,” CoRR, vol. abs/1903.00089, 2019. [Online]. Available:
  • [14] J. Nivre, M. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. T. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman, “Universal dependencies v1: A multilingual treebank collection,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016., 2016. [Online]. Available:
  • [15] T. Schultz and T. Schlippe, “GlobalPhone: Pronunciation dictionaries in 20 languages,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26-31, 2014, pp. 337–341.
  • [16] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, “Massively multilingual adversarial speech recognition,” in Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
  • [17] R. Navigli and S. P. Ponzetto, “BabelNet: Building a very large multilingual semantic network,” in ACL, J. Hajic, S. Carberry, and S. Clark, Eds. The Association for Computational Linguistics, 2010, pp. 216–225.
  • [18] G. Sérasset, “DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF,” Semantic Web – Interoperability, Usability, Applicability, 2014, to appear.
  • [19] C. Christodoulopoulos and M. Steedman, “A massively parallel corpus: the Bible in 100 languages,” Language Resources and Evaluation, vol. 49, no. 2, pp. 375–395, 2015.
  • [20] M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch, and S. Khudanpur, “Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus,” 2013.
  • [21] C. Federmann and W. D. Lewis, “Microsoft Speech Language Translation (MSLT) corpus: The IWSLT 2016 release for English, French and German,” 2016.
  • [22] A. C. Kocabiyikoglu, L. Besacier, and O. Kraif, “Augmenting LibriSpeech with French translations: A multimodal corpus for direct speech translation evaluation,” CoRR, vol. abs/1802.03142, 2018.
  • [23] M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, “MuST-C: a multilingual speech translation corpus,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Minneapolis, MN, USA, June 2019.
  • [24] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze, “How2: a large-scale dataset for multimodal language understanding,” in Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS, 2018.
  • [25] T. Kisler, U. Reichel, and F. Schiel, “Multilingual processing of speech via web services,” Computer Speech & Language, vol. 45, pp. 326–347, 2017.
  • [26] D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba, and J. R. Glass, “Jointly discovering visual objects and spoken words from raw sensory input,” in ECCV (6), ser. Lecture Notes in Computer Science, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., vol. 11210. Springer, 2018, pp. 659–677.
  • [27] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” 2017.
  • [28] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
  • [29] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in ICLR 2015, San Diego, California, USA, 2015, pp. 3104–3112.
  • [30] G. Chrupała, L. Gelderloos, and A. Alishahi, “Representations of language in a model of visually grounded speech signal,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2017, pp. 613–622.
  • [31] P. Littell, D. R. Mortensen, K. Lin, K. Kairis, C. Turner, and L. Levin, “URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, vol. 2, 2017, pp. 8–14.
  • [32] H. Schwenk, V. Chaudhary, S. Sun, H. Gong, and F. Guzmán, “WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia,” CoRR, vol. abs/1907.05791, 2019.