Improving Pre-Trained Multilingual Models with Vocabulary Expansion

  • 2019-09-26 23:56:07
  • Hai Wang, Dian Yu, Kai Sun, Jianshu Chen, Dong Yu


Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on a pre-trained multilingual model, BERT, for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that using mixture mapping is more promising. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.


ACL 2019 Reviews for Submission #403

Title: Improving Pre-Trained Multilingual Model with Vocabulary Expansion

Authors: Hai Wang, Dian Yu, Kai Sun, Jianshu Chen and Dong Yu


Comments: An approach to handle the OOV issue in multilingual BERT is proposed. A great deal of nice experiments were done but ultimately (and in message board discussions) the reviewers agreed there wasn’t enough novelty or result here to justify acceptance.


What is this paper about, what contributions does it make, what are the main strengths and weaknesses?

The paper addresses the OOV problem of multilingual language models. In particular, the authors extend BERT with a mapping from monolingual language models and evaluate it on a variety of tasks, covering both token-level and sentence-level prediction.

The main strengths of the paper are a sound experimental part with many experiments and results that give a clear picture of the pros and cons of the approach.

The main weakness of the paper, in my opinion, lies in the presentation and motivation of the methods, which I find a bit confusing (see my clarification questions below for details) and which I hope the authors will improve for the camera-ready version of the paper.

Reasons to accept

An exploration of how BERT performs in multilingual settings and how it could be improved.

Reasons to reject

I don’t see risks with accepting this paper.

Reviewer’s Scores

Overall Recommendation: 4

Questions and Suggestions for the Author(s)

- If I understood the introduction correctly, you do not want to train monolingual models on large-scale corpora because this is time-consuming and resource-intensive. However, isn’t that exactly what you need to do in order to apply your mapping methods? (line 298)

- Typically, OOV problems are investigated at the word level and taken as "solved" at the subword level. Since byte pair encoding falls back to single characters if needed, I am a bit surprised that OOV seems to be such a big issue with byte pair encoding. Can the authors explain how this happens? Or give examples of OOVs in the different languages? In Table 2, for example, the OOVsw numbers are pretty large for Arabic (possibly because of the different alphabet?) but also for Polish and Swedish.

- In the introduction, you motivate that vocabulary sizes for multilingual models are typically small because of expensive softmax computations. What about using class-based LMs with languages as classes so that you can first predict the language and then the word given that language, as in Mikolov et al. 2011 ("Extensions of recurrent neural network language model")?

- Line 160: I find it surprising to just ignore the shared subwords. Can you please provide numbers for how many subwords are actually shared among the languages you consider? (to see if you can safely ignore them - I would assume that a lot of them are actually shared)

- How can you get "non-contextualized embeddings" from BERT?
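As an illustration of the byte pair encoding question above, here is a minimal sketch of one way an OOV can still arise at the subword level. It assumes a WordPiece-style greedy longest-match tokenizer of the kind BERT uses, in which a word is replaced by a single unknown token when some remaining span matches no vocabulary entry, for example because a character is missing from the vocabulary. The function name `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical and not taken from the paper.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Continuation pieces carry a "##" prefix. If any remaining span has
    no match in the vocabulary, the whole word becomes `unk`.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the word is out of vocabulary
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary without the character "ź".
toy_vocab = {"play", "##ing", "##s", "p", "##l", "##a", "##y"}

print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("playź", toy_vocab))    # ['[UNK]']
```

Under this assumption, a single character that never made it into the shared vocabulary is enough to turn an otherwise segmentable word into an OOV, which may be part of why languages with their own alphabets or diacritics are affected.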


What is this paper about, what contributions does it make, what are the main strengths and weaknesses?

This paper proposes three approaches to address the out-of-vocabulary problem in multilingual BERT: as many languages share the same vocabulary, the vocabulary size for each language (particularly low-resource ones) is comparatively small. The first approach learns a mapping between a language-specific pretrained embedding space (from fastText) and the BERT embedding space, as in Madhyastha et al. (2016). The second approach maps all languages to the English embedding space using a cross-lingual word embedding method, MUSE. This joint embedding space is then mapped to the BERT embedding space with another transformation. The third approach represents all subwords as a mixture of English subwords, as in Gu et al. (2018). The first two approaches are found not to perform competitively, so many of the experiments are done only with the third approach. The paper reports results on POS tagging, code-mixed sequence labeling, reading comprehension (where it creates a new Chinese-English dataset), and MT quality estimation. The mixture model slightly outperforms multilingual BERT.
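To make the difference between a learned linear mapping and a mixture representation concrete, the following is a minimal NumPy sketch on synthetic data. It is not the authors' implementation: the dimensions, the random toy embeddings, the least-squares fit, and the softmax over dot products used as mixture weights are all assumptions chosen for illustration, and `mixture_embedding` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 50 anchor subwords that exist in both spaces.
d_src, d_tgt, n_anchor = 8, 6, 50
W_true = rng.normal(size=(d_src, d_tgt))
X_anchor = rng.normal(size=(n_anchor, d_src))  # external (fastText-like) vectors
Y_anchor = X_anchor @ W_true                   # target (BERT-like) vectors

# Linear mapping: W = argmin_W ||X_anchor W - Y_anchor||_F^2.
W, *_ = np.linalg.lstsq(X_anchor, Y_anchor, rcond=None)

# An OOV subword has only an external vector; map it into the target space.
x_oov = rng.normal(size=d_src)
y_mapped = x_oov @ W


def mixture_embedding(x, X_vocab, Y_vocab, temperature=1.0):
    """Represent x as a convex combination of in-vocabulary target vectors.

    Weights come from a softmax over similarities in the external space
    (an assumption made for this sketch).
    """
    scores = X_vocab @ x / temperature
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ Y_vocab, weights


y_mix, weights = mixture_embedding(x_oov, X_anchor, Y_anchor)
```

The linear map needs the two spaces to be related by a single transformation, whereas the mixture only reuses existing target vectors, so its output always stays inside their convex hull.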


Strengths

  1. The paper addresses a timely and relevant problem.

  2. The paper conducts a large number of experiments.

Weaknesses

  1. One thing I would have liked to see that would motivate the problem that this paper is addressing is an analysis that shows how small the vocabulary actually is. Such an analysis would also help make clear whether this is only a problem that appears in a massively multilingual setting or whether this is already an issue with five or ten languages.

  2. A shared multilingual vocabulary is a feature that is not unique to BERT, but can be found in any model that is jointly trained with many languages. It would have been good to compare this approach with another model, either another Transformer-based model that uses subword embeddings such as GPT, an LSTM-based model with subword embeddings, or a multilingual NMT model.

  3. I found it somewhat irritating that the first two methods are presented in detail, yet after the first experiment section only the third method is used because the first two do not perform very well. IMO it would strengthen the paper if they were either discussed less or if there were more analysis of why they do not work well. Søgaard et al. (2018) find that embedding spaces learned with different methods cannot be easily mapped onto each other, as discussed in Section 3.2. To control for this effect, it would have been nice to try mapping to an embedding space obtained with the same method (e.g. the English or Chinese BERT) and investigate whether the methods still perform poorly.

  4. The paper mainly applies existing methods to a novel model.

  5. (minor) The baselines in Table 2 are somewhat outdated. A more recent method to compare against is Yasunaga et al. (NAACL 2018).

Reasons to accept

The community would become aware of some ways to address the out-of-vocabulary setting with state-of-the-art models, even though most of these methods have been proposed before.

Reasons to reject

While the out-of-vocabulary problem in large multilingual models is an important issue, in my opinion this paper leaves too many questions open and misses out on investigating and analyzing important issues.

Reviewer’s Scores

Overall Recommendation: 2

Missing References

Conneau et al. (2017) should be Conneau et al. (2018) (the paper was presented at ICLR 2018).

The paper cites Conneau et al. (2017) several times as a reference for the observation that it is difficult to map embeddings learned with different methods (e.g. in line 340). As far as I’m aware, Conneau et al. do not make this observation. Søgaard et al. (2018) should be cited here instead. Søgaard et al. (2018) should also be cited for the bilingual lexicon using identical subwords (line 311).

Søgaard, A., Ruder, S., & Vulić, I. (2018). On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of ACL 2018.


What is this paper about, what contributions does it make, what are the main strengths and weaknesses?

This paper investigates 3 methods for expanding the vocabulary of multilingual pretrained word embeddings (such as multilingual BERT) using subword units. The argument is that the vocabulary for each language in pretrained multilingual word embeddings is usually small and it is costly to retrain with a bigger vocab. This leads to a high out-of-vocab rate for downstream tasks, which lowers the performance. They expand the vocab of the target pretrained word embeddings by mapping language-specific embeddings to this space. They experimented with 3 mapping methods (independent mapping, joint mapping and mixture mapping) on various tasks ranging from token-level to sequence-level. Overall, they achieved better accuracy compared with the pretrained word embeddings without vocab expansion. I quite like the discussion part in section 3.6, which sheds more light on the performance gain. The strength of this paper is the extensive set of experiments, but their best model (using mixture mapping) was already proposed in Gu et al. (2018), limiting their contribution.
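The link between vocabulary expansion and the out-of-vocab rate described above can be sketched in a few lines. This is only an illustration under simple assumptions: tokens are looked up whole in a dictionary, new embedding rows are appended to the end of the matrix, and the helper names `oov_rate` and `expand_embeddings` as well as the toy data are hypothetical.

```python
import numpy as np


def oov_rate(tokens, vocab):
    """Fraction of tokens that are missing from the vocabulary."""
    return sum(tok not in vocab for tok in tokens) / len(tokens)


def expand_embeddings(vocab, emb, new_vectors):
    """Append rows for unseen subwords; existing rows keep their indices."""
    vocab = dict(vocab)  # copy so the original mapping is left untouched
    rows = [emb]
    for subword, vec in new_vectors.items():
        if subword not in vocab:
            vocab[subword] = len(vocab)
            rows.append(vec[None, :])
    return vocab, np.concatenate(rows, axis=0)


# Toy example: "kot" is unknown to the original vocabulary.
vocab = {"the": 0, "cat": 1}
emb = np.zeros((2, 4))
tokens = ["the", "cat", "kot", "the"]

before = oov_rate(tokens, vocab)
# In practice the new vector would come from one of the mapping methods.
vocab2, emb2 = expand_embeddings(vocab, emb, {"kot": np.ones(4)})
after = oov_rate(tokens, vocab2)
```

Appending rows rather than re-indexing keeps the pretrained embedding rows aligned with their original ids, so the rest of the pretrained model can be reused unchanged.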

Reasons to accept

A simple method with good results in many tasks.

Reasons to reject

Lack of novelty.

Reviewer’s Scores

Overall Recommendation: 3.5