BPE-Dropout: Simple and Effective Subword Regularization

  • 2019-10-29 13:42:56
  • Ivan Provilkov, Dmitrii Emelianenko, Elena Voita

Abstract

Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens. While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences; this may prevent a model from better learning the compositionality of words and being robust to segmentation errors. So far, the only way to overcome this BPE imperfection, its deterministic nature, was to create another subword segmentation algorithm (Kudo, 2018). In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. We introduce BPE-dropout, a simple and effective subword regularization method based on and compatible with conventional BPE. It stochastically corrupts the segmentation procedure of BPE, which leads to producing multiple segmentations within the same fixed BPE framework. Using BPE-dropout during training and the standard BPE during inference improves translation quality by up to 3 BLEU compared to BPE and by up to 0.9 BLEU compared to the previous subword regularization.

 


Ivan Provilkov*1,2, Dmitrii Emelianenko*1,3, Elena Voita1,4 (* Equal contribution.)

1Yandex, Russia
2Moscow Institute of Physics and Technology, Russia
3National Research University Higher School of Economics, Russia
4University of Amsterdam, Netherlands
{iv-provilkov, dimdi-y, lena-voita}@yandex-team.ru

1 Introduction

Using subword segmentation has become the de facto standard in Neural Machine Translation (Bojar et al., 2018; Barrault et al., 2019). Byte Pair Encoding (BPE) (Sennrich et al., 2016) is the dominant approach to subword segmentation. It keeps the common words intact while splitting the rare and unknown ones into a sequence of subword units. This potentially allows a model to make use of morphology, word composition and transliteration. BPE effectively deals with the open-vocabulary problem and is widely used due to its simplicity.

Figure 1: Segmentation process of the word ‘unrelated’ using (a) BPE, (b) BPE-dropout. Hyphens indicate possible merges (merges which are present in the merge table); merges performed at each iteration are shown in green, dropped merges in red.

There is, however, a drawback of BPE in its deterministic nature: it splits words into unique subword sequences, which means that for each word a model observes only one segmentation. Thus, a model is likely not to reach its full potential in exploiting morphology, learning the compositionality of words and being robust to segmentation errors. Moreover, as we will show further, subwords into which rare words are segmented end up poorly understood.

A natural way to handle this problem is to enable multiple segmentation candidates. This was initially proposed by Kudo (2018) as subword regularization, a regularization method which is implemented as on-the-fly data sampling and is not specific to the NMT architecture. Since standard BPE produces a single segmentation, to realize this regularization the author had to propose a new subword segmentation, different from BPE. However, the introduced approach is rather complicated: it requires training a separate segmentation unigram language model, uses the EM and Viterbi algorithms, and forbids using conventional BPE.

In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. BPE builds a vocabulary of subwords and a merge table, which specifies which subwords have to be merged into a bigger subword, as well as the priority of the merges. During segmentation, words are first split into sequences of characters, then the learned merge operations are applied to merge the characters into larger, known symbols, until no merge can be done (Figure 1(a)). We introduce BPE-dropout, a subword regularization method based on and compatible with conventional BPE. It uses a vocabulary and a merge table built by BPE, but at each merge step, some merges are randomly dropped. This results in different segmentations for the same word (Figure 1(b)). Our method requires no segmentation training in addition to BPE and uses standard BPE at test time, and is therefore simple. BPE-dropout is superior to both BPE and Kudo (2018) on a wide range of translation tasks, and is therefore effective.

Our key contributions are as follows:

  • We introduce BPE-dropout – a simple and effective subword regularization method;

  • We show that our method outperforms both BPE and previous subword regularization on a wide range of translation tasks;

  • We analyze how training with BPE-dropout affects a model and show that it leads to a better quality of learned token embeddings and to a model being more robust to noisy input.

2 Background

In this section, we briefly describe BPE and the concept of subword regularization. We assume that our task is machine translation, where a model needs to predict the target sentence Y given the source sentence X, but the methods we describe are not task-specific.

2.1 Byte Pair Encoding (BPE)

To define a segmentation procedure, BPE (Sennrich et al., 2016) builds a token vocabulary and a merge table. The token vocabulary is initialized with the character vocabulary, and the merge table is initialized with an empty table. First, each word is represented as a sequence of tokens plus a special end of word symbol. Then, the method iteratively counts all pairs of tokens and merges the most frequent pair into a new token. This token is added to the vocabulary, and the merge operation is added to the merge table. This is done until the desired vocabulary size is reached.
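The learning procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `learn_bpe`, the `</w>` spelling of the end-of-word symbol, the use of a fixed number of merges as the stopping criterion, and the tie-breaking rule for equally frequent pairs are all assumptions made here.

```python
from collections import Counter

END_OF_WORD = "</w>"  # assumed spelling of the end-of-word symbol


def learn_bpe(word_counts, num_merges):
    """Learn an ordered merge table from a {word: frequency} mapping."""
    # Each word starts as a tuple of characters plus the end-of-word symbol.
    corpus = {tuple(word) + (END_OF_WORD,): count
              for word, count in word_counts.items()}
    merge_table = []
    for _ in range(num_merges):
        # Count all adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for tokens, count in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        # Most frequent pair; ties are broken by the pair itself (assumption).
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merge_table.append(best)
        # Replace every occurrence of the pair with the merged token.
        new_corpus = {}
        for tokens, count in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merge_table
```

The position of a pair in the returned list serves as its priority: earlier merges were more frequent in the training data and are applied first during segmentation.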

The resulting merge table specifies which subwords have to be merged into a bigger subword, as well as the priority of the merges. In this way, it defines the segmentation procedure. First, a word is split into distinct characters plus the end of word symbol. Then, the pair of adjacent tokens which has the highest priority is merged. This is done iteratively until no merge from the table is available (Figure 1(a)).

2.2 Subword regularization

Subword regularization (Kudo, 2018) is a training algorithm which integrates multiple segmentation candidates. Instead of maximizing log-likelihood, this algorithm maximizes log-likelihood marginalized over different segmentation candidates. Formally,

$$\mathcal{L} = \sum_{(X,Y)\in D} \mathbb{E}_{\substack{x\sim P(x|X)\\ y\sim P(y|Y)}} \log P(y|x,\theta), \tag{1}$$

where x and y are sampled segmentation candidates for sentences X and Y respectively, P(x|X) and P(y|Y) are the probability distributions the candidates are sampled from, and θ is the set of model parameters. In practice, at each training step only one segmentation candidate is sampled.
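Since only one candidate is sampled per training step, the expectation in Equation (1) is in effect replaced by a single-sample Monte Carlo estimate. A sketch of that estimate follows; the hats and the symbol for the estimated objective are notation introduced here for illustration, not taken from the paper.

```latex
\hat{\mathcal{L}} = \sum_{(X,Y)\in D} \log P(\hat{y} \mid \hat{x}, \theta),
\qquad \hat{x} \sim P(x \mid X), \quad \hat{y} \sim P(y \mid Y)
```

Because the candidates are drawn afresh each time a sentence pair is visited, this estimate is unbiased for the marginalized objective.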

Since standard BPE segmentation is deterministic, to realize this regularization Kudo (2018) proposed a new subword segmentation. The introduced approach requires training a separate segmentation unigram language model to predict the probability of each subword, EM algorithm to optimize the vocabulary, and Viterbi algorithm to make samples of segmentations.

Subword regularization was shown to achieve significant improvements over the method using a single subword sequence. However, the proposed method is rather complicated and forbids using conventional BPE. This may prevent practitioners from using subword regularization.

3 Our Approach: BPE-Dropout

We show that to realize subword regularization it is not necessary to reject BPE since multiple segmentation candidates can be generated within the BPE framework. We introduce BPE-dropout – a method which exploits the innate ability of BPE to be stochastic. It alters the segmentation procedure while keeping the original BPE merge table. During segmentation, at each merge step some merges are randomly dropped with the probability p. This procedure is described in Algorithm 1.

  Algorithm 1: BPE-dropout
    current_split ← characters from input_word
    do
      merges ← all possible merges of tokens from current_split
      for merge from merges do
        /* The only difference from BPE */
        remove merge from merges with the probability p
      if merges is not empty then
        merge ← select the merge with the highest priority from merges
        apply merge to current_split
    while merges is not empty
    return current_split
 

If p is set to 0, the segmentation is equivalent to the standard BPE; if p is set to 1, the segmentation splits words into distinct characters. The values between 0 and 1 can be used to control the segmentation granularity.
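Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `bpe_dropout`, the `</w>` end-of-word symbol, the representation of the merge table as an ordered list of token pairs (earlier means higher priority), and the leftmost-first rule for equal priorities are choices made here.

```python
import random


def bpe_dropout(word, merge_table, p, rng=random, end_of_word="</w>"):
    """Segment `word` following Algorithm 1; p=0 gives standard BPE."""
    # Priority of a merge is its position in the merge table (lower is earlier).
    priority = {pair: rank for rank, pair in enumerate(merge_table)}
    tokens = list(word) + [end_of_word]
    while True:
        # All merges from the table that apply to adjacent tokens.
        candidates = [
            (priority[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in priority
        ]
        # The only difference from BPE: drop each candidate with probability p.
        candidates = [c for c in candidates if rng.random() >= p]
        if not candidates:
            break
        # Apply the surviving merge with the highest priority (leftmost on ties).
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens
```

With a hypothetical merge table such as `[("t", "</w>"), ("s", "t</w>"), ("e", "st</w>"), ("l", "o"), ("lo", "w")]`, calling `bpe_dropout("lowest", table, p=0.0)` reproduces the deterministic BPE segmentation, while `p=1.0` leaves the word split into characters. Passing a seeded `random.Random` instance as `rng` makes sampled segmentations reproducible.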

We use p>0 (usually p=0.1) at training time to expose a model to different segmentations and p=0 during inference, which means that at inference time we use the original BPE. We discuss the choice of the value of p in Section 5.3.

When some merges are randomly forbidden during segmentation, words end up segmented in different subwords; see for example Figure 1(b). We hypothesize that exposing a model to different segmentations may result in better understanding of the whole words as well as their subword units; we will verify this in Section 6.

4 Experimental setup

4.1 Baselines

Our baselines are the standard BPE and the subword regularization by Kudo (2018). Models trained with BPE-dropout use the merge table and the vocabulary built by the baseline BPE.

Subword regularization by Kudo (2018) has segmentation sampling hyperparameters l and α: l specifies how many best segmentations for each word are produced before sampling one of them, and α controls the smoothness of the sampling distribution. In the original paper, (l=∞, α=0.2/0.5) and (l=64, α=0.1) were shown to perform best on different datasets. Since overall they show comparable results, in all experiments we use (l=64, α=0.1).

4.2 Data sets and preprocessing

Dataset   Pair    Number of sentences     Voc size    Batch size   The value of p
                  (train/dev/test)                                 in BPE-dropout
IWSLT15   En-Vi   133k / 1553 / 1268      4k          4k           0.1 / 0.1
          En-Zh   209k / 887 / 1261       4k / 16k    4k           0.1 / 0.6
IWSLT17   En-Fr   232k / 890 / 1210       4k          4k           0.1 / 0.1
          En-Ar   231k / 888 / 1205       4k          4k           0.1 / 0.1
WMT14     En-De   4.5M / 3000 / 3003      32k         32k          0.1 / 0.1
ASPEC     En-Ja   2M / 1700 / 1812        16k         32k          0.1 / 0.6
Table 1: Overview of the datasets and dataset-dependent hyperparameters. (We explain the choice of the value of p for BPE-dropout in Section 5.3.)

We conduct our experiments on a wide range of datasets with different corpora sizes and languages; information about the datasets is summarized in Table 1. These datasets are used in the main experiments (Section 5.1) and were chosen to match the ones used in the prior work (Kudo, 2018). In the additional experiments (Sections 5.2-5.5), we also use random subsets of the WMT14 English-French data; in this case, we specify dataset size for each experiment.

Prior to segmentation, we preprocess all datasets with the standard Moses toolkit (https://github.com/moses-smt/mosesdecoder). However, Chinese and Japanese have no explicit word boundaries, and the Moses tokenizer does not segment sentences into words; for these languages, subword segmentations are trained almost from unsegmented raw sentences.

Relying on a recent study of how the choice of vocabulary size influences translation quality (Ding et al., 2019), we choose vocabulary size depending on the dataset size (Table 1).

In training, translation pairs were batched together by approximate sequence length. For the main experiments, the values of batch size we used are given in Table 1 (batch size is the number of source tokens). In the experiments in Sections 5.2, 5.3 and 5.4, for datasets not larger than 500k sentence pairs we use vocabulary size and batch size of 4k, and 32k for the rest. (A large batch size can be reached by using several GPUs or by accumulating the gradients for several batches and then making an update.)

4.3 Model and optimizer

The NMT system used in our experiments is Transformer base (Vaswani et al., 2017). More precisely, the number of layers is N=6 with h=8 parallel attention layers, or heads. The dimensionality of input and output is d_model=512, and the inner layer of feed-forward networks has dimensionality d_ff=2048. We use the regularization and optimization procedure described in Vaswani et al. (2017).

4.4 Inference

To produce translations, for all models, we use beam search with the beam of 4 and length normalization of 0.6.

In addition to the main results, Kudo (2018) also report scores using n-best decoding. To translate a sentence, this strategy produces multiple segmentations of a source sentence, generates a translation for each of them, and rescores the obtained translations. While this could be an interesting future work to investigate different sampling and rescoring strategies, in the current study we use 1-best decoding to fit in the standard decoding paradigm.

4.5 Evaluation

For evaluation, we average the 5 latest checkpoints and use BLEU (Papineni et al., 2002) computed via SacreBLEU (Post, 2018). Our SacreBLEU signature is: BLEU+case.lc+lang.[src-lang]-[dst-lang]+numrefs.1+smooth.exp+tok.13a+version.1.3.6. Since Japanese and Chinese have no explicit word boundaries, prior to computing BLEU we segment Chinese into distinct characters and Japanese using KyTea (http://www.phontron.com/kytea).

5 Experiments

5.1 Main results

Dataset   Pair    BPE     Kudo (2018)   BPE-dropout
IWSLT15   En-Vi   31.78   32.43         33.27
          Vi-En   30.83   32.36         32.99
          En-Zh   21.07   23.15         23.27
          Zh-En   18.29   21.10         21.45
IWSLT17   En-Fr   39.37   39.45         40.02
          Fr-En   38.18   38.88         39.39
          En-Ar   13.89   14.43         15.05
          Ar-En   31.90   32.80         33.72
WMT14     En-De   27.41   27.82         28.01
          De-En   32.69   33.65         34.19
ASPEC     En-Ja   43.69   44.92         44.19
          Ja-En   30.77   31.23         31.29
Table 2: BLEU scores. Bold indicates the best score and all scores whose difference from the best is not statistically significant (with p-value of 0.05). (Statistical significance is computed via bootstrapping (Koehn, 2004).)

The results are provided in Table 2. For all datasets, BPE-dropout improves significantly over the standard BPE: more than 3 BLEU for Zh-En, about 1.5 BLEU or more for En-Vi, Vi-En, En-Zh, Ar-En and De-En, and 0.5-1.2 BLEU for the rest. The improvements are especially prominent for smaller datasets; we will discuss this further in Section 5.4.

Compared to Kudo (2018), among the 12 datasets we use, BPE-dropout is beneficial for 8 datasets with improvements of up to 0.92 BLEU, is not significantly different for 3 datasets, and underperforms only on En-Ja. While Kudo (2018) uses another segmentation, our method operates within the BPE framework and changes only the way a model is trained. Thus, the lower performance of BPE-dropout on En-Ja and the small or insignificant improvements for Ja-En, En-Zh and Zh-En suggest that Japanese and Chinese may benefit from a language-specific segmentation.

Note also that Kudo (2018) reports larger improvements over BPE from using their method than we show in Table 2. This might be explained by the fact that Kudo (2018) used a large vocabulary size (16k, 32k), which has been shown to be counterproductive for small datasets (Sennrich and Zhang, 2019; Ding et al., 2019). While this may not be an issue for models trained with subword regularization (see Section 5.4), it causes a drastic drop in the performance of the baselines.

5.2 Single side vs full regularization

                BPE-dropout
        BPE     src-only   dst-only   both
250k    26.94   27.98      27.71      28.40
500k    29.28   30.12      29.40      29.89
1m      30.53   31.09      30.62      31.23
4m      33.38   33.89      33.46      33.85
Table 3: BLEU scores for models trained with BPE-dropout on a single side of a translation pair or on both sides. Models trained on random subsets of WMT14 En-Fr dataset. Bold indicates the best score and all scores whose difference from the best is not statistically significant (with p-value of 0.05).

Table 3 shows results for using BPE-dropout only on one side of a translation pair. We select random subsets of different sizes from WMT14 En-Fr data to show how the results are affected by the amount of data.

The results indicate that using BPE-dropout on the source side is more beneficial than on the target side; for datasets not smaller than 0.5m sentence pairs, BPE-dropout can be used only on the source side. We can speculate that it is more important for the model to understand a source sentence than to be exposed to different ways of generating the same target sentence.

Since full regularization performs the best for all dataset sizes, in the subsequent experiments we use BPE-dropout on both source and target sides.

5.3 Choice of the value of p

Figure 2: BLEU scores for the models trained with BPE-dropout with different values of p. WMT14 En-Fr, 500k sentence pairs.

Figure 2 shows BLEU scores for the models trained on BPE-dropout with different values of p (the probability of a merge being dropped). Models trained with high values of p are unable to translate due to a large mismatch between training segmentation (which is close to char-level) and inference segmentation (BPE). The best quality is achieved with p=0.1.

In our experiments, we use p=0.1 for all languages except for Chinese and Japanese. For Chinese and Japanese, we take the value of p=0.6 to match the increase in length of segmented sentences for other languages. (Formally, for English/French/etc. with BPE-dropout at p=0.1, sentences become on average about 1.25 times longer than when segmented with BPE; for Chinese and Japanese, we need to set the value of p to 0.6 to achieve the same increase.)
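The length-matching criterion for choosing p can be sketched as a small grid search. The helper names `length_ratio` and `pick_p`, the `segment(sentence, p)` callable interface, the number of samples, and the grid of candidate values are all hypothetical; the paper states only the target (about 1.25 times longer than BPE), not the procedure used to find p.

```python
def length_ratio(sentences, segment, p, num_samples=5):
    """Average segmented length with dropout p, relative to standard BPE (p=0).

    `segment(sentence, p)` is an assumed callable returning a list of tokens.
    """
    base = sum(len(segment(s, 0.0)) for s in sentences)
    # Segmentation is stochastic for p > 0, so average over several samples.
    sampled = sum(len(segment(s, p))
                  for s in sentences
                  for _ in range(num_samples))
    return sampled / (num_samples * base)


def pick_p(sentences, segment, target_ratio, grid):
    """Return the value of p from `grid` whose length ratio is closest to the target."""
    return min(
        grid,
        key=lambda p: abs(length_ratio(sentences, segment, p) - target_ratio),
    )
```

For example, `pick_p(corpus, segment, target_ratio=1.25, grid=[0.1, 0.2, 0.4, 0.6])` would return the candidate whose average length increase is nearest to the 1.25 observed for English or French at p=0.1.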

5.4 Varying corpora and vocabulary size

Figure 3: BLEU scores. Models trained on random subsets of WMT14 En-Fr.

Now we will look more closely at how the improvement from using BPE-dropout depends on corpora and vocabulary size.

First, we see that BPE-dropout performs best for all dataset sizes (Figure 3). Next, models trained with BPE-dropout are not sensitive to the choice of vocabulary size: models with 4k and 32k vocabulary show almost the same performance, which is in striking contrast to the standard BPE. This makes BPE-dropout attractive since it allows one (i) not to tune the vocabulary size for each dataset, and (ii) to choose the vocabulary size depending on the desired model properties: models with smaller vocabularies are beneficial in terms of the number of parameters, models with larger vocabularies are beneficial in terms of inference time. (Table 4 shows that inference for models with 4k vocabulary is more than 1.4 times longer than for models with 32k vocabulary.) Finally, we see that the effect of using BPE-dropout vanishes as the corpus size gets bigger. This is not surprising: the effect of any regularization is smaller in high-resource settings; however, as we will show later in Section 6.3, when applied to noisy source, models trained with BPE-dropout show substantial improvements of up to 2 BLEU even in high-resource settings.

5.5 Inference time and length of generated sequences

Since BPE-dropout produces more fine-grained segmentation, sentences segmented with BPE-dropout are longer; the distributions of sentence lengths are shown in Figure 4 (a) (with p=0.1, on average about 1.25 times longer). Thus there is a potential danger that models trained with BPE-dropout may tend to use more fine-grained segmentation in inference and hence slow inference down. However, in practice this is not the case: the distributions of lengths of generated translations for models trained with BPE and with BPE-dropout are close (Figure 4 (b)).

Table 4 confirms these observations and shows that inference time of models trained with BPE-dropout is not substantially different from the ones trained with BPE.

Figure 4: Distributions of length (in tokens) of (a) the French part of WMT14 En-Fr test set segmented using BPE or BPE-dropout; and (b) the generated translations for the same test set by models trained with BPE or BPE-dropout.

voc size   BPE    BPE-dropout
32k        1.0    1.03
4k         1.44   1.46
Table 4: Relative inference time of models trained with different subword segmentation methods. Results obtained by (1) computing the time needed to translate the WMT14 En-Fr test set, averaged over 1000 runs, (2) dividing all results by the smallest of the obtained times.

6 Analysis

In this section, we analyze qualitative differences between models trained with BPE and BPE-dropout. We find that

  • when using BPE, frequent sequences of characters rarely appear in a segmented text as individual tokens rather than as part of bigger ones; BPE-dropout alleviates this issue;

  • by analyzing the learned embedding spaces, we show that using BPE-dropout leads to a better understanding of rare tokens;

  • as a consequence of the above, models trained with BPE-dropout are more robust to misspelled input.

Figure 5: Examples of nearest neighbours in the source embedding space of models trained with BPE and BPE-dropout. Models trained on WMT14 En-Fr (4m).

6.1 Substring frequency

Here we highlight one of the drawbacks of BPE’s deterministic nature: since it splits words into unique subword sequences, only rare words are split into subwords. This forces frequent sequences of characters to mostly appear in a segmented text as part of bigger tokens, and not as individual tokens. To show this, for each token in the BPE vocabulary we calculate how often it appears in a segmented text as an individual token and as a sequence of characters (which may be part of a bigger token or an individual token). Figure 6 shows distribution of the ratio between substring frequency as an individual token and as a sequence of characters (for top-10% most frequent substrings).

For frequent substrings, the distribution of token to substring ratio is clearly shifted to zero, which confirms our hypothesis: frequent sequences of characters rarely appear in a segmented text as individual tokens. When a text is segmented using BPE-dropout with the same vocabulary, this distribution significantly shifts away from zero, meaning that frequent substrings appear in a segmented text as individual tokens more often.
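The token to substring ratio can be computed as in the sketch below. The function names are hypothetical, and two details are assumptions made here: occurrences of a substring are counted within word boundaries with overlaps allowed, and tokens are plain strings without a multi-character end-of-word marker (which would otherwise create spurious substring matches).

```python
from collections import Counter


def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of non-empty `sub` in `text`."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def token_to_substring_ratio(segmented_words, vocabulary):
    """Ratio of a token's frequency as a token to its frequency as a substring.

    `segmented_words` is an iterable of token lists, one per word occurrence.
    """
    as_token = Counter()
    as_substring = Counter()
    for tokens in segmented_words:
        as_token.update(t for t in tokens if t in vocabulary)
        word = "".join(tokens)
        # Quadratic in vocabulary size; fine for a sketch, slow for real corpora.
        for entry in vocabulary:
            as_substring[entry] += count_occurrences(word, entry)
    return {
        entry: as_token[entry] / as_substring[entry]
        for entry in vocabulary
        if as_substring[entry] > 0
    }
```

A ratio near zero means the character sequence almost always appears inside a larger token; a ratio of one means it only ever appears as a token of its own.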

Figure 6: Distribution of token to substring ratio for texts segmented using BPE or BPE-dropout for the same vocabulary of 32k tokens; only 10% most frequent substrings are shown. (Token to substring ratio of a token is the ratio between its frequency as an individual token and as a sequence of characters.)

6.2 Properties of the learned embeddings

Now we will analyze embedding spaces learned by different models. We take embeddings learned by models trained with BPE and BPE-dropout and for each token look at the closest neighbors in the corresponding embedding space. Figure 5 shows several examples. In contrast to BPE, nearest neighbours of a token in the embedding space of BPE-dropout are often tokens that share sequences of characters with the original token. To verify this observation quantitatively, we computed character 4-gram precision of top-10 neighbors: the proportion of those 4-grams of the top-10 closest neighbors which are present among 4-grams of the original token. As expected, embeddings of BPE-dropout have higher character 4-gram precision (0.29) compared to the precision of BPE (0.18).
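The character 4-gram precision defined above can be written down directly. In this sketch the function names are hypothetical, the neighbours are assumed to be given already (for instance from a cosine-similarity search over the embedding matrix), and repeated 4-grams are counted with multiplicity, a detail the text does not specify.

```python
def char_ngrams(token, n=4):
    """All character n-grams of a token; empty if the token is shorter than n."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_precision(token, neighbours, n=4):
    """Share of the neighbours' n-grams that also occur in the original token."""
    reference = set(char_ngrams(token, n))
    candidate = [gram for nb in neighbours for gram in char_ngrams(nb, n)]
    if not candidate:
        return 0.0  # no neighbour is long enough to contain an n-gram
    return sum(gram in reference for gram in candidate) / len(candidate)
```

For the token `translation` with the hypothetical neighbours `translate` and `nation`, seven of the nine neighbour 4-grams also occur in the token, which gives a precision of 7/9.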

Figure 7: Visualization of source embeddings for (a) BPE, (b) BPE-dropout. Models trained on WMT14 En-Fr (4m).

This also relates to the study by Gong et al. (2018). For several tasks, they analyze the embedding space learned by a model. The authors find that while a popular token usually has semantically related neighbors, a rare word usually does not: a vast majority of the closest neighbors of rare words are rare words. To confirm this, we reduce the dimensionality of embeddings by SVD and visualize them (Figure 7). For the model trained with BPE, rare tokens are in general separated from the rest; for the model trained with BPE-dropout, this is not the case. While Gong et al. (2018) propose to alleviate this issue with adversarial training for embedding layers, we showed that a model trained with BPE-dropout does not have this problem.

6.3 Robustness to misspelled input

        source       BPE     BPE-dropout   diff
En-De   original     27.41   28.01         +0.6
        misspelled   24.45   26.03         +1.58
De-En   original     32.69   34.19         +1.5
        misspelled   29.71   32.03         +2.32
En-Fr   original     33.38   33.85         +0.47
        misspelled   30.30   32.13         +1.83
Table 5: BLEU scores for models trained on WMT14 dataset evaluated given the original and misspelled source. For En-Fr, models were trained on 4m randomly chosen sentence pairs.

Models trained with BPE-dropout better learn the compositionality of words and the meaning of subwords, which suggests that these models should be more robust to noise. We verify this by measuring the translation quality of models on a test set augmented with synthetic misspellings. We augment the source side of a test set by modifying each word with a probability of 10%, applying one of the predefined operations. The operations we consider are (1) removal of one character from a word, (2) insertion of a random character into a word, (3) substitution of a character in a word with a random one. This augmentation produces words with an edit distance of 1 from the unmodified words. Edit distance is commonly used to model misspellings (Brill and Moore, 2000; Ahmad and Kondrak, 2005; Pinter et al., 2017).
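The misspelling augmentation can be sketched as follows. The function names, the lowercase ASCII alphabet for random characters, the uniform choice among the three operations, whitespace tokenization, and the decision to skip removal for one-character words are assumptions made for this illustration; the paper specifies only the three operations and the 10% per-word probability.

```python
import random
import string


def misspell_word(word, rng, alphabet=string.ascii_lowercase):
    """Return a copy of non-empty `word` at edit distance 1 from the original."""
    operations = ["insert", "substitute"]
    if len(word) > 1:
        operations.append("remove")  # assumption: never produce an empty word
    op = rng.choice(operations)
    if op == "remove":
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    # Substitute with a character different from the one being replaced.
    i = rng.randrange(len(word))
    replacement = rng.choice([c for c in alphabet if c != word[i]])
    return word[:i] + replacement + word[i + 1:]


def add_misspellings(sentence, prob=0.1, seed=0):
    """Modify each whitespace-separated word with probability `prob`."""
    rng = random.Random(seed)
    return " ".join(
        misspell_word(word, rng) if rng.random() < prob else word
        for word in sentence.split()
    )
```

Fixing the seed keeps the noisy test set identical across the models being compared, so that differences in BLEU are not caused by different noise samples.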

Table 5 shows the translation quality of the models trained on the WMT14 dataset when given the original source and the source augmented with misspellings. We deliberately chose large datasets, where improvements from using BPE-dropout are smaller. We can see that while for the original test sets the improvements from using BPE-dropout are usually modest, for the misspelled test sets the improvements are a lot larger: 1.6-2.3 BLEU. This is especially interesting since models have not been exposed to misspellings during training. Therefore, even for large datasets, using BPE-dropout can result in substantially better quality for practical applications where input is likely to be noisy.

7 Related work

Closest to our work in motivation is the work by Kudo (2018), who introduced the subword regularization framework and a new segmentation algorithm. Other segmentation algorithms include Creutz and Lagus (2006), Schuster and Nakajima (2012), Chitnis and DeNero (2015), Kunchukuttan and Bhattacharyya (2016), Wu and Zhao (2018), Banerjee and Bhattacharyya (2018).

Regularization techniques are widely used for training deep neural networks. Among regularizations applied to network weights, the most popular are Dropout (Srivastava et al., 2014) and L2 regularization. Data augmentation techniques in natural language processing include dropping tokens at random positions or swapping tokens at close positions (Iyyer et al., 2015; Artetxe et al., 2018; Lample et al., 2018), replacing tokens at random positions with a placeholder token (Xie et al., 2017), and replacing tokens at random positions with a token sampled from some distribution (e.g., based on token frequency or a language model) (Fadaee et al., 2017; Xie et al., 2017; Kobayashi, 2018). While BPE-dropout can be thought of as a regularization, our motivation is not to make a model robust by injecting noise. By exposing a model to different segmentations, we want to teach it to better understand the composition of words as well as subwords, and make it more flexible in the choice of segmentation during inference.

Several works study how translation quality depends on the level of granularity of a segmentation (Cherry et al., 2018; Kreutzer and Sokolov, 2018; Ding et al., 2019). Cherry et al. (2018) show that character-level models trained long enough tend to have better quality, but this comes with an increase in computational cost for both training and inference. Kreutzer and Sokolov (2018) find that, given flexibility in choosing the segmentation level, the model prefers to operate on (almost) character level. Ding et al. (2019) explore the effect of BPE vocabulary size and find that it is better to use a small vocabulary for a low-resource setting and a large vocabulary for a high-resource setting. Following these observations, in our experiments we use different vocabulary sizes depending on the dataset size to ensure the strongest baselines.

8 Conclusions

We introduce BPE-dropout, a simple and effective subword regularization method which operates within the standard BPE framework. The only difference from BPE is how a word is segmented during model training: BPE-dropout randomly drops some merges from the BPE merge table, which results in different segmentations for the same word. Models trained with BPE-dropout (1) outperform BPE and the previous subword regularization on a wide range of translation tasks, (2) have better quality of learned embeddings, (3) are more robust to noisy input. Future research directions include adaptive dropout rates for different merges and an in-depth analysis of other pathologies in learned token embeddings for different segmentations.

References

  • F. Ahmad and G. Kondrak (2005) Learning a spelling error model from search query logs. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp. 955–962.
  • M. Artetxe, G. Labaka, and E. Agirre (2018) Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3632–3642.
  • T. Banerjee and P. Bhattacharyya (2018) Meaningless yet meaningful: morphology grounded subword-level NMT. In Proceedings of the Second Workshop on Subword/Character LEvel Models, New Orleans, pp. 55–60.
  • L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri (2019) Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, pp. 1–61.
  • O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, P. Koehn, and C. Monz (2018) Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, pp. 272–303.
  • E. Brill and R. C. Moore (2000) An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, pp. 286–293.
  • C. Cherry, G. Foster, A. Bapna, O. Firat, and W. Macherey (2018) Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4295–4305.
  • R. Chitnis and J. DeNero (2015) Variable-length word encodings for neural translation models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2088–2093.
  • M. Creutz and K. Lagus (2006) Morfessor in the morpho challenge. In Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, pp. 12–17.
  • S. Ding, A. Renduchintala, and K. Duh (2019) A call for prudent choice of subword merge operations in neural machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, Dublin, Ireland, pp. 204–213.
  • M. Fadaee, A. Bisazza, and C. Monz (2017) Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 567–573.
  • C. Gong, D. He, X. Tan, T. Qin, L. Wang, and T. Liu (2018) Frage: frequency-agnostic word representation. In Advances in neural information processing systems, pp. 1334–1345.
  • M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III (2015) Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1681–1691. External Links: Link, Document Cited by: §7.
  • S. Kobayashi (2018) Contextual augmentation: data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 452–457. External Links: Link, Document Cited by: §7.
  • P. Koehn (2004) Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: Table 2.
  • J. Kreutzer and A. Sokolov (2018) Learning to segment inputs for nmt favors character-level processing. In Proceedings of the 15th International Workshop on Spoken Language Translation, External Links: Link Cited by: §7.
  • T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 66–75. External Links: Link Cited by: BPE-Dropout: Simple and Effective Subword Regularization, §1, §1, §2.2, §2.2, §4.1, §4.1, §4.2, §4.4, §5.1, §5.1, Table 2, §7.
  • A. Kunchukuttan and P. Bhattacharyya (2016) Orthographic syllable as basic unit for SMT between related languages. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1912–1917. External Links: Link, Document Cited by: §7.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2018) Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations, External Links: Link Cited by: §7.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Link, Document Cited by: §4.5.
  • Y. Pinter, R. Guthrie, and J. Eisenstein (2017) Mimicking word embeddings using subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 102–112. External Links: Link, Document Cited by: §6.3.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Belgium, Brussels, pp. 186–191. External Links: Link Cited by: §4.5.
  • M. Schuster and K. Nakajima (2012) Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. External Links: Link Cited by: §7.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §1, §2.1.
  • R. Sennrich and B. Zhang (2019) Revisiting low-resource neural machine translation: a case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 211–221. External Links: Link, Document Cited by: §5.1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. External Links: Link Cited by: §7.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Los Angeles. External Links: Link Cited by: §4.3.
  • Y. Wu and H. Zhao (2018) Finding better subword segmentation for neural machine translation. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pp. 53–64. External Links: Link Cited by: §7.
  • Z. Xie, S. I. Wang, J. Li, D. L. vy, A. Nie, D. Jurafsky, and A. Y. Ng (2017) Data noising as smoothing in neural network language models. In International Conference on Learning Representations, External Links: Link Cited by: §7.