Training a code-switching language model with monolingual data

  • 2019-11-14 09:09:10
  • Shun-Po Chuang, Tzu-Wei Sung, Hung-Yi Lee

Abstract

A lack of code-switching data complicates the training of code-switching (CS) language models. We propose an approach to train such CS language models on monolingual data only. By constraining and normalizing the output projection matrix in RNN-based language models, we bring embeddings of different languages closer to each other. Numerical and visualization results show that the proposed approaches remarkably improve the performance of CS language models trained on monolingual data. The proposed approaches are comparable or even better than training CS language models with artificially generated CS data. We additionally use unsupervised bilingual word translation to analyze whether semantically equivalent words in different languages are mapped together.

 


Shun-po Chuang*, Tzu-wei Sung, Hung-yi Lee
Graduate Institute of Communication Engineering, National Taiwan University
{f04942141, b03902042, hungyilee}@ntu.edu.tw
* This work is sponsored by the Ministry of Science and Technology.


Index Terms—  Code-Switching, Language Model

1 Introduction

Code-switching (CS), which occurs when two or more languages are used within a document or a sentence, is widely observed in multicultural areas. Related research is characterized by a lack of data; the application of prior knowledge [zeng2017improving, 6639306] or additional constraints [li2013improved, ying2014language] would alleviate this issue. Because it is easier to collect monolingual data than CS data, efficiently utilizing a large amount of monolingual data would be a solution to the lack of CS data [hamed2017building]. Recent work [gonen2018language] attempts to train a CS language model using fine-tuning. Similar work [garg2017dual] integrates two monolingual language models (LMs) by introducing a special “switch” token in both languages when training the LM, and further incorporating this within automatic speech recognition (ASR). Other works synthesize additional CS text using the modeled distribution from the data [winata2018learn, yilmaz2018acoustic]. Generative adversarial neural networks [goodfellow2014generative, arjovsky2017wasserstein] learn the CS point distribution from CS text [chang2018code]. In this paper, we propose utilizing constraints to bring word embeddings of different languages closer together in the same latent space, and to normalize each word vector to generally improve the CS LM. Similar constraints are used in end-to-end ASR [khassanov2019constrained], but have not yet been reported for CS language modeling. Related prior work [audhkhasi2017direct, settle2019acoustically] attempts to initialize the word embedding with unit-normalized vectors in ASR but does not keep the unit norm during training. Initial experiments on CS data showed that constraining and normalizing the output projection matrix helps LMs trained on monolingual data to better handle CS data.

2 Code-Switching Language Modeling

In our approach, we use monolingual data only for training; CS data is for validation and testing only.

2.1 RNN-based Language Model

We adopt a recurrent neural network (RNN) based language model [Mikolov2010RecurrentNN]. Given a sequence of words $[w_1, w_2, \dots, w_T]$, we obtain predictions $y_i$ by applying a transformation $W$ to the RNN hidden states $h_i$, followed by a softmax:

$$y_i = \mathrm{softmax}(W h_i) \tag{1}$$

where $i = 1, 2, \dots, T$ and $h_0$ is a zero vector. Specifically, the output projection matrix is denoted by $W \in \mathbb{R}^{V \times z}$, where $V$ is the vocabulary size and $z$ is the hidden layer size of the RNN. Gradient descent is then used to update the parameters with a cross-entropy loss function. Consider two languages $L_1$ and $L_2$ in CS language modeling: the output projection matrix $W$ is partitioned into $W_1$ and $W_2$, whose rows are the latent representations of the words in $L_1$ and $L_2$ respectively. With the vocabulary ordered so that all $L_1$ words precede all $L_2$ words, the output projection matrix can be written as the row-wise concatenation $W = [W_1; W_2]$.
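As a rough illustration of Eq. (1) and the partition of $W$, the sketch below computes $y_i$ for a toy vocabulary in plain Python. The matrix values, the hidden state `h`, and the helper names `softmax` and `project` are made up for this example and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(W, h):
    """Compute y = softmax(W h) for a V x z matrix W (list of rows) and a length-z vector h."""
    scores = [sum(w_k * h_k for w_k, h_k in zip(row, h)) for row in W]
    return softmax(scores)

# Toy vocabulary: two L1 words followed by two L2 words (values are made up).
W1 = [[0.5, 0.1], [0.2, 0.4]]    # rows = L1 word representations
W2 = [[-0.3, 0.2], [0.1, -0.6]]  # rows = L2 word representations
W = W1 + W2                      # row-wise concatenation [W1; W2]

h = [1.0, -1.0]                  # hypothetical hidden state h_i
y = project(W, h)                # one probability per vocabulary word
```

Each entry of `y` corresponds to one row of `W`, so the first `len(W1)` probabilities belong to $L_1$ words and the rest to $L_2$ words.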

2.2 Constraints on Output Projection Matrix

By optimizing the LM with $L_1$ and $L_2$ monolingual data, it is possible to improve the perplexity on both languages. Word embedding distributions have arbitrary shapes that depend on the characteristics of each language. Without seeing bilingual word pairs, however, the two distributions may each converge to their own shape without correlating with each other, which makes it difficult for the LM to switch between languages. To train an LM with only monolingual data, we assume that overlapping embeddings benefit CS language modeling. To this end, we attempt to bring the word embeddings of $L_1$ and $L_2$, that is $W_1$ and $W_2$, closer to each other. We constrain $W_1$ and $W_2$ in two ways; Fig. 1 shows an overview of the proposed approach.

Fig. 1: Overview of the proposed approach. Brackets indicate the concatenation operation between $W_1$ and $W_2$.

2.2.1 Symmetric Kullback–Leibler Divergence

Kullback–Leibler divergence (KLD) is a well-known measurement of the distance between two distributions. Minimizing the KLD between the language distributions encourages their embedding spaces to overlap semantically. We assume that the rows of both $W_1$ and $W_2$ follow a $z$-dimensional multivariate Gaussian distribution, that is,

$$W_1 \sim \mathcal{N}(\mu_1, \Sigma_1), \quad W_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$$

where $\mu_1, \mu_2 \in \mathbb{R}^{z}$ and $\Sigma_1, \Sigma_2 \in \mathbb{R}^{z \times z}$ are the mean vectors and covariance matrices for $W_1$ and $W_2$ respectively. Based on the Gaussian assumption, we can easily compute the KLD between $W_1$ and $W_2$. Because KLD is asymmetric, we adopt the symmetric form of KLD (SKLD), that is, the sum of the KLD between $W_1$ and $W_2$ and that between $W_2$ and $W_1$:

$$\mathcal{L}_{SKLD} = \frac{1}{2}\Big[\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_2 + \Sigma_2^{-1}\Sigma_1\big) + (\mu_1 - \mu_2)^{T}\big(\Sigma_1^{-1} + \Sigma_2^{-1}\big)(\mu_1 - \mu_2) - 2z\Big].$$
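For readers who want the intermediate step, the SKLD expression follows from the standard closed-form KLD between two Gaussians. The derivation below is a reconstruction from that textbook formula rather than something stated in the paper; $\mathcal{N}_1$ and $\mathcal{N}_2$ abbreviate $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$.

```latex
\begin{aligned}
\mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2)
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma_2^{-1}\Sigma_1\big)
     + (\mu_1-\mu_2)^{T}\Sigma_2^{-1}(\mu_1-\mu_2) - z
     + \ln\tfrac{\det\Sigma_2}{\det\Sigma_1}\Big], \\
\mathrm{KL}(\mathcal{N}_2 \,\|\, \mathcal{N}_1)
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_2\big)
     + (\mu_1-\mu_2)^{T}\Sigma_1^{-1}(\mu_1-\mu_2) - z
     + \ln\tfrac{\det\Sigma_1}{\det\Sigma_2}\Big], \\
\mathcal{L}_{SKLD}
  &= \mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) + \mathrm{KL}(\mathcal{N}_2 \,\|\, \mathcal{N}_1) \\
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_2 + \Sigma_2^{-1}\Sigma_1\big)
     + (\mu_1-\mu_2)^{T}\big(\Sigma_1^{-1}+\Sigma_2^{-1}\big)(\mu_1-\mu_2) - 2z\Big].
\end{aligned}
```

The two log-determinant terms have opposite signs and cancel in the sum, which is why the symmetric form contains no determinant.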

2.2.2 Cosine Distance

Cosine distance (CD) is a common measurement for semantic evaluation. By minimizing CD, we attempt to bring the semantic latent spaces of the two languages closer. Similar to SKLD, we compute the mean vectors $\mu_1$ and $\mu_2$ of $W_1$ and $W_2$ respectively, and the CD between the two mean vectors is obtained as

$$\mathcal{L}_{CD} = 1 - \frac{\mu_1 \cdot \mu_2}{\|\mu_1\| \, \|\mu_2\|},$$

where $\|\cdot\|$ denotes the $\ell_2$ norm. We hypothesize that, by minimizing SKLD or CD, the latent representations of the words in $L_1$ and $L_2$ become distributed in the same semantic space and overlap.
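A minimal sketch of the CD term, assuming each of $W_1$ and $W_2$ is stored as a list of row vectors. The helper names and toy values are illustrative only, and how the term is combined with the cross-entropy loss is not shown here.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a matrix given as a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine_distance(u, v):
    """1 - cos(u, v), using the l2 norm of each vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy matrices (made up): the two mean vectors are orthogonal.
W1 = [[1.0, 0.0], [3.0, 0.0]]   # mean = [2, 0]
W2 = [[0.0, 1.0], [0.0, 5.0]]   # mean = [0, 3]

loss_cd = cosine_distance(mean_vector(W1), mean_vector(W2))
```

With orthogonal means the distance is 1; it approaches 0 as the two mean vectors align, regardless of their magnitudes.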

2.3 Output Projection Matrix Normalization

Apart from the constraints in Section 2.2, we propose normalizing the output projection matrix, that is, dividing each word representation by its $\ell_2$ norm so that it has unit norm. Note that normalization is independent of the constraints, and the two can be applied together. To motivate normalization, consider semantically equivalent words $w_j$ and $w_k$: the cosine similarity between their latent representations $v_j$ and $v_k$ (rows of $W$) should be 1, implying that the angle between them is 0, that is, they have the same orientation. By Eq. (1), the probabilities $y_{i,j} = \frac{\exp(v_j h_i)}{\sum_{m=1}^{V} \exp(v_m h_i)}$ and $y_{i,k} = \frac{\exp(v_k h_i)}{\sum_{m=1}^{V} \exp(v_m h_i)}$ are not necessarily equal, because the magnitudes of $v_j$ and $v_k$ might differ. Once every word vector has unit norm, however, two vectors with the same orientation are identical, so given the same history the LM assigns equal probabilities to the two semantically equivalent words. Thus normalization helps cluster semantically equivalent words in the embedding space, which improves language modeling in general.
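The argument above can be checked numerically. The sketch below uses made-up vectors in which `v_k` is three times `v_j`, and compares the softmax probabilities before and after row-wise unit normalization; the function names are illustrative and not from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(W, h):
    """y = softmax(W h) with W given as a list of rows."""
    return softmax([sum(a * b for a, b in zip(row, h)) for row in W])

def unit_normalize(W):
    """Divide every row of W by its l2 norm."""
    normalized = []
    for row in W:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    return normalized

# v_j and v_k share an orientation but differ in magnitude (toy values).
W = [[1.0, 1.0],    # v_j
     [3.0, 3.0],    # v_k = 3 * v_j
     [1.0, -2.0]]   # an unrelated word
h = [0.5, 0.2]      # hypothetical hidden state

y_raw = project(W, h)                   # y_raw[0] != y_raw[1]
y_norm = project(unit_normalize(W), h)  # y_norm[0] == y_norm[1] up to rounding
```

Before normalization the longer vector receives the higher probability purely because of its magnitude; after normalization the two words are indistinguishable to the softmax.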

3 Experimental Setup

3.1 Corpus

The South East Asia Mandarin-English (SEAME) corpus [Lyu2010SEAMEAM] was used for the following experiments. It can be divided into two parts according to the languages used in the transcriptions. The first part is monolingual, containing pure Mandarin and pure English transcriptions, the two main languages in this corpus. The second part consists of code-switching (CS) sentences, whose transcriptions mix words from the two languages. The original data consists of train, dev_man, and dev_sgn (see https://github.com/zengzp0912/SEAME-dev-set). Each split contains monolingual and CS sentences, but dev_man and dev_sgn are dominated by Mandarin and English respectively. We held out 1000 Mandarin, 1000 English, and all CS sentences (because we needed only monolingual data to train the LM) from train as the validation set. The remaining monolingual sentences formed the training set. Similar to prior work [khassanov2019constrained], we used dev_man and dev_sgn for testing, but to balance the Mandarin-to-English ratio, we combined them into a single testing set.

3.2 Pseudo Code-switching Training Data

To compare the performance of the constraints and normalization with an LM trained on CS data, we also introduce pseudo-CS data training, in which we use monolingual data to generate artificial CS sentences. Two approaches are used to generate pseudo-CS data.

Word substitution: Given only monolingual data, we randomly replace a word in a monolingual sentence with its corresponding word in the other language, according to a substitution probability, to produce CS data. This requires a vocabulary mapping between the two languages, so we use the bilingual translation pairs provided by MUSE [conneau2017word] (https://github.com/facebookresearch/MUSE). Note that not all translated words are in our vocabulary set.

Sentence concatenation: We randomly sample sentences in different languages from the original corpus and concatenate them to construct a pseudo-CS sentence, which we add to the original monolingual corpus.
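The two generation strategies can be sketched as follows. This is one plausible reading under stated assumptions, not the authors' implementation: each word is substituted independently with probability `prob`, the concatenation order is chosen at random, and `toy_mapping` is a made-up stand-in for the MUSE dictionary.

```python
import random

def word_substitution(sentence, mapping, prob, rng):
    """Swap each word that has a translation for that translation with probability `prob`.

    Assumption: words are sampled independently; words missing from `mapping` are kept.
    """
    return [mapping[w] if w in mapping and rng.random() < prob else w
            for w in sentence]

def sentence_concatenation(l1_sentences, l2_sentences, rng):
    """Join one random sentence from each language (order chosen at random, an assumption)."""
    pair = [rng.choice(l1_sentences), rng.choice(l2_sentences)]
    rng.shuffle(pair)
    return pair[0] + pair[1]

# Toy data; the mapping is hypothetical and much smaller than a real dictionary.
toy_mapping = {"我": "i", "喜欢": "like"}
zh = [["我", "喜欢", "咖啡"]]
en = [["i", "like", "tea"]]

rng = random.Random(0)  # fixed seed so runs are repeatable
substituted = word_substitution(zh[0], toy_mapping, 0.2, rng)
concatenated = sentence_concatenation(zh, en, rng)
```

Words without an entry in the mapping, such as 咖啡 here, are never replaced, which mirrors the note that not all translated words are in the vocabulary.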

3.3 Evaluation Metrics

Perplexity (PPL) is a common measurement of language modeling. Lower perplexity indicates higher confidence in the predicted target. To better observe the effects of the techniques proposed above, we computed five kinds of perplexity on the corpus: 1) ZH: PPL of monolingual Mandarin sentences; 2) EN: PPL of monolingual English sentences; 3) CS-PPL: PPL of CS sentences; 4) CSP-PPL: PPL of CS points, which occur when the language of the next word differs from that of the current word; 5) Overall: PPL of the whole corpus, including monolingual and CS sentences. CS-PPL and CSP-PPL are measured separately because improvements in CS-PPL do not necessarily translate to improvements in CSP-PPL: as CS sentences often consist mostly of non-CS points, CS-PPL is likely to benefit more from improving monolingual perplexity than from improving CSP-PPL.
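To make the difference between CS-PPL and CSP-PPL concrete, the sketch below computes both from per-token probabilities of a single toy CS sentence. The probabilities and language tags are invented, and treating a CS point as any token whose language differs from the previous token's is our reading of the definition above.

```python
import math

def perplexity(log_probs):
    """exp of the negative mean natural-log probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

def cs_point_log_probs(log_probs, langs):
    """Keep log-probs of tokens whose language differs from the previous token's."""
    return [lp for i, lp in enumerate(log_probs)
            if i > 0 and langs[i] != langs[i - 1]]

# Hypothetical per-token probabilities assigned by an LM to one CS sentence.
langs = ["zh", "zh", "en", "en", "zh"]
probs = [0.1, 0.2, 0.01, 0.2, 0.02]
log_probs = [math.log(p) for p in probs]

cs_ppl = perplexity(log_probs)                              # over all five tokens
csp_ppl = perplexity(cs_point_log_probs(log_probs, langs))  # over the two switch points
```

In this toy case the switch points are the hardest tokens, so `csp_ppl` is much larger than `cs_ppl`, and improving only the three non-switch tokens would lower `cs_ppl` while leaving `csp_ppl` unchanged.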

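Without normalization

| Training data | Row | Method | CS-PPL | CSP-PPL | ZH | EN | Overall |
|---|---|---|---|---|---|---|---|
| (A) Monolingual only | (a) | Baseline | 424.80 | 1118.88 | 160.40 | 125.41 | 289.20 |
| (A) Monolingual only | (b) | SKLD | 319.71 | 752.03 | 152.66 | 115.50 | 228.79 |
| (A) Monolingual only | (c) | CD | 328.04 | 778.55 | 150.78 | 112.11 | 231.83 |
| (B) Pseudo training data, word substitution | (d) | Baseline | 348.88 | 884.74 | 156.90 | 119.98 | 246.41 |
| (B) Pseudo training data, word substitution | (e) | SKLD | 298.24 | 671.38 | 157.53 | 120.36 | 219.62 |
| (B) Pseudo training data, word substitution | (f) | CD | 296.84 | 680.19 | 156.09 | 117.10 | 217.56 |
| (C) Pseudo training data, sentence concatenation | (g) | Baseline | 340.34 | 831.19 | 160.21 | 138.89 | 248.83 |
| (C) Pseudo training data, sentence concatenation | (h) | SKLD | 289.64 | 628.09 | 152.27 | 126.06 | 216.39 |
| (C) Pseudo training data, sentence concatenation | (i) | CD | 293.98 | 652.35 | 150.76 | 124.05 | 217.83 |

With normalization

| Training data | Row | Method | CS-PPL | CSP-PPL | ZH | EN | Overall |
|---|---|---|---|---|---|---|---|
| (D) Monolingual only | (j) | Baseline | 311.77 | 754.21 | 123.28 | 90.71 | 212.44 |
| (D) Monolingual only | (k) | SKLD | 277.94 | 601.58 | 130.11 | 96.27 | 197.15 |
| (D) Monolingual only | (l) | CD | 282.24 | 602.35 | 132.94 | 97.86 | 200.33 |
| (E) Pseudo training data, word substitution | (m) | Baseline | 264.93 | 583.65 | 131.31 | 97.50 | 190.79 |
| (E) Pseudo training data, word substitution | (n) | SKLD | 248.87 | 512.27 | 136.85 | 101.12 | 184.14 |
| (E) Pseudo training data, word substitution | (o) | CD | 251.60 | 517.85 | 138.48 | 101.27 | 185.84 |
| (F) Pseudo training data, sentence concatenation | (p) | Baseline | 266.11 | 586.83 | 123.31 | 95.82 | 189.88 |
| (F) Pseudo training data, sentence concatenation | (q) | SKLD | 241.73 | 490.00 | 128.75 | 102.44 | 179.83 |
| (F) Pseudo training data, sentence concatenation | (r) | CD | 247.60 | 499.41 | 128.91 | 103.90 | 183.49 |

Table 1: ZH, EN, CS-PPL, CSP-PPL and overall perplexity on the testing set.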

3.4 Implementation

Due to the limited amount of training data, we adopted only a single recurrent layer with long short-term memory (LSTM) cells for language modeling [sundermeyer2012lstm]. The hidden size for both the input projection and the LSTM cells was set to 300. We used a dropout of 0.3 for better generalization, and trained the models using Adam with an initial learning rate of 0.001. In order to obtain better results, the training procedure was stopped when the overall perplexity on the validation set did not decrease for 10 epochs. All reported results are the average of 3 runs.
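The stopping rule can be sketched independently of any training framework. The function below is a hypothetical helper that takes a sequence of per-epoch validation perplexities and reports where training would stop, assuming "did not decrease for 10 epochs" means 10 consecutive epochs without a new best value.

```python
def early_stopping_epoch(validation_ppls, patience=10):
    """Return (best_epoch, best_ppl, stop_epoch) for a sequence of validation perplexities.

    Training stops at the first epoch that is `patience` epochs past the best one;
    if that never happens, stop_epoch is the last epoch in the sequence.
    """
    best_ppl = float("inf")
    best_epoch = -1
    for epoch, ppl in enumerate(validation_ppls):
        if ppl < best_ppl:
            best_ppl, best_epoch = ppl, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, best_ppl, epoch
    return best_epoch, best_ppl, len(validation_ppls) - 1

# Made-up validation curve: improves for three epochs, then plateaus.
curve = [300.0, 250.0, 240.0] + [245.0] * 20
best_epoch, best_ppl, stop_epoch = early_stopping_epoch(curve)
```

For this curve the best value is reached at epoch 2 (zero-indexed) and training stops at epoch 12, after ten epochs without improvement.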

4 Results

4.1 Language Modeling

The results are in Table 1, which contains results for (A) the language model trained with monolingual data only; (B) word substitution, with the substitution probability set to 0.2, the value that achieved the lowest perplexity in a grid search; and (C) sentence concatenation as mentioned in Section 3.2. (D), (E), and (F) are the results after applying the normalization from Section 2.3 to (A), (B), and (C) respectively. Baselines in rows (a)(d)(g) represent the language model trained without constraints or normalization. (A smoothed 5-gram model was also evaluated, but it yielded worse performance than the baseline; due to limited space, we omit those results.)

Comparing rows (a)(d)(g), we observe that learning with pseudo-CS sentences indeed helps considerably in CS perplexity, which is reasonable because the LM has seen CS cases during training even though the training data is synthetic. However, comparing rows (b)(c) with (d) and (g) reveals that after applying additional constraints, the LM trained on monolingual data only is comparable or even better in terms of both monolingual (ZH and EN columns) and CS (CS-PPL and CSP-PPL columns) perplexity than LMs trained with pseudo-CS data.

Whether using monolingual or pseudo-CS data for training, normalizing the output projection matrix generally improves language modeling. Even when the LM is trained with monolingual data only, normalization improves CSP-PPL, from 1118.88 in row (a) to 754.21 in row (j). Thus we conclude that the monolingual data in our corpus has a similar sentence structure, and normalization yields a similar latent space, aiding in switching between languages. After applying SKLD and normalization together, the CSP-PPL improves further, yielding the best results in the monolingual data training case (row (k)). The perplexity of CS points is reduced significantly when constraints are applied on the output projection matrix by minimizing SKLD or CD, with little or no degradation on monolingual data.

Rows (k)(n)(q) also show that combining the SKLD constraint with normalization gives the best CS-PPL, CSP-PPL, and overall perplexity within each training-data setting, for both monolingual-only and pseudo-CS data.

Fig. 2: PCA visualization with different training strategies: (a) Baseline, monolingual (CSP-PPL: 1102); (b) Baseline, word substitution (CSP-PPL: 877); (c) SKLD, monolingual (CSP-PPL: 750). Note that the figures are plotted from 1 out of 3 runs.

4.2 Visualization

In addition to numerical analysis, we seek to determine whether the degree of overlap in the embedding space is aligned with the perplexity results. We applied principal component analysis (PCA) to the output projection matrix, and then visualized the results on a 2-D plane. Fig. 2 shows the visualized results of different approaches. Fig. 2(a) shows that the embeddings of the two languages are linearly separable when training with monolingual data only and without applying any proposed approach. After synthesizing pseudo-CS data for training, as shown in Fig. 2(b), the embeddings of the two languages are closer than in Fig. 2(a) but do not overlap much. In Fig. 2(c), they overlap almost completely. This corresponds to the numerical results in Table 1: the closer the embeddings are, the lower the perplexity is. (Due to limited space, we do not show the visualization results for sentence concatenation and CD, which are quite similar to Fig. 2(b) and Fig. 2(c) respectively.)

4.3 Unsupervised Bilingual Word Translation

To analyze whether words with equivalent semantics in different languages are mapped together with the proposed approaches, we conducted experiments on unsupervised bilingual word translation. Given a word $w$ that appears in the bilingual pair mapping mentioned in Section 3.2, each word in the other language is ranked according to the cosine similarity of their embeddings. If the translation of $w$ is ranked as the $r$-th candidate, then the reciprocal rank is $1/r$. The mean reciprocal rank (MRR), the average of the reciprocal ranks, is used as an evaluation metric; the MRR is therefore at most 1.0, and the closer to 1.0 the better. The proportion of correct translations that appear in the top 10 candidate list ($r \le 10$) is also reported as "P@10" [Xing2015NormalizedWE]. To mitigate the degradation in performance caused by low-frequency words, we selected only words with a frequency greater than 80, resulting in about 200 vocabulary words each in Mandarin and English, and 55 bilingual pairs used for unsupervised bilingual word translation.

The results of bilingual word translation are in Table 2. Performance for Mandarin-to-English translation (column (A)) is worse in both MRR and P@10 than that in the reverse direction (column (B)). Row (i) demonstrates that the unconstrained baseline performs poorly, whereas normalization in row (ii) and the SKLD constraint with normalization in row (iii) yield clearly improved MRR and P@10 compared with row (i). This suggests that constraints and normalization for CS language modeling indeed enhance semantic mapping.
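The evaluation described above can be sketched as follows. The embeddings and word pairs are toy values chosen so that one translation is ranked first and the other second; the function names are illustrative, and ties in cosine similarity are broken by dictionary order, which is an assumption.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_of_translation(query_vec, target_word, tgt_emb):
    """1-based rank of `target_word` among all target-language words, by cosine similarity."""
    ranked = sorted(tgt_emb,
                    key=lambda w: cosine_similarity(query_vec, tgt_emb[w]),
                    reverse=True)
    return ranked.index(target_word) + 1

def mrr_and_precision_at_k(pairs, src_emb, tgt_emb, k=10):
    """Mean reciprocal rank and the fraction of pairs whose translation is in the top k."""
    ranks = [rank_of_translation(src_emb[s], t, tgt_emb) for s, t in pairs]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    p_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, p_at_k

# Toy embeddings (made up); in the paper these would be rows of the output projection matrix.
src_emb = {"你": [1.0, 0.0], "我": [0.0, 1.0]}
tgt_emb = {"you": [0.9, 0.1], "i": [0.1, 0.3], "like": [0.0, 1.0], "tea": [-1.0, 0.0]}
pairs = [("你", "you"), ("我", "i")]

mrr, p_at_10 = mrr_and_precision_at_k(pairs, src_emb, tgt_emb, k=10)
```

Here "you" is ranked first for 你 and "i" second for 我, so the MRR is (1 + 1/2) / 2 = 0.75, while P@10 is 1.0 because both translations fall within the top 10.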

| Approach | (A) Mandarin → English: MRR | (A) Mandarin → English: P@10 | (B) English → Mandarin: MRR | (B) English → Mandarin: P@10 |
|---|---|---|---|---|
| (i) Baseline | 0.0274 | 5.4% | 0.0718 | 20.0% |
| (ii) + Normalization | 0.0554 | 14.5% | 0.0885 | 23.6% |
| (iii) SKLD + normalization | 0.1024 | 21.8% | 0.1496 | 30.9% |

Table 2: Results for unsupervised bilingual word translation using different approaches, all with monolingual training: translation (A) from Mandarin to English and (B) from English to Mandarin.
| | (A) Input | (B) Baseline | (C) SKLD + normalization |
|---|---|---|---|
| (i) | 你知道 maybe (you know maybe) | 你知道 maybe i think (you know maybe i think) | 你知道 maybe 你要去那边的时候就会 (you know maybe when you go there you will) |
| (ii) | they think 这里 (they think here) | they think 这里的时候我就会去了 (when they think here i will go) | they think 这里 is like a lot of people (they think here is like a lot of people) |

Table 3: Example generated sentences for different approaches, all with monolingual training: the CS point is (i) from English to Mandarin, and (ii) from Mandarin to English. English translation in parentheses.

4.4 Sentence Generation

We further evaluated the sentence generation ability of language models trained only with monolingual data. Given part of a sentence, we used the language model to complete the sentence. Two generated sentences and their given inputs are shown in Table 3. Our best approach with SKLD constraint and normalization, listed in column (C), switches languages either from English to Mandarin (row (i)) or from Mandarin to English (row (ii)). However, the baseline model in column (B) fails to code-switch from either side.

5 Conclusions

In this work, we train a code-switching language model with monolingual data by constraining and normalizing the output projection matrix, yielding improved performance. We also present an analysis of selected results, which shows that our approaches help the monolingual embedding spaces overlap and improve the results on bilingual word translation evaluation.

References