# Training a Code-Switching Language Model with Monolingual Data

###### Abstract

A lack of code-switching data complicates the training of code-switching (CS) language models. We propose an approach to train such CS language models on monolingual data only. By constraining and normalizing the output projection matrix in RNN-based language models, we bring embeddings of different languages closer to each other. Numerical and visualization results show that the proposed approaches remarkably improve the performance of CS language models trained on monolingual data. The proposed approaches are comparable or even better than training CS language models with artificially generated CS data. We additionally use unsupervised bilingual word translation to analyze whether semantically equivalent words in different languages are mapped together.

Shun-Po Chuang†, Tzu-Wei Sung, Hung-yi Lee

Graduate Institute of Communication Engineering, National Taiwan University

{f04942141, b03902042, hungyilee}@ntu.edu.tw

† This work is sponsored by the Ministry of Science and Technology.

Index Terms— Code-Switching, Language Model

## 1 Introduction

Code-switching (CS), which occurs when two or more languages are used within a document or a sentence, is widely observed in multicultural areas. Related research is characterized by a lack of data; the application of prior knowledge [zeng2017improving, 6639306] or additional constraints [li2013improved, ying2014language] would alleviate this issue. Because it is easier to collect monolingual data than CS data, efficiently utilizing a large amount of monolingual data would be a solution to the lack of CS data [hamed2017building]. Recent work [gonen2018language] attempts to train a CS language model using fine-tuning. Similar work [garg2017dual] integrates two monolingual language models (LMs) by introducing a special “switch” token in both languages when training the LM, and further incorporating this within automatic speech recognition (ASR). Other works synthesize additional CS text using the modeled distribution from the data [winata2018learn, yilmaz2018acoustic]. Generative adversarial neural networks [goodfellow2014generative, arjovsky2017wasserstein] learn the CS point distribution from CS text [chang2018code]. In this paper, we propose utilizing constraints to bring word embeddings of different languages closer together in the same latent space, and to normalize each word vector to generally improve the CS LM. Similar constraints are used in end-to-end ASR [khassanov2019constrained], but have not yet been reported for CS language modeling. Related prior work [audhkhasi2017direct, settle2019acoustically] attempts to initialize the word embedding with unit-normalized vectors in ASR but does not keep the unit norm during training. Initial experiments on CS data showed that constraining and normalizing the output projection matrix helps LMs trained on monolingual data to better handle CS data.

## 2 Code-Switching Language Modeling

In our approach, we use monolingual data only for training; CS data is for validation and testing only.

### 2.1 RNN-based Language Model

We adopt a recurrent neural network (RNN) based language model [Mikolov2010RecurrentNN]. Given a sequence of words $[{w}_{1},{w}_{2},\mathrm{\dots},{w}_{T}]$, we obtain predictions ${y}_{i}$ by applying transformation $W$ on RNN hidden states ${h}_{i}$ with softmax computation:

$${y}_{i}=\mathrm{softmax}(W{h}_{i})\qquad(1)$$

where $i=1,2,\mathrm{\dots},T$ and ${h}_{0}$ is a zero vector. Specifically, the output projection matrix is denoted by $W\in {\mathbb{R}}^{V\times z}$, where $V$ is the vocabulary size and $z$ is the hidden layer size of the RNN. Gradient descent is then used to update the parameters with a cross entropy loss function. Consider two languages $\mathit{L1}$ and $\mathit{L2}$ in CS language modeling: the output projection matrix $W$ is partitioned into ${W}_{1}$ and ${W}_{2}$, whose rows are the latent representations of the words in $\mathit{L1}$ and $\mathit{L2}$ respectively. With the vocabulary ordered so that all $\mathit{L1}$ words precede all $\mathit{L2}$ words, the output projection matrix $W$ can be written as $\left[\begin{array}{c}{W}_{1}\\ {W}_{2}\end{array}\right]$.
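As a minimal sketch of Eq. (1) with a partitioned output projection matrix, the following plain-Python example stacks two toy blocks `W1` and `W2` and computes the probability mass assigned to each language. The matrix values, the sizes, and the helper names `softmax` and `predict` are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, h):
    """Eq. (1): y = softmax(W h), with W given as V rows of length z."""
    scores = [sum(w_d * h_d for w_d, h_d in zip(row, h)) for row in W]
    return softmax(scores)

# Toy sizes (assumed): 2 words in L1, 3 words in L2, hidden size z = 2.
W1 = [[0.5, -0.1], [0.2, 0.3]]
W2 = [[-0.4, 0.6], [0.1, 0.1], [0.7, -0.2]]
W = W1 + W2          # rows stacked as [W1; W2], so V = 5
h = [1.0, 2.0]       # stands in for an RNN hidden state h_i

y = predict(W, h)
p_l1 = sum(y[:len(W1)])   # probability that the next word is in L1
p_l2 = sum(y[len(W1):])   # probability that the next word is in L2
```

Because the rows are grouped by language, slicing the output distribution at `len(W1)` is enough to read off how much probability the model places on switching languages.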

### 2.2 Constraints on Output Projection Matrix

By optimizing the LM with $\mathit{L1}$ and $\mathit{L2}$ monolingual data, it is possible to improve the perplexity on both sides. Word embedding distributions have arbitrary shapes based on their language characteristics. Without seeing bilingual word pairs, however, the two distributions may each converge to their own shape without correlating with each other, which makes it difficult to train an LM to switch between languages. To train an LM with only monolingual data, we assume that overlapping embeddings benefit CS language modeling. To this end, we attempt to bring the word embeddings of $\mathit{L1}$ and $\mathit{L2}$, that is, ${W}_{1}$ and ${W}_{2}$, closer to each other. We constrain ${W}_{1}$ and ${W}_{2}$ in two ways; Fig. 1 shows an overview of the proposed approach.

#### 2.2.1 Symmetric Kullback–Leibler Divergence

Kullback–Leibler divergence (KLD) is a well-known measurement of the distance between two distributions. Minimizing the KLD between language distributions overlaps the embedding space semantically. We assume that both ${W}_{1}$ and ${W}_{2}$ follow a $z$-dimensional multivariate Gaussian distribution, that is,

$${W}_{1}\sim N({\mu}_{1},{\mathrm{\Sigma}}_{1}),\qquad {W}_{2}\sim N({\mu}_{2},{\mathrm{\Sigma}}_{2}),$$

where ${\mu}_{1},{\mu}_{2}\in {\mathbb{R}}^{z}$ and ${\mathrm{\Sigma}}_{1},{\mathrm{\Sigma}}_{2}\in {\mathbb{R}}^{z\times z}$ are the mean vectors and covariance matrices for ${W}_{1}$ and ${W}_{2}$ respectively. Under the Gaussian assumption, the KLD between ${W}_{1}$ and ${W}_{2}$ has a closed form. Because KLD is asymmetric, we adopt the symmetric form of KLD (SKLD), that is, the sum of the KLD between ${W}_{1}$ and ${W}_{2}$ and that between ${W}_{2}$ and ${W}_{1}$:

$${L}_{\mathrm{\mathit{SKLD}}}=\frac{1}{2}\left[\mathrm{tr}\left({\mathrm{\Sigma}}_{1}^{-1}{\mathrm{\Sigma}}_{2}+{\mathrm{\Sigma}}_{2}^{-1}{\mathrm{\Sigma}}_{1}\right)+{({\mu}_{1}-{\mu}_{2})}^{T}\left({\mathrm{\Sigma}}_{1}^{-1}+{\mathrm{\Sigma}}_{2}^{-1}\right)({\mu}_{1}-{\mu}_{2})-2z\right].$$
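The SKLD expression can be checked numerically with a short NumPy sketch. This is only an illustration under stated assumptions: the function names `gaussian_stats` and `skld` are hypothetical, and the small `eps` term added to the covariance diagonal is an assumed safeguard to keep the matrices invertible, not something the paper describes. Using the loss during training would additionally require a framework with automatic differentiation.

```python
import numpy as np

def gaussian_stats(W, eps=1e-6):
    """Mean vector and covariance matrix of the rows of W (shape n x z).

    eps * I is an assumed regularizer that keeps the covariance invertible.
    """
    mu = W.mean(axis=0)
    sigma = np.cov(W, rowvar=False) + eps * np.eye(W.shape[1])
    return mu, sigma

def skld(W1, W2):
    """Symmetric KL divergence between Gaussians fitted to W1 and W2."""
    mu1, s1 = gaussian_stats(W1)
    mu2, s2 = gaussian_stats(W2)
    s1_inv, s2_inv = np.linalg.inv(s1), np.linalg.inv(s2)
    d = mu1 - mu2
    z = W1.shape[1]
    trace_term = np.trace(s1_inv @ s2 + s2_inv @ s1)
    mean_term = d @ (s1_inv + s2_inv) @ d
    return 0.5 * (trace_term + mean_term - 2 * z)

# Toy embeddings (assumed): 50 L1 words and 60 L2 words with z = 3.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(50, 3))
W2 = rng.normal(loc=2.0, size=(60, 3))
loss = skld(W1, W2)
```

The log-determinant terms of the two directed divergences cancel in the sum, which is why they do not appear in the formula or in the code.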

#### 2.2.2 Cosine Distance

Cosine distance (CD) is a common measurement for semantic evaluation. By minimizing CD, we attempt to bring the semantic latent spaces of the two languages closer. As with SKLD, we compute the mean vectors ${\mu}_{1}$ and ${\mu}_{2}$ of ${W}_{1}$ and ${W}_{2}$ respectively, and the CD between the two mean vectors is obtained as

$${L}_{\mathrm{\mathit{CD}}}=1-\frac{{\mu}_{1}\cdot {\mu}_{2}}{\parallel {\mu}_{1}\parallel \parallel {\mu}_{2}\parallel},$$

where $\parallel \cdot \parallel $ denotes the ${\mathrm{\ell}}^{2}$ norm. We hypothesize the latent representation of each word in $\mathit{L1}$ and $\mathit{L2}$ is distributed in the same semantic space and overlaps by minimizing SKLD or CD.
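The cosine distance between the two mean vectors can be sketched in a few lines of plain Python. The helper names `mean_vector`, `cosine_distance`, and `cd_loss` are hypothetical, and the toy matrices are assumed values used only to show the range of the loss.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine_distance(u, v):
    """1 - cos(u, v), using the l2 norm of each vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cd_loss(W1, W2):
    """L_CD between the mean embeddings of the two languages."""
    return cosine_distance(mean_vector(W1), mean_vector(W2))

# Means with the same orientation give 0, orthogonal means give 1,
# and opposite means give 2.
same = cd_loss([[1.0, 0.0], [3.0, 0.0]], [[2.0, 0.0]])
orthogonal = cd_loss([[1.0, 0.0]], [[0.0, 1.0]])
opposite = cd_loss([[1.0, 0.0]], [[-1.0, 0.0]])
```

Note that this loss depends only on the direction of the two mean vectors, so it is a weaker constraint than SKLD, which also compares the covariance matrices.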

### 2.3 Output Projection Matrix Normalization

Apart from the constraints from Section 2.2, we propose normalizing the output projection matrix, that is, each word representation is divided by its ${\mathrm{\ell}}^{2}$ norm so that it has unit norm. Note that normalization is independent of the constraints, and the two can be applied together. To motivate normalization, consider semantically equivalent words ${w}_{j}$ and ${w}_{k}$: the cosine similarity between their latent representations ${v}_{j}$ and ${v}_{k}$ should be 1, implying the angle between them is 0, that is, they have the same orientation. By Eq. (1), we observe that the probabilities ${y}_{i,j}=\frac{\mathrm{exp}({v}_{j}\cdot {h}_{i})}{{\sum}_{m=1}^{V}\mathrm{exp}({v}_{m}\cdot {h}_{i})}$ and ${y}_{i,k}=\frac{\mathrm{exp}({v}_{k}\cdot {h}_{i})}{{\sum}_{m=1}^{V}\mathrm{exp}({v}_{m}\cdot {h}_{i})}$ are not necessarily equal because the magnitudes of ${v}_{j}$ and ${v}_{k}$ might not be the same. However, when every word vector has unit norm, two vectors with the same orientation are identical, so normalization guarantees that given the same history, the probabilities of two semantically equivalent words generated by the LM will be equal. Thus normalization is helpful for clustering semantically equivalent words in the embedding space, which improves language modeling in general.
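The effect of normalization on two same-orientation vectors can be illustrated with a toy example in plain Python. The matrix `W`, the hidden state `h`, and the helper names are assumed for illustration: the first two rows point in the same direction but differ in magnitude, standing in for two semantically equivalent words.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit_rows(W):
    """Divide every row of W by its l2 norm so each row has unit norm."""
    return [[x / math.sqrt(dot(row, row)) for x in row] for row in W]

# Rows 0 and 1 have the same orientation but different magnitudes.
W = [[1.0, 1.0], [3.0, 3.0], [-1.0, 0.5]]
h = [0.4, 0.2]   # stands in for a shared history h_i

y_raw = softmax([dot(row, h) for row in W])
y_norm = softmax([dot(row, h) for row in unit_rows(W)])
```

Before normalization the larger vector receives more probability (`y_raw[1] > y_raw[0]`), while after normalization the two words are scored identically for any history.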

## 3 Experimental Setup

### 3.1 Corpus

The South East Asia Mandarin-English (SEAME) corpus [Lyu2010SEAMEAM] was used for the following experiments. Its transcriptions fall into two parts according to the language used. The first part is monolingual, containing pure Mandarin and pure English transcriptions, the two main languages in this corpus. The second part consists of code-switching (CS) sentences, where the transcriptions are a mix of words from the two languages. The original data consists of train, dev_man, and dev_sgn (https://github.com/zengzp0912/SEAME-dev-set). Each split contains monolingual and CS sentences, but dev_man and dev_sgn are dominated by Mandarin and English respectively. We held out 1000 Mandarin, 1000 English, and all CS sentences from train as the validation set; all CS sentences could be held out because only monolingual data is needed to train the LM. The remaining monolingual sentences formed the training set. Similar to prior work [khassanov2019constrained], we used dev_man and dev_sgn for testing, but to balance the Mandarin-to-English ratio, we combined them into a single testing set.

### 3.2 Pseudo Code-switching Training Data

To compare the performance of the constraints and normalization with an LM trained on CS data, we also introduce pseudo-CS data training, in which we use monolingual data to generate artificial CS sentences. Two approaches are used to generate pseudo-CS data:

Word substitution: Given only monolingual data, we randomly replace a word in monolingual sentences with its corresponding word in the other language, based on the substitution probability, to produce CS data. However, this requires a vocabulary mapping between the two languages. We thus use the bilingual translated pair mapping provided by MUSE [conneau2017word] (https://github.com/facebookresearch/MUSE). Note that not all translated words are in our vocabulary set.

Sentence concatenation: We randomly sample sentences from different languages from the original corpus and concatenate them to construct a pseudo-CS sentence, which we add to the original monolingual corpus.
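The two generation strategies can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes that each word with a dictionary entry is replaced independently with the substitution probability, and that the order of the two concatenated sentences is random. The function names and the tiny dictionary are hypothetical.

```python
import random

def word_substitution(words, translation, prob, rng):
    """Replace each translatable word with probability prob (assumed per word)."""
    out = []
    for w in words:
        if w in translation and rng.random() < prob:
            out.append(translation[w])
        else:
            out.append(w)   # untranslatable words are always kept
    return out

def sentence_concatenation(l1_sentences, l2_sentences, rng):
    """Join one sampled sentence from each language in random order (assumed)."""
    pair = [rng.choice(l1_sentences), rng.choice(l2_sentences)]
    rng.shuffle(pair)
    return pair[0] + pair[1]

rng = random.Random(0)
toy_dict = {"think": "觉得"}   # hypothetical one-entry bilingual mapping
substituted = word_substitution(["i", "think", "so"], toy_dict, 0.2, rng)
concatenated = sentence_concatenation([["你", "知道"]], [["i", "think"]], rng)
```

Passing an explicit `random.Random` instance keeps the generated pseudo-CS corpus reproducible across runs.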

### 3.3 Evaluation Metrics

Perplexity (PPL) is a common measurement of language modeling. Lower perplexity indicates higher confidence in the predicted target. To better observe the effects of the techniques proposed above, we computed five kinds of perplexity on the corpus: 1) ZH: PPL of monolingual Mandarin sentences; 2) EN: PPL of monolingual English sentences; 3) CS-PPL: PPL of CS sentences; 4) CSP-PPL: PPL of CS points, which occur when the language of the next word is different from that of the current word; 5) Overall: PPL of the whole corpus, including monolingual and CS sentences. CS-PPL and CSP-PPL are measured separately because improvements in CS-PPL do not necessarily translate to improvements in CSP-PPL: as CS sentences often contain a majority of non-CS points, CS-PPL is likely to benefit more from improving monolingual perplexity than from improving CSP-PPL.
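The difference between sentence-level perplexity and CS-point perplexity can be made concrete with a small sketch. It assumes that per-word natural-log probabilities from the LM are already available, that a word counts as Mandarin if it contains a CJK ideograph, and that the word scored at a CS point is the one right after the language change. All helper names and the toy probabilities are hypothetical.

```python
import math

def is_mandarin(word):
    """Assumed language test: any CJK Unified Ideograph marks Mandarin."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in word)

def perplexity(log_probs):
    """PPL = exp(-mean natural-log probability) over a non-empty list."""
    return math.exp(-sum(log_probs) / len(log_probs))

def cs_point_log_probs(words, log_probs):
    """Keep the log-probs of words whose language differs from the previous word."""
    return [lp for prev, cur, lp in zip(words, words[1:], log_probs[1:])
            if is_mandarin(prev) != is_mandarin(cur)]

words = ["你", "知道", "maybe", "i", "think"]
probs = [0.1, 0.2, 0.01, 0.5, 0.25]          # toy model probabilities
log_probs = [math.log(p) for p in probs]

cs_ppl = perplexity(log_probs)                              # whole CS sentence
csp_ppl = perplexity(cs_point_log_probs(words, log_probs))  # switch points only
```

In this toy sentence only "maybe" is a CS point, so `csp_ppl` is driven entirely by its low probability, while `cs_ppl` is averaged down by the four non-switch words.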

Table 1: Perplexity results (lower is better).

| Training data and approach | CS-PPL | CSP-PPL | ZH | EN | Overall |
|---|---|---|---|---|---|
| Without normalization | | | | | |
| (A) Monolingual only | | | | | |
| (a) Baseline | 424.80 | 1118.88 | 160.40 | 125.41 | 289.20 |
| (b) SKLD | 319.71 | 752.03 | 152.66 | 115.50 | 228.79 |
| (c) CD | 328.04 | 778.55 | 150.78 | 112.11 | 231.83 |
| (B) Pseudo training data: word substitution | | | | | |
| (d) Baseline | 348.88 | 884.74 | 156.90 | 119.98 | 246.41 |
| (e) SKLD | 298.24 | 671.38 | 157.53 | 120.36 | 219.62 |
| (f) CD | 296.84 | 680.19 | 156.09 | 117.10 | 217.56 |
| (C) Pseudo training data: sentence concatenation | | | | | |
| (g) Baseline | 340.34 | 831.19 | 160.21 | 138.89 | 248.83 |
| (h) SKLD | 289.64 | 628.09 | 152.27 | 126.06 | 216.39 |
| (i) CD | 293.98 | 652.35 | 150.76 | 124.05 | 217.83 |
| With normalization | | | | | |
| (D) Monolingual only | | | | | |
| (j) Baseline | 311.77 | 754.21 | 123.28 | 90.71 | 212.44 |
| (k) SKLD | 277.94 | 601.58 | 130.11 | 96.27 | 197.15 |
| (l) CD | 282.24 | 602.35 | 132.94 | 97.86 | 200.33 |
| (E) Pseudo training data: word substitution | | | | | |
| (m) Baseline | 264.93 | 583.65 | 131.31 | 97.50 | 190.79 |
| (n) SKLD | 248.87 | 512.27 | 136.85 | 101.12 | 184.14 |
| (o) CD | 251.60 | 517.85 | 138.48 | 101.27 | 185.84 |
| (F) Pseudo training data: sentence concatenation | | | | | |
| (p) Baseline | 266.11 | 586.83 | 123.31 | 95.82 | 189.88 |
| (q) SKLD | 241.73 | 490.00 | 128.75 | 102.44 | 179.83 |
| (r) CD | 247.60 | 499.41 | 128.91 | 103.90 | 183.49 |

### 3.4 Implementation

Due to the limited amount of training data, we adopted only a single recurrent layer with long short-term memory (LSTM) cells for language modeling [sundermeyer2012lstm]. The hidden size for both the input projection and the LSTM cells was set to 300. We used a dropout of 0.3 for better generalization, and trained the models using Adam with an initial learning rate of 0.001. In order to obtain better results, the training procedure was stopped when the overall perplexity on the validation set did not decrease for 10 epochs. All reported results are the average of 3 runs.
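The stopping rule can be sketched independently of the model. The function below is a hypothetical helper that takes a list of per-epoch validation perplexities and returns the epoch at which training would stop under a patience of 10; the perplexity values are made up for illustration.

```python
def early_stopping_epoch(val_ppls, patience=10):
    """Return the 0-based epoch at which training stops.

    Training stops once the validation perplexity has not improved on the
    best value seen so far for `patience` consecutive epochs. If that never
    happens, the last epoch is returned.
    """
    best = float("inf")
    since_best = 0
    for epoch, ppl in enumerate(val_ppls):
        if ppl < best:
            best, since_best = ppl, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_ppls) - 1

# Toy curve (assumed values): the best perplexity is reached at epoch 2.
history = [300.0, 250.0, 240.0] + [241.0] * 15
stop_epoch = early_stopping_epoch(history)
```

With this curve the best value occurs at epoch 2, so training stops ten epochs later, at epoch 12.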

## 4 Results

### 4.1 Language Modeling

The results are in Table 1, which contains results for (A) the language model trained with monolingual data only; (B) word substitution with a substitution probability of 0.2, which achieved the lowest perplexity in a grid search over the substitution probability; and (C) sentence concatenation as described in Section 3.2. (D), (E), and (F) are the results after applying the normalization from Section 2.3 to (A), (B), and (C) respectively. The baselines in rows (a)(d)(g) represent the language model trained without constraints or normalization. (A smoothed 5-gram model was also evaluated, but it yielded worse performance than the baseline; due to limited space, we omit those results here.)

Rows (a)(d)(g) show that learning with pseudo-CS sentences indeed helps considerably in CS perplexity, which is reasonable because the LM has seen CS cases during training even though the training data is synthetic. However, comparing rows (b)(c) with (d) and (g) reveals that after applying additional constraints, the LM trained on monolingual data only is comparable or even better in terms of both monolingual (ZH and EN columns) and CS (CS-PPL and CSP-PPL columns) perplexity than LMs trained with pseudo-CS data.

Whether monolingual or pseudo-CS data is used for training, normalizing the output projection matrix generally improves language modeling. Even when the LM is trained with monolingual data only, normalization also improves CSP-PPL, as shown in rows (a) and (j). Thus we conclude that the monolingual data in our corpus has a similar sentence structure, and normalization yields a similar latent space, aiding in switching between languages. After applying SKLD and normalization together, the CSP-PPL improves further, yielding the best results in the monolingual data training case.

The perplexity of CS points is reduced significantly when constraints are applied on the output projection matrix by minimizing SKLD or CD, without degrading the performance on monolingual data, as rows (a) to (c) show. Rows (k)(n)(q) also show that combining the SKLD constraint with normalization results in the best CS-PPL, CSP-PPL, and overall perplexity for both monolingual-only and pseudo-CS training data.

### 4.2 Visualization

In addition to the numerical analysis, we seek to determine whether the degree of overlap in the embedding space is aligned with the perplexity results. We applied principal component analysis (PCA) to the output projection matrix, and then visualized the results on a 2-D plane. Fig. 2 shows the visualized results of different approaches. Fig. 2(a) shows that the embeddings of the two languages are linearly separable with monolingual data only and without applying any proposed approach. After synthesizing pseudo-CS data for training, as shown in Fig. 2(b), the embeddings of the two languages are closer than in Fig. 2(a) but without excessive overlap. In Fig. 2(c), they totally overlap. This corresponds to the numerical results in Table 1: the closer the embeddings are, the lower the perplexity is. (Due to limited space, we do not show the visualization results of sentence concatenation and CD, which are quite similar to Fig. 2(b) and Fig. 2(c) respectively.)
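A 2-D PCA projection of an output projection matrix can be sketched with NumPy's singular value decomposition. The function name `pca_2d` and the random toy embeddings are assumptions for illustration; plotting the two language blocks in different colors is left to any plotting library.

```python
import numpy as np

def pca_2d(W):
    """Project the rows of W (shape V x z) onto the first two principal components."""
    centered = W - W.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy embeddings (assumed): 20 L1 words and 30 L2 words with z = 5,
# drawn from shifted distributions so the two languages are separated.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(20, 5))
W2 = rng.normal(loc=3.0, size=(30, 5))

points = pca_2d(np.vstack([W1, W2]))
points_l1, points_l2 = points[:len(W1)], points[len(W1):]
```

PCA is fitted on the stacked matrix so that both languages share the same axes; fitting each language separately would hide the separation that the visualization is meant to reveal.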

### 4.3 Unsupervised Bilingual Word Translation

To analyze whether words with equivalent semantics in different languages are mapped together with the proposed approaches, we conducted experiments on unsupervised bilingual word translation. Given a word $w$ that appears in the bilingual pair mapping mentioned in Section 3.2, each word in the other language is ranked according to the cosine similarity of their embeddings. If the translated word of $w$ is ranked as the $r$-th candidate, then the reciprocal rank is $\frac{1}{r}$. The mean reciprocal rank (MRR), the average of the reciprocal ranks, is used as an evaluation metric; the MRR is at most 1.0, and the closer to 1.0 the better. The proportion of correct translations that are in the top 10 candidate list ($r\le 10$) is also reported as “P@10” [Xing2015NormalizedWE]. In order to mitigate the degradation in performance caused by low-frequency words, we selected only words with a frequency greater than 80, resulting in about 200 vocabulary words in each of Mandarin and English, and 55 bilingual pairs used for unsupervised bilingual word translation. The results of bilingual word translation are in Table 2. Performance for Mandarin-to-English translation (column (A)) is worse in both MRR and P@10 than that in the reverse direction (column (B)). Row (i) demonstrates that the unconstrained baseline performs poorly, whereas normalization in row (ii) and SKLD with normalization in row (iii) yield significantly improved MRR and P@10 compared with row (i). This suggests that constraints and normalization for CS language modeling indeed enhance semantic mapping.
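The ranking-based evaluation can be sketched in plain Python. The embeddings, the word pairs, and the helper names below are toy assumptions; the sketch only shows how the reciprocal rank and the top-k precision are derived from cosine-similarity rankings.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_of_translation(src_vec, tgt_embs, gold):
    """1-based rank of the gold translation among all target-language words."""
    ranked = sorted(tgt_embs,
                    key=lambda w: cosine_similarity(src_vec, tgt_embs[w]),
                    reverse=True)
    return ranked.index(gold) + 1

def evaluate(pairs, src_embs, tgt_embs, k=10):
    """Return (MRR, precision at k) over the given bilingual pairs."""
    ranks = [rank_of_translation(src_embs[s], tgt_embs, t) for s, t in pairs]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    p_at_k = sum(r <= k for r in ranks) / len(ranks)
    return mrr, p_at_k

# Toy embeddings (assumed): "a" translates to "x" and "b" translates to "z".
src_embs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
tgt_embs = {"x": [1.0, 0.1], "y": [0.1, 1.0], "z": [1.0, 1.0]}
pairs = [("a", "x"), ("b", "z")]

mrr, p_at_1 = evaluate(pairs, src_embs, tgt_embs, k=1)
```

Here "x" is ranked first for "a" and "z" is ranked second for "b", so the reciprocal ranks are 1 and 1/2; with only three target words, `k=1` is used in place of 10 to keep the toy example informative.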

Table 2: Results of unsupervised bilingual word translation.

| Approach | (A) Mandarin $\to $ English: MRR | (A) Mandarin $\to $ English: P@10 | (B) English $\to $ Mandarin: MRR | (B) English $\to $ Mandarin: P@10 |
|---|---|---|---|---|
| (i) Baseline | 0.0274 | 5.4% | 0.0718 | 20.0% |
| (ii) + Normalization | 0.0554 | 14.5% | 0.0885 | 23.6% |
| (iii) SKLD + normalization | 0.1024 | 21.8% | 0.1496 | 30.9% |

Table 3: Generated sentences and their given inputs, with English glosses in parentheses.

| | (A) Input | (B) Baseline | (C) SKLD + normalization |
|---|---|---|---|
| (i) | 你知道 maybe | 你知道 maybe i think | 你知道 maybe 你要去那边的时候就会 |
| | (you know maybe) | (you know maybe i think) | (you know maybe when you go there you will) |
| (ii) | they think 这里 | they think 这里的时候我就会去了 | they think 这里 is like a lot of people |
| | (they think here) | (when they think here i will go) | (they think here is like a lot of people) |

### 4.4 Sentence Generation

We further evaluated the sentence generation ability of language models trained only with monolingual data. Given part of a sentence, we used the language model to complete the sentence. Two generated sentences and their given inputs are shown in Table 3. Our best approach with SKLD constraint and normalization, listed in column (C), switches languages either from English to Mandarin (row (i)) or from Mandarin to English (row (ii)). However, the baseline model in column (B) fails to code-switch from either side.

## 5 Conclusions

In this work, we train a code-switching language model with monolingual data by constraining and normalizing the output projection matrix, yielding improved performance. We also present an analysis of selected results, which shows that our approaches help the monolingual embedding spaces overlap and improve the measurements on bilingual word translation evaluation.