Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training

  • 2019-11-10 07:11:04
  • Hila Gonen, Yoav Goldberg


We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup which is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we explore a variety of training protocols and verify the effectiveness of training with large amounts of monolingual data followed by fine-tuning with small amounts of code-switched data, for both the generative and discriminative cases.


Hila Gonen1    Yoav Goldberg1,2
1Department of Computer Science, Bar-Ilan University
2Allen Institute for Artificial Intelligence


1 Introduction

This work deals with neural language modeling of code-switched language, motivated by an application to speech recognition. Code-switching (CS) is a linguistic phenomenon defined as “the alternation of two languages within a single discourse, sentence or constituent” (Poplack, 1980). Since CS is widely used in spoken platforms, dealing with code-switched language becomes an important challenge for automatic speech recognition (ASR) systems. To get a sense of how an ASR system trained on monolingual data performs on code-switched language, we fed the IBM English and Spanish systems (footnote 1: speech-to-text/) with audio files of code-switched conversations from the Bangor Miami Corpus (see Section 7). The results (examples available in Table 1) exhibit two failure modes: (1) words and sentences in the opposite language are not recognized correctly; and (2) code-switch points also hurt recognition of words from the main language. This demonstrates the need for designated speech recognition systems for CS. A crucial component in such a CS ASR system is a strong CS language model, which is used to rank the bilingual candidates produced by the ASR system for a given acoustic signal.

Original sentence (audio): no pero vino porque he came to check on my machine
English model output: No but we no longer he came to check on my machine
Spanish model output: Cual provino de que en dicha cámara chino

Original sentence (audio): yo le invité pa Easter porque me daba pena el pobre aquí solo sin familia ni nada
English model output: feeling betrayed by eastern lap and I bought a new solo seem funny any now
Spanish model output: y el envite país tampoco nada pero por aquí solo sin familia ni nada

Table 1: Examples of the output of IBM’s English and Spanish speech recognition systems on code-switched audio.

Language models are traditionally evaluated with perplexity. However, this measure suffers from several shortcomings, in particular strong dependence on vocabulary size and lack of ability to directly evaluate scores assigned to malformed sentences. We address these deficiencies by presenting a new evaluation scheme for LM that simulates an ASR setup. Rather than requiring the LM to produce a probability for a gold sentence, we instead present the LM with a set of alternative sentences, including the gold one, and require it to rank the gold one higher. This evaluation is more realistic since it simulates the role a language model plays in the process of converting audio into text (Section 3). We create such an evaluation dataset for English-Spanish CS – SpanglishEval (Section 4).

Additionally, LM for CS poses a unique challenge: while data for training monolingual LMs is easy to obtain in large quantities, CS occurs primarily in spoken language. This severely limits data availability; thus, small amounts of training data are an intrinsic property of CS LM. A natural approach to overcome this problem is to train monolingual language models (for which we have huge amounts of data) for each language separately, and then combine them into a code-switched LM. While this is relatively straightforward to do in an n-gram language model, it is not obvious how to perform such an LM combination in a non-Markovian, RNN-based language model. We use a protocol for LSTM-based CS LM training which can take advantage of monolingual data (Baheti et al., 2017) and verify its effectiveness (Section 5).

Based on the new evaluation scheme we present, we further propose to learn a model for this ranking task using discriminative training. This model, as opposed to LMs, no longer depends on estimating next-word probabilities for the entire vocabulary. Instead, during training the model is presented with positive and negative examples and is encouraged to prefer the positive examples over the negative ones. This model gives significantly better results (Section 6).

Our contributions in this work are four-fold: (a) We propose a new, vocabulary-size independent evaluation scheme for LM in general, motivated by ASR. This evaluation scheme is ranking-based and also suits CS LM; (b) We describe a process for automatically creating such datasets, and provide a concrete evaluation dataset for English-Spanish CS (SpanglishEval); (c) We present a model for this new ranking task that uses discriminative training and is decoupled from probability estimation; this model surpasses the standard LMs; (d) We verify the effectiveness of pretraining CS LM for this ranking task with monolingual data and show significant improvement over various baselines. The CS LM evaluation dataset and the code for the model are available at gonenhila/codeswitching-lm.

2 Background


Code-switching (CS) is defined as the use of two languages within the same discourse (Poplack, 1980). The mixing of different languages at various levels has been widely studied from social and linguistic points of view (Auer, 1999; Muysken, 2000; Bullock and Toribio, 2009), and has also started getting attention in the NLP community in the past few years (Solorio and Liu, 2008; Adel et al., 2013a; Cotterell et al., 2014).

Below is an example of code-switching between Spanish and English (taken from the Bangor Miami corpus described in Section 7). Translation to English follows:

  • that es su tío that has lived with him like I don’t know how like ya several years…
    that his uncle who has lived with him like, I don’t know how, like several years already…

Code-switching is becoming increasingly popular, mainly among bilingual communities. Yet, one of the main challenges when dealing with CS is the limited data and its unique nature: it is usually found in non-standard platforms such as spoken data and social media, and accessing it is not trivial (Çetinoğlu et al., 2016).

Shortcomings of Perplexity-based Evaluation for LM

The most common evaluation measure of language models is perplexity. Given a language model M and a test sequence of words $w_1,\ldots,w_N$, the perplexity of M over the sequence is defined as:

$$\mathrm{perp}_M(w_1,\ldots,w_N) = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 M(w_i)}$$

where $M(w_i)$ is the probability the model assigns to $w_i$.
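As a small illustration, perplexity can be computed as the inverse geometric mean of the per-token probabilities. The sketch below assumes the per-token probabilities have already been obtained from some model; the function name and the base-2 logarithm are choices made here for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability a model assigned to each token."""
    n = len(token_probs)
    # Average negative log2-probability per token ...
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / n
    # ... exponentiated back to the probability scale.
    return 2.0 ** avg_neg_log2

# A model that spreads its mass uniformly over 4 choices at every step.
print(perplexity([0.25] * 8))  # 4.0
```

Note that the function only ever sees probabilities of the gold tokens: nothing in it measures how much probability the model gives to implausible sentences.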

A better model is expected to give higher probability to sentences in the test set, that is, lower perplexity. However, this measure is not always well aligned with the quality of a language model. For example, Tran et al. (2018) show that RNNs capture long-distance syntactic dependencies better than attention-only LMs, despite having higher (worse) perplexity. Similarly, better perplexities often do not translate to better word-error-rate (WER) scores in an ASR system (Huang et al., 2018).

This highlights a shortcoming of perplexity-based evaluation: the model is rewarded for assigning high probability to gold sentences, but is not directly penalized for assigning high probability to highly implausible sentences. When used in a speech recognition setup, the LM is expected to do just that: score correct sentences above incorrect hypotheses.

Another shortcoming of perplexity-based evaluation is that it requires the compared models to have the same support (in other words, the same vocabulary). Simply adding words to the vocabulary, even if no additional change is made to the model, will likely result in higher perplexity for the same dataset. For the same reason, it is also problematic to compare perplexities of word-based LMs and character-based LMs.

Problem with WER-based evaluation

Evaluating LM models based on the final WER of an ASR system side-steps these two issues: it evaluates the LM on incorrect sentences, and seamlessly compares LMs with different support. However, this makes the evaluation procedure highly dependent on a particular ASR system. This is highly undesirable, as ASR systems are hard to set up and tune and are not standardized. This both conflates the LM performance with performance of other aspects of the ASR system, and makes it hard to replicate the evaluation procedure and fairly compare results across different publications. Indeed, as discussed in Section 10, most previous works on CS LM use an ASR system as part of their evaluation setup, and none of them compares to any other work. Moreover, no standard benchmark or evaluation setup exists for CS LM.

3 An Alternative Evaluation Method

We propose an evaluation technique for language models that is ASR-system-independent, takes incorrect sentences into account, and allows comparing LMs with different support or OOV handling techniques.

We seek a method that meets the following requirements: (1) Prefers language models that prioritize correct sentences over incorrect ones; (2) Does not depend on the support (vocabulary) of the language model; (3) Is independent of and not coupled with a speech recognition system.

To meet these criteria, we propose to assemble a dataset comprising gold sentences, where each gold sentence is associated with a set of alternative sentences, and require the LM to identify the gold sentence in each set. The alternative sentences in a set should be related to the gold sentence. Following the ASR motivation, we take the alternative set to contain sentences that are phonetically related (similar-sounding) to the gold sentence. This setup simulates the task an LM is intended to perform as a part of an ASR system: given several candidates, all originating from the same acoustic signal, the LM should choose the correct sentence over the others.

A New Evaluation metric

Given this setup, we propose to use the accuracy metric: the percentage of sets in which the LM (or other method) successfully identified the gold sentence among the alternatives (see footnote 2). The natural way of using an LM for identifying the gold sentences is assigning a probability to each sentence in the set, and choosing the one with the highest probability. Yet, the scoring mechanism is independent of perplexity, and addresses the two deficiencies of perplexity-based evaluation discussed above.

Footnote 2: We chose accuracy over WER as the default metric since in our case, WER should be treated with caution: the alternatives created might be “too close” to the gold sentence (e.g. when only a small fraction of the gold sentence is sampled and replaced) or “too far” (e.g. a Spanish alternative for an English sentence), thus affecting the WER.
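The accuracy metric can be sketched in a few lines. This is only an illustration: the scorer below is a made-up stand-in (it counts words found in a tiny hand-picked word list), whereas in practice the score would come from an LM or a discriminative model. Counting a set as correct only when the gold sentence scores strictly higher than every alternative is also an assumption made here.

```python
def ranking_accuracy(sets, score):
    """Percentage of sets in which the gold sentence gets the strictly highest score.

    sets:  list of (gold_sentence, list_of_alternative_sentences)
    score: callable mapping a sentence to a number (higher is better)
    """
    correct = 0
    for gold, alternatives in sets:
        gold_score = score(gold)
        if all(gold_score > score(alt) for alt in alternatives):
            correct += 1
    return 100.0 * correct / len(sets)

# Hypothetical toy scorer: number of words that appear in a small known-word list.
KNOWN = {"we", "have", "a", "bitter", "lung",
         "en", "mi", "casa", "tengo", "tanto", "huevo", "duro"}

def toy_score(sentence):
    return sum(1 for word in sentence.split() if word in KNOWN)

toy_sets = [
    ("en mi casa tengo tanto huevo duro", ["en mi casa tengo tanto a futuro"]),
    ("we have a peter lang", ["we have a bitter lung"]),
]
print(ranking_accuracy(toy_sets, toy_score))  # 50.0
```

Because the metric only compares scores within a set, any scoring function can be plugged in, regardless of its vocabulary or whether its scores are probabilities.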

Our proposed evaluation method is similar in concept to the NMT evaluation proposed by Sennrich (2016). There, the core idea is to measure whether a reference translation is more probable under an NMT model than a contrastive translation which introduces a specific type of error.

4 Evaluation Dataset

We now turn to construct such an evaluation dataset. One method of obtaining sentence-sets is feeding acoustic waves into an ASR system and tracking the resulting lattices. However, this requires access to the audio signals, as well as a trained system for the relevant languages. We propose instead a generation method that does not require access to an ASR system.

The dataset we create is designed for evaluating English-Spanish CS LMs, but the creation process can be applied to other languages and language pairs (see footnote 3).

Footnote 3: The requirements for creating an evaluation dataset for a language pair L1, L2 are access to code-switched sentences where each word is tagged with a language ID (language ID is not mandatory, but helps in cases in which a word is found in the vocabulary of both languages and is pronounced differently), compatible pronunciation dictionaries, and unigram word probabilities for each of the languages.

The process of alternative sentence creation is as follows: (1) Convert a sequence of language-tagged words (either CS or monolingual) into a sequence of the matching phonemes (using pronunciation dictionaries); (2) Decode the sequence of phonemes into new sentences, which include words from either language (possibly both); (3) When decoding, allow minor changes in the sequence of phonemes to accommodate the differences between the languages.

These steps can be easily implemented using composition of finite-state transducers (FSTs) (see footnote 4).

Footnote 4: Specifically, we compose the following FSTs: (a) an FST for converting a sentence into a sequence of phonemes, (b) an FST that allows minor changes in the phoneme sequence, (c) an FST for decoding a sequence of phonemes into a sentence, the inverse of (a).
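To make the three steps concrete, here is a toy sketch that replaces FST composition with plain recursion. Everything in it is an assumption for illustration: the pronunciation dictionary uses made-up phoneme symbols, the set of interchangeable phonemes is invented, and only substitutions (no insertions or deletions) are allowed. It is not the implementation used for the dataset.

```python
# Toy pronunciation dictionary with made-up phoneme symbols (illustration only).
PRON = {
    "for":   ("f", "o", "r"),
    "four":  ("f", "o", "r"),
    "seven": ("s", "e", "v", "e", "n"),
    "saben": ("s", "a", "b", "e", "n"),
}

# Hypothetical phoneme pairs treated as interchangeable ("minor changes").
CONFUSABLE = {("e", "a"), ("a", "e"), ("v", "b"), ("b", "v")}

def to_phonemes(words):
    """Step 1: map a word sequence to its phoneme sequence."""
    return tuple(p for w in words for p in PRON[w])

def similar(span, pron):
    """Step 3: a span matches a pronunciation up to confusable substitutions."""
    return len(span) == len(pron) and all(
        a == b or (a, b) in CONFUSABLE for a, b in zip(span, pron))

def decode(phonemes):
    """Step 2: enumerate all word sequences whose pronunciations cover `phonemes`."""
    if not phonemes:
        return [[]]
    results = []
    for word, pron in PRON.items():
        if similar(phonemes[:len(pron)], pron):
            for rest in decode(phonemes[len(pron):]):
                results.append([word] + rest)
    return results

gold = ["four", "seven"]
alternatives = [" ".join(words) for words in decode(to_phonemes(gold))]
print(alternatives)  # ['for seven', 'for saben', 'four seven', 'four saben']
```

The gold sentence itself is among the decoded outputs, and the remaining ones are similar-sounding alternatives that may mix words from both languages. A weighted FST formulation additionally gives n-best lists, which exhaustive enumeration like this does not scale to.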

For each gold sentence (which can be either code-switched or monolingual) we create alternatives of all three types: (a) code-switched sentences, (b) sentences in L1 only, (c) sentences in L2 only.

We created such a dataset for English-Spanish with gold sentences from the Bangor Miami Corpus (Section 7). Figure 1 shows two examples from the dataset. In each, the gold sentence is followed by a single code-switched alternative, a single English alternative, and a single Spanish one (a subset of the full set).

Example 1:
  Gold:   pero i thought ella fue like very good
  Alt-CS: pero i thought a llevó alike very good
  Alt-EN: peru i thought a of alike very could
  Alt-SP: pero azote la fue la que rico

Example 2:
  Gold:   vamos a ser juntos twenty four seven
  Alt-CS: vamos a ser juntos when de for saben
  Alt-EN: follows a santos twenty for seven
  Alt-SP: vamos hacer junto sentí for saben

Figure 1: Examples from SpanglishEval – the English-Spanish evaluation dataset. The first sentence in each example is the gold sentence, followed by a generated code-switched alternative, a generated English alternative, and a generated Spanish one. Blue (normal) marks English, while red (italic) marks Spanish.

4.1 Technical details

When creating code-switched alternatives, we want to encourage the creation of sentences that include both languages, and that differ from each other. This is done with scores determined by some heuristics, such as preferring sentences that include more words from the language that was less dominant in the original one and vice versa. As part of this we also try to avoid noise from the addition of short frequent words, by encouraging the average word length to be high (see footnote 5). We create 1000-best alternatives from the FST, re-score them according to the heuristic and keep the top 10.

Footnote 5: The implementation of these heuristics is part of our code that is available online.

We discard sets in which the gold sentence has fewer than three words (excluding punctuation), and also sets with fewer than 5 alternatives in one of the three types.
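The filtering rule can be written down directly. The sketch below assumes whitespace tokenization, treats a token as punctuation when all of its characters are ASCII punctuation, and uses hypothetical type keys ("cs", "en", "es"); none of these details are specified in the text.

```python
import string

def keep_set(gold, alternatives_by_type, min_words=3, min_alternatives=5):
    """Return True if a sentence set passes the two filters described above."""
    # Count gold tokens that are not pure punctuation.
    words = [tok for tok in gold.split()
             if not all(ch in string.punctuation for ch in tok)]
    if len(words) < min_words:
        return False
    # Every alternative type must contribute enough sentences.
    return all(len(alts) >= min_alternatives
               for alts in alternatives_by_type.values())

five = ["alt"] * 5
print(keep_set("we have a peter lang .", {"cs": five, "en": five, "es": five}))      # True
print(keep_set("oh no .", {"cs": five, "en": five, "es": five}))                     # False
print(keep_set("we have a peter lang .", {"cs": five, "en": five, "es": five[:4]}))  # False
```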

We randomly choose 250 sets in which the gold sentence is code-switched, and 750 sets in which the gold sentence is monolingual, both for the development set and for the test set. This percentage of CS sentences is higher than in the underlying corpus in order to aid a meaningful comparison between the models regarding their ability to prefer gold CS sentences over alternatives. The statistics of the dataset are detailed in Table 2.

Further details regarding the implementation can be found in the Appendix.

                               dev set    test set
no. of sets                    1000       1000
no. of sentences               30884      30811
no. of CS alternatives         10000      10000
no. of English alternatives    9999       9993
no. of Spanish alternatives    9885       9818
no. of CS gold sentences       250/1000   250/1000

Table 2: Statistics of the dataset.

5 Using Monolingual Data for CS LM

Data for code-switching is relatively scarce, while monolingual data is easy to obtain. The question then is how do we efficiently incorporate monolingual data when training a CS LM?

We use an effective training protocol (FineTuned) for incorporating monolingual data into the language model, similar to the best protocol introduced by Baheti et al. (2017). We first pre-train a model with monolingual sentences from both English and Spanish (with a shared vocabulary for both languages). This essentially trains two monolingual models, one for English and one for Spanish, but with full sharing of parameters. Note that in this pre-training phase, the model is not exposed to any code-switched sentence.

We then use the small amount of available code-switched data to further train the model, making it familiar with code-switching examples that mix the two languages. This fine-tuning procedure enables the model to learn to correctly combine the two languages in a single sentence.

We show in Section 8 that adding the CS data only at the end, in the described manner, works substantially better than several alternatives, verifying the results from (Baheti et al., 2017).

6 Discriminative Training

Our new evaluation method gives rise to training models that are designed for the ranking task. As our main goal is to choose a single best sentence out of a set of candidate sentences, we can focus on training models that score whole sentences with discriminative training, rather than using the standard probability setting of LMs. A discriminative approach frees us from the burden of estimating probability distributions over all the words in the vocabulary and allows us to simply create representations of sentences and match them with scores.

Using negative examples is not straightforward in LM training, but is essential for speech recognition, where the language model needs to distinguish between “good” and “bad” sentences. Our discriminative training is based on the idea of using both positive and negative examples during training. The training data we use is similar in nature to that of our test data: sets of sentences in which only a single one is a genuine example collected from a real CS corpus, while all the others are synthetically created.

During training, we aim to assign the gold sentences with a higher score than the scores of the others. For each sentence, we require the difference between its score and the score of the gold sentence to be as large as its WER with respect to the gold sentence. This way, the farther a sentence is from the gold one, the lower its score is.

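Formally, let $s_1$ be the gold sentence, and $s_2,\ldots,s_m$ be the other sentences in that set. The loss of the set is the sum of losses over all sentences, except for the gold one:

$$\sum_{i=2}^{m}\max\bigl(0,\ \mathrm{WER}(s_1,s_i) - [\mathrm{score}(s_1) - \mathrm{score}(s_i)]\bigr)$$

where $\mathrm{score}(s_i)$ is computed by the multiplication of the representation of $s_i$ and a learned vector $w$:

$$\mathrm{score}(s_i) = w\cdot\mathrm{repr}(s_i)$$

A sentence is represented with its BiLSTM representation: the concatenation of the final states of forward and backward LSTMs. Formally, a sentence $s=w_1,\ldots,w_n$ is represented as follows:

$$\mathrm{repr}(s) = \mathrm{LSTM}_f(w_1,\ldots,w_n)\circ\mathrm{LSTM}_b(w_n,\ldots,w_1)$$

where $\circ$ is the concatenation operation.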
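The margin requirement described in this section (the gap between the gold score and another sentence's score should be at least that sentence's WER) can be sketched as a hinge-style loss. This is one reading of the description rather than the released code: the hinge form, the use of WER as a fraction of the gold length, and the stand-in `score` callables are assumptions. In the actual model the score comes from the BiLSTM representation and the learned vector.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # delete a reference word
                           cur[j - 1] + 1,            # insert a hypothesis word
                           prev[j - 1] + (r != h)))   # substitute, or match at no cost
        prev = cur
    return prev[-1]

def wer(gold, other):
    """Word error rate of `other` with respect to `gold`, as a fraction."""
    ref, hyp = gold.split(), other.split()
    return word_edit_distance(ref, hyp) / len(ref)

def set_loss(gold, alternatives, score):
    """Sum of hinge losses over the non-gold sentences, with WER as the margin."""
    gold_score = score(gold)
    return sum(max(0.0, wer(gold, alt) - (gold_score - score(alt)))
               for alt in alternatives)

gold = "we have a peter lang"
alts = ["we have a bitter lung"]
print(wer(gold, alts[0]))                                         # 0.4
print(set_loss(gold, alts, lambda s: 0.0))                        # 0.4
print(set_loss(gold, alts, lambda s: 1.0 if s == gold else 0.0))  # 0.0
```

With a constant scorer the loss equals the WER of each alternative; once the gold sentence is ahead by at least that margin, the alternative contributes nothing, so alternatives farther from the gold sentence are pushed further down.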

Incorporating monolingual data

In order to use monolingual data in the case of discriminative training, we follow the same protocol: as a first step, we create alternative sentences for each monolingual sentence from the monolingual corpora. We train a model using this data, and as a next step, we fine-tune this model with the sets of sentences that are created from the CS corpus.

7 Empirical Experiments

Models and Baselines

We report results on two models that use discriminative training: CS-only-discriminative only trains on data that is created based on the small code-switched corpus, while Fine-Tuned-discriminative first trains on data created based on the monolingual corpora and is then fine-tuned using the data created from the code-switched corpus.

We compare our models to several baselines, all of which use standard LM training: EnglishOnly-LM and SpanishOnly-LM train on the monolingual data only. Two additional models train on a combination of the code-switched corpus and the two monolingual corpora: the first (All:Shuffled-LM) trains on all sentences (monolingual and code-switched) presented in a random order. The second (All:CS-last-LM) trains each epoch on the monolingual datasets followed by a pass on the small code-switched corpus. The models CS-only-LM and Fine-Tuned-LM are the equivalents of CS-only-discriminative and Fine-Tuned-discriminative but with standard LM training.
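The protocols that use both data sources differ only in the order in which sentences are presented. The sketch below lists the data seen in each training pass under each protocol. It is a simplification under stated assumptions: the protocol strings, the `epochs` parameter (used for both phases of the fine-tuned protocol), and the absence of any stopping criterion are choices made here, not details given in the text.

```python
import random

def training_passes(protocol, mono, cs, epochs=2, seed=0):
    """Return the list of sentences seen in each training pass, in order."""
    rng = random.Random(seed)
    if protocol == "CS-only":
        return [list(cs) for _ in range(epochs)]
    if protocol == "All:Shuffled":
        passes = []
        for _ in range(epochs):
            data = list(mono) + list(cs)
            rng.shuffle(data)  # monolingual and CS sentences interleaved at random
            passes.append(data)
        return passes
    if protocol == "All:CS-last":
        # Every epoch: all monolingual data first, then the small CS corpus.
        return [list(mono) + list(cs) for _ in range(epochs)]
    if protocol == "Fine-Tuned":
        # Phase 1: monolingual data only; phase 2: CS data only.
        return ([list(mono) for _ in range(epochs)] +
                [list(cs) for _ in range(epochs)])
    raise ValueError(f"unknown protocol: {protocol}")

mono, cs = ["m1", "m2"], ["c1"]
print(training_passes("Fine-Tuned", mono, cs))
# [['m1', 'm2'], ['m1', 'm2'], ['c1'], ['c1']]
print(training_passes("All:CS-last", mono, cs))
# [['m1', 'm2', 'c1'], ['m1', 'm2', 'c1']]
```

The key contrast is that in the fine-tuned schedule the model never returns to monolingual data once it has started seeing code-switched sentences, whereas the two "All" schedules keep mixing the sources in every epoch.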

Code-switching corpus

We use the Bangor Miami Corpus, consisting of transcripts of conversations by Spanish speakers in Florida, all of whom are bilingual in English (footnote 6: php?c=miami).

We split the sentences (45,621 in total) into train/dev/test with a ratio of 60/20/20, respectively, and evaluate perplexity on the dev and test sets.

The dataset described in Section 4 is based on sentences from the dev and test sets, and serves as our main evaluation method.

Monolingual corpora

The monolingual corpora used for training the English and Spanish monolingual models are taken from the OpenSubtitles2018 corpus (Tiedemann, 2009) of subtitles of movies and TV series (footnote 7: OpenSubtitles2018.php).

We use 1M lines from each language, with a split of 60/20/20 for train/dev/test, respectively. The test set is reserved for future use. For discriminative training we use 1/6 of the monolingual training data (as creating the data results in roughly 30 sentences per gold sentence).

Additional details on preprocessing and statistics of the data can be found in the Appendix.


We implement our language models in DyNet (Neubig et al., 2017). Our basic configuration is similar to that of Gal and Ghahramani (2016) with minor changes. It has a standard architecture of a 2-layer LSTM followed by a softmax layer, and the optimization is done with SGD (see Appendix for details).

Tuning of hyper-parameters was done on the PTB corpus, in order to be on par with state-of-the-art models such as that of Merity et al. (2017). We then trained the same LM on our CS corpus with no additional tuning and got a perplexity of 44.06, better than the 52.99 of Merity et al. (2017) when using their default parameters on the CS corpus (footnote 8: awd-lstm-lm). We thus make no further tuning of hyper-parameters.

When changing to discriminative training, we perform minimal necessary changes: discarding weight decay and reducing the learning rate to 1. The weight vector in the discriminative setting is learned jointly with the other parameters of the network.

As done in previous work, in order to be able to give a reliable probability to every next token in the test set, we include all the tokens from the test set in our vocabulary and we do not use the “<unk>” token. Only the tokens that appear in the train set are actually trained. Handling of unknown tokens becomes easier with discriminative training, as scores are assigned to full sentences. However, we leave this issue for future work.

8 Results

The results of the different models are presented in Table 3. For each model we report both perplexity and accuracy (except for discriminative training, where perplexity is not valid), where each of them is reported according to the best performing model on that measure (on the dev set). We also report the WER of all models, which correlates perfectly with the accuracy measure.

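Model                        dev                       test
                             perp     acc    wer       perp     acc    wer
(row label lost)             329.68   26.6   30.47     322.26   25.1   29.62
(row label lost)             320.92   29.3   32.02     314.04   30.3   32.51
(row label lost)             76.64    47.8   14.56     76.97    49.2   14.13
(row label lost)             68.00    51.8   13.64     68.72    51.4   13.89
CS-only-LM                   43.20    60.7   12.60     43.42    57.9   12.18
CS-only+vocab-LM             45.61    61.0   12.56     45.79    58.8   12.49
Fine-Tuned-LM                39.76    66.9   10.71     40.11    65.4   10.17
CS-only-discriminative       -        72.0   6.35      -        70.5   6.70
Fine-Tuned-discriminative    -        74.2   5.85      -        75.5   5.59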

Table 3: Results on the dev set and on the test set. “perp” stands for perplexity, “acc” stands for accuracy (in percents), and “wer” stands for word-error-rate.

8.1 Using monolingual data

As mentioned above, both in standard LM and in discriminative training, using monolingual data in a correct manner (Fine-Tuned-LM and Fine-Tuned-discriminative) significantly improves over using solely the code-switching data. In standard LM, adding monolingual data results in an improvement of 7.5 points (improving from an accuracy of 57.9% to 65.4%), and in the discriminative training it results in an improvement of 5 points (improving from an accuracy of 70.5% to 75.5%). Even though both All:Shuffled-LM and All:CS-last-LM use the same data as the Fine-Tuned-LM model, they perform even worse than CS-only-LM, which does not use the monolingual data at all. This emphasizes that the manner of integration of the monolingual data has a very strong influence.

Note that the Fine-Tuned-LM model also improves perplexity. As perplexity is significantly affected by the size of the vocabulary, and to ensure fair comparison, we also add the additional vocabulary from the monolingual data to CS-only-LM (CS-only+vocab-LM). Extending the vocabulary without training those additional words results in a 2.37-point loss on the perplexity measure, while our evaluation metric (accuracy) stays essentially the same. This demonstrates the utility of our proposed evaluation compared to using perplexity, allowing it to fairly compare models with different vocabulary sizes.

In order to examine the contribution of the monolingual data, we also experimented with subsets of the code-switching training data. Table 4 depicts the results when using subsets of the CS training data with discriminative training. The less code-switching data we use, the more significant the effect of using the monolingual data: we gain 8.8, 6.5, 4.2 and 5 more accuracy points with 25%, 50%, 75% and 100% of the data, respectively. In the case of 25% of the data, the Fine-Tuned-discriminative model improves over CS-only-discriminative by a relative 17%.

8.2 Standard LMs vs. Discriminative Training

In the standard training setting, the Fine-Tuned-LM baseline is the strongest baseline, outperforming all others with an accuracy of 65.4%. Similarly, when using discriminative training, the Fine-Tuned-discriminative model outperforms the CS-only-discriminative model. Note that using discriminative training, even with no additional monolingual data, leads to better performance than that of the best language model: the CS-only-discriminative model achieves an accuracy of 70.5%, 5.1 points more than the accuracy of the Fine-Tuned-LM model. We gain further improvement by adding monolingual data and get an even higher accuracy of 75.5%, which is 10.1 points higher than the best language model.

                             25% train      50% train      75% train      full train
                             dev    test    dev    test    dev    test    dev    test
CS-only-discriminative       58.4   58.9    65.2   63.6    70.8   68.8    72.0   70.5
Fine-Tuned-discriminative    68.4   67.7    71.9   70.1    72.8   73.0    74.2   75.5

Table 4: Results on the dev set and on the test set using discriminative training with only subsets of the code-switched data.


Our evaluation setup in the discriminative training case is not ideal: the negative samples in both the train and test sets are artificially created by the same mechanism. Thus, high performance on the test set in the discriminative case may result from “leakage” in which the model learns to rely on idiosyncrasies and artifacts of the data generation mechanism. As such, these results may not transfer as is to a real-world ASR scenario, and should be considered as an optimistic estimate of the actual gains (see footnote 9). Nevertheless, we do believe that the accuracy improvements of the discriminative setup are real, and should translate to improvements also in the real-world scenario.

Footnote 9: An ideal evaluation setup will use an artificial dataset for training, and a dataset obtained from an acoustic model of a code-switched ASR system for testing. However, such an evaluation setup requires access to a high-quality code-switched acoustic model, which we neither possess nor have the means to obtain. Furthermore, it would also tie the evaluation to a specific ASR system, which we aim to avoid.

To alleviate the concerns to some extent, we consider the case in which the leakage is in the induced negative words distribution (see footnote 10). We re-train the discriminative model using a BOW representation of the sentence. This results in accuracies of 52.1% and 47.4% for dev and test, respectively, far below those of the basic model. A model that leaks in word distribution would score much higher in this evaluation, indicating that the improvement is likely not due to this form of leakage, and that most of the improvement of the discriminative training is likely to translate to a real ASR scenario.

Footnote 10: As our generation process, much like an acoustic model, makes only local word replacements, the choice of replacement words is the most salient difference from a real acoustic model, and has the largest leakage potential.

9 Analysis

Table 5 breaks down the results of the different models according to two conditions: when the gold sentence is code-switched, and when the gold sentence is monolingual.

As expected, the Fine-Tuned-discriminative model is able to prioritize the gold sentence better than all other models, under both conditions. The improvement we get is most significant when the gold sentence is CS: in those cases we get a dramatic improvement of 27.73 accuracy points (a relative improvement of 58%). Note that for the standard LMs, the cases in which the gold sentence is CS are much harder, and they perform badly on this portion of the test set. However, using discriminative learning enables us to get improved performance on both portions of the test set and to get comparable results on both parts.

                             dev              test
                             CS      mono     CS      mono
CS-only-LM                   45.20   65.87    43.20   62.80
Fine-Tuned-LM                49.60   72.67    47.60   71.33
CS-only-discriminative       75.60   70.40    70.80   70.53
Fine-Tuned-discriminative    70.80   74.40    75.33   75.87

Table 5: Accuracy on the dev set and on the test set, according to the type of the gold sentence in the set: code-switched (CS) vs. monolingual (mono).

A closer examination of the mistakes of the Fine-Tuned-LM and Fine-Tuned-discriminative models reveals the superiority of the discriminative training in various cases. Table 6 presents several examples in which Fine-Tuned-LM prefers a wrong sentence whereas Fine-Tuned-discriminative identifies the gold one. In examples 1–4 the gold sentence was code-switched but Fine-Tuned-LM forced an improbable monolingual one. Examples 5 and 6 show mistakes in monolingual sentences.

While discriminative training is significantly better than the standard LM training, it can still be improved quite a bit. Table 7 lists some of the mistakes of the Fine-Tuned-discriminative model: in examples 1, 2 and 3, the gold sentence was code-switched but the model preferred a monolingual one; in example 4, the model prefers a wrong CS sentence over the gold monolingual one; and in examples 5 and 6, the model makes mistakes in monolingual sentences.

no. | gold type | choice type | Gold sentence | Choice of Fine-Tuned-LM model
1 | CS | mono | and in front of everybody me saltó . | and in front of everybody muscle too .
2 | CS | mono | porque the sewer system has them in there porque . | porque de ser sistemas de mil de porque .
3 | CS | mono | entonces what i had done is gone ahead and printed it out . | and all says what i had done is gone ahead and printed it out .
4 | CS | mono | es que creo que quedan como novecientos oportunidades para beta testers . | es que creo que que del como novecientos oportunidad esperaba de estar .
5 | mono | mono | we we stop beyond getting too cruel . | we we stop be and getting to cruel .
6 | mono | mono | en mi casa tengo tanto huevo duro . | en mi casa tengo tanto a futuro .

Table 6: Examples of sentences the Fine-Tuned-discriminative model identifies correctly while the Fine-Tuned-LM model does not.
no. | gold type | choice type | Gold sentence | Choice of Fine-Tuned-discriminative model
1 | CS | mono | que by the way se vino ilegal . | que va de esa vino ilegal .
2 | CS | mono | son un website there . | so noon website there .
3 | CS | mono | no i have never felt nothing close to the espíritu santo never . | no i have never felt nothing close to the a spirits on to never .
4 | mono | CS | Tú sabes que que el cuerpo empieza a sentirse raro . | Tú sabes que que el cuerpo empieza sentir ser arrow .
5 | mono | mono | bueno ya son las las y cuarenta casi casi . | bueno ya son las lassie cuarenta que si casi .
6 | mono | mono | we have a peter lang . | we have a bitter lung .

Table 7: Examples of sentences that the Fine-Tuned-discriminative model fails to identify.

10 Related Work


Most prior work on CS focused on Language Identification (LID) (Solorio et al., 2014; Molina et al., 2016) and POS tagging (Solorio and Liu, 2008; Vyas et al., 2014; Ghosh et al., 2016; Barman et al., 2016). In this work we focus on language modeling, which we find more challenging.


Language models have traditionally been created using the n-gram approach (Brown et al., 1992; Chen and Goodman, 1996). Recently, neural models have gained more popularity, both using a feed-forward network for an n-gram language model (Bengio et al., 2003; Morin and Bengio, 2005) and using recurrent architectures that are fed with the sequence of words, one word at a time (Mikolov et al., 2010; Zaremba et al., 2014; Gal and Ghahramani, 2016; Foerster et al., 2017; Melis et al., 2017).

Some work has also been done on optimizing LMs for ASR purposes, using discriminative training. Kuo et al. (2002), Roark et al. (2007) and Dikici et al. (2013) all improve LM for ASR by maximizing the probability of the correct candidates. All of them use candidates of ASR systems as “negative” examples and train n-gram LMs or use linear models for classifying candidates. A closer approach to ours is used by Huang et al. (2018). There, they optimize an RNNLM with a discriminative loss as part of training an ASR system. Unlike our proposed model, they still use the standard setting of LM. In addition, their training is coupled with an end-to-end ASR system. In particular, as in previous works, the “negative” examples they use are candidates of that ASR system.

LM for CS

Some work has also been done specifically on LM for code-switching. Chan et al. (2009) compare different n-gram language models, and Vu et al. (2012) suggest improving language modeling by generating artificial code-switched text. Li and Fung (2012) propose a language model that incorporates a syntactic constraint and combine both a code-switched LM and a monolingual LM in the decoding process of an ASR system. Later on, they also suggest incorporating a different syntactic constraint and learning the language model from bilingual data using it (Li and Fung, 2014). Pratapa et al. (2018) also use a syntactic constraint to improve LM by augmenting the training data with synthetically created CS sentences in which this constraint is not violated. Adel et al. (2013a) introduce an RNN-based LM, where the output layer is factorized into languages, and POS tags are added to the input. In Adel et al. (2013b), they further investigate an n-gram based factorized LM where each word in the input is concatenated with its POS tag and its language identifier. Adel et al. (2014, 2015) also investigate the influence of syntactic and semantic features in the framework of factorized language models. Sreeram and Sinha (2017) also use a factorized LM with the addition of POS tags. Baheti et al. (2017) explore several different training protocols for CS LM and find that fine-tuning with CS data after pretraining on monolingual data works best. Finally, another line of work suggests using a dual language model, where two monolingual LMs are combined by a probabilistic model (Garg et al., 2017, 2018).

No standard benchmark or evaluation setup exists for CS LM, and most previous works use an ASR system as part of their evaluation setup. This makes comparison of methods very challenging. Indeed, all the works listed above use different setups and don’t compare to each other, even for works coming from the same group. We believe the evaluation setup we propose in this work and our English-Spanish dataset, which are easy to replicate and decoupled from an ASR system, are a needed step towards meaningful comparison of CS LM approaches.

11 Conclusions and Future Work

We consider the topic of language modeling for code-switched data. We propose a novel ranking-based evaluation method for language models, motivated by speech recognition, and create an evaluation dataset for English-Spanish code-switching LM (SpanglishEval).

We further present a discriminative training procedure for this ranking task that is intended for ASR purposes. This training procedure is not bound to probability distributions, and uses both positive and negative training sentences. This significantly improves performance. Such discriminative training can also be applied to monolingual data.

Finally, we verify the effectiveness of the training protocol for CS LM presented in Baheti et al. (2017): pre-training on a mix of monolingual sentences, followed by fine-tuning on a code-switched dataset. This protocol significantly outperforms various baselines. Moreover, we show that the less code-switched training data we use, the more effective it is to incorporate the monolingual data.

Our proposed evaluation framework and dataset will facilitate future work by providing the ability to meaningfully compare the performance of different methods to each other, an ability that was sorely missing in previous work.


Acknowledgments

The work was supported by The Israeli Science Foundation (grant number 1555/15).


References

  • Adel et al. (2014) Heike Adel, Katrin Kirchhoff, Dominic Telaar, Ngoc Thang Vu, Tim Schlippe, and Tanja Schultz. 2014. Features for factored language models for code-switching speech. In Proceedings of SLTU.
  • Adel et al. (2015) Heike Adel, Ngoc Thang Vu, Katrin Kirchhoff, Dominic Telaar, and Tanja Schultz. 2015. Syntactic and semantic features for code-switching factored language models. IEEE Transactions on Audio, Speech, and Language Processing, 23(3).
  • Adel et al. (2013a) Heike Adel, Ngoc Thang Vu, Franziska Kraus, Tim Schlippe, Haizhou Li, and Tanja Schultz. 2013a. Recurrent neural network language modeling for code switching conversational speech. In Proceedings of ICASSP, IEEE International Conference.
  • Adel et al. (2013b) Heike Adel, Ngoc Thang Vu, and Tanja Schultz. 2013b. Combination of recurrent neural networks and factored language models for code-switching language modeling. In Proceedings of ACL.
  • Auer (1999) Peter Auer. 1999. From codeswitching via language mixing to fused lects: Toward a dynamic typology of bilingual speech. International journal of bilingualism, 3(4):309–332.
  • Baheti et al. (2017) Ashutosh Baheti, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. 2017. Curriculum design for code-switching: Experiments with language identification and language modeling with deep neural networks. In Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017).
  • Barman et al. (2016) Utsab Barman, Joachim Wagner, and Jennifer Foster. 2016. Part-of-speech tagging of code-mixed social media content: Pipeline, stacking and joint modelling. In Proceedings of the Second Workshop on Computational Approaches to Code Switching.
  • Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3:1137–1155.
  • Brown et al. (1992) Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.
  • Bullock and Toribio (2009) Barbara E. Bullock and Almeida Jacqueline Toribio. 2009. The Cambridge handbook of linguistic code-switching. Cambridge University Press Cambridge.
  • Çetinoğlu et al. (2016) Özlem Çetinoğlu, Sarah Schulz, and Ngoc Thang Vu. 2016. Challenges of computational processing of code-switching. In Proceedings of the 2nd Workshop on Computational Approaches to Linguistic Code Switching, EMNLP.
  • Chan et al. (2009) Joyce YC Chan, Houwei Cao, PC Ching, and Tan Lee. 2009. Automatic recognition of cantonese-english code-mixing speech. Computational Linguistics and Chinese Language Processing, 14(3):281–304.
  • Chen and Goodman (1996) Stanley F Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of ACL, pages 310–318.
  • Cotterell et al. (2014) Ryan Cotterell, Adithya Renduchintala, Naomi Saphra, and Chris Callison-Burch. 2014. An algerian arabic-french code-switched corpus. In Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools Workshop Programme.
  • Dikici et al. (2013) Erinç Dikici, Murat Semerci, Murat Saraclar, and Ethem Alpaydin. 2013. Classification and ranking approaches to discriminative language modeling for asr. IEEE Transactions on Audio, Speech, and Language Processing, 21(2):291–300.
  • Foerster et al. (2017) Jakob N Foerster, Justin Gilmer, Jan Chorowski, Jascha Sohl-Dickstein, and David Sussillo. 2017. Intelligible language modeling with input switched affine networks. arXiv preprint arXiv:1611.09434.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of NIPS.
  • Garg et al. (2017) Saurabh Garg, Tanmay Parekh, and Preethi Jyothi. 2017. Dual language models for code switched speech recognition. arXiv preprint arXiv:1711.01048.
  • Garg et al. (2018) Saurabh Garg, Tanmay Parekh, and Preethi Jyothi. 2018. Code-switched language models using dual rnns and same-source pretraining. arXiv preprint arXiv:1809.01962.
  • Ghosh et al. (2016) Souvick Ghosh, Satanu Ghosh, and Dipankar Das. 2016. Part-of-speech tagging of code-mixed social media text. In Proceedings of the Second Workshop on Computational Approaches to Code Switching.
  • Huang et al. (2018) Jiaji Huang, Yi Li, Wei Ping, and Liang Huang. 2018. Large margin neural language model. In Proceedings of EMNLP.
  • Kuo et al. (2002) Hong-Kwang Jeff Kuo, Eric Fosler-Lussier, Hui Jiang, and Chin-Hui Lee. 2002. Discriminative training of language models for speech recognition. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1. IEEE.
  • Li and Fung (2012) Ying Li and Pascale Fung. 2012. Code-switch language model with inversion constraints for mixed language speech recognition. In Proceedings of COLING, pages 1671–1680.
  • Li and Fung (2014) Ying Li and Pascale Fung. 2014. Language modeling with functional head constraint for code switching speech recognition. In Proceedings of EMNLP, pages 907–916.
  • Melis et al. (2017) Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. arXiv:1707.05589.
  • Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing lstm language models. arXiv:1708.02182.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech.
  • Molina et al. (2016) Giovanni Molina, Fahad AlGhamdi, Mahmoud Ghoneim, Abdelati Hawwari, Nicolas Rey-Villamizar, Mona Diab, and Thamar Solorio. 2016. Overview for the second shared task on language identification in code-switched data. In Proceedings of the Second Workshop on Computational Approaches to Code Switching.
  • Morin and Bengio (2005) Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 5.
  • Muysken (2000) Pieter Muysken. 2000. Bilingual speech: A typology of code-mixing, volume 11. Cambridge University Press.
  • Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. Dynet: The dynamic neural network toolkit. arXiv:1701.03980.
  • Poplack (1980) Shana Poplack. 1980. Sometimes i’ll start a sentence in spanish y termino en español: toward a typology of code-switching. Linguistics, 18(7-8):581–618.
  • Pratapa et al. (2018) Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, and Kalika Bali. 2018. Language modeling for code-mixing: The role of linguistic theory based synthetic data. In Proceedings of ACL.
  • Roark et al. (2007) Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech & Language, 21(2):373–392.
  • Sennrich (2016) Rico Sennrich. 2016. How grammatical is character-level neural machine translation? assessing MT quality with contrastive translation pairs. arXiv:1612.04629.
  • Solorio et al. (2014) Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, and Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching.
  • Solorio and Liu (2008) Thamar Solorio and Yang Liu. 2008. Part-of-speech tagging for english-spanish code-switched text. In Proceedings of EMNLP, pages 1051–1060.
  • Sreeram and Sinha (2017) Ganji Sreeram and Rohit Sinha. 2017. Language modeling for code-switched data: Challenges and approaches. arXiv:1711.03541.
  • Tiedemann (2009) Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, pages 237–248.
  • Tran et al. (2018) Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. arXiv:1803.03585.
  • Vu et al. (2012) Ngoc Thang Vu, Dau-Cheng Lyu, Jochen Weiner, Dominic Telaar, Tim Schlippe, Fabian Blaicher, Eng-Siong Chng, Tanja Schultz, and Haizhou Li. 2012. A first speech recognition system for mandarin-english code-switch conversational speech. In Proceedings of ICASSP, IEEE International Conference, pages 4889–4892.
  • Vyas et al. (2014) Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. In Proceedings of EMNLP.
  • Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv:1409.2329.

Appendix A Creating Evaluation Dataset – Implementation Details

Our implementation is based on the Carmel FST toolkit (carmel/). We create an FST for converting a sentence into a sequence of phonemes, and its inverse FST. The word-to-phoneme mapping is based on pronunciation dictionaries, according to the language tag of each word in the sentence.

We use The CMU Pronouncing Dictionary (cmudict) for English and a dictionary from CMUSphinx (cmusphinx/files/Acoustic%20and%20Language%20Models/Spanish/) for Spanish. As the phoneme inventories in the two datasets do not match, we map the Spanish phonemes to the CMU dict inventory using a manually constructed mapping. The full mapping from Spanish to English is: ch-CH, rr-R, gn-NG, a-AA, b-B, b-V, e-EY, d-D, d-DH, g-G, f-F, i-IY, k-K, j-H, m-M, n-N, l-L, o-OW, p-P, s-S, r-R, u-UW, t-T, y-Y, x-S, x-SH, x-K S, x-H, z-TH, z-S, ll-L Y, ll-SH. We thank Kyle Gorman for helping with the mapping.
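As an illustration of how the mapping listed above can be applied, the sketch below stores it as a Python dictionary and enumerates every English-inventory rendering of a Spanish phoneme sequence. This is a brute-force stand-in for the FST composition described in the text, not the authors' implementation; the names `ES_TO_EN` and `map_spanish_phonemes` are hypothetical, and the example phoneme sequence is assumed rather than taken from the actual dictionary.

```python
from itertools import product

# Spanish -> CMUdict-style phoneme mapping, copied from the mapping listed above.
# A Spanish phoneme may have several alternatives; each alternative is a tuple
# because some expand to two phonemes (e.g. x -> K S).
ES_TO_EN = {
    "ch": [("CH",)], "rr": [("R",)], "gn": [("NG",)], "a": [("AA",)],
    "b": [("B",), ("V",)], "e": [("EY",)], "d": [("D",), ("DH",)],
    "g": [("G",)], "f": [("F",)], "i": [("IY",)], "k": [("K",)],
    "j": [("H",)], "m": [("M",)], "n": [("N",)], "l": [("L",)],
    "o": [("OW",)], "p": [("P",)], "s": [("S",)], "r": [("R",)],
    "u": [("UW",)], "t": [("T",)], "y": [("Y",)],
    "x": [("S",), ("SH",), ("K", "S"), ("H",)],
    "z": [("TH",), ("S",)], "ll": [("L", "Y"), ("SH",)],
}

def map_spanish_phonemes(phonemes):
    """Enumerate every English-inventory rendering of a Spanish phoneme sequence."""
    options = [ES_TO_EN[p] for p in phonemes]
    return [tuple(ph for alt in combo for ph in alt) for combo in product(*options)]

# Assumed phoneme sequence for "casa"; the real dictionary entry may differ.
variants = map_spanish_phonemes(["k", "a", "s", "a"])
```

Phonemes with more than one target (b, d, x, z, ll) multiply the number of renderings, which is one reason a weighted transducer is a more practical representation than explicit enumeration for full sentences.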

To favor frequent words over infrequent ones, we add unigram probabilities to the edges of the transducer (taken from googlebooks unigrams, ngrams/books/datasetsv2.html). We filter some words that produce noise (for example, single letter words that are too frequent). When creating a monolingual sentence, we use an FST with the words of that language only.

As many phoneme sequences in Spanish do not produce English alternatives (and vice versa) we allow minor changes in the phoneme sequences between the languages. Specifically, we create a small list of similar phonemes (such as “B” and “V”) and generate an FST that, for each phoneme, allows changing it to one of its alternatives or dropping it with low probability. The full list of similar phonemes is: OW - UW, AA - EY, L - M, N - M, M - L, B - P, B - V, V - F, T - D, K - G, S - Z, S - TH, Z - TH, SH - ZH.
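The relaxation described above can be sketched as follows. This is a simplified illustration rather than the authors' FST: it applies exactly one substitution or deletion per variant and ignores probabilities. Reading each listed pair as a one-directional rewrite (left to right) is an assumption of this sketch, and the names `SIMILAR`, `ALTERNATIVES` and `single_edit_variants` are hypothetical.

```python
# Pairs copied from the list of similar phonemes above. Treating each pair as
# a one-directional rewrite (left -> right) is an assumption of this sketch.
SIMILAR = [("OW", "UW"), ("AA", "EY"), ("L", "M"), ("N", "M"), ("M", "L"),
           ("B", "P"), ("B", "V"), ("V", "F"), ("T", "D"), ("K", "G"),
           ("S", "Z"), ("S", "TH"), ("Z", "TH"), ("SH", "ZH")]

ALTERNATIVES = {}
for src, dst in SIMILAR:
    ALTERNATIVES.setdefault(src, []).append(dst)

def single_edit_variants(phonemes):
    """Return all sequences reachable by one substitution or one deletion."""
    phonemes = tuple(phonemes)
    variants = set()
    for i, p in enumerate(phonemes):
        for alt in ALTERNATIVES.get(p, []):
            variants.add(phonemes[:i] + (alt,) + phonemes[i + 1:])  # substitute p
        variants.add(phonemes[:i] + phonemes[i + 1:])               # drop p
    return variants

variants = single_edit_variants(["B", "AA", "T"])
```

Each variant could then be passed through the inverse phoneme-to-word transducer to look for words in the other language.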

Since using the whole sentence has a higher chance of encountering words that are not included in the dictionaries, we only convert a sampled part of the gold sentence when creating a code-switched alternative. This also results in alternatives with higher similarity to the gold sentence. However, when creating a monolingual alternative (e.g. a Spanish alternative to an English gold sentence), we have no choice but to use the whole sentence.

Appendix B Data

B.1 Code-switching corpus

We pre-processed the Bangor Miami Corpus by lower-casing it and tokenizing using the spaCy tokenizer. We did not reduce the vocabulary size, which was quite small to begin with (13,914 words). After preprocessing, we got 45,621 sentences with 322,044 tokens.

B.2 Monolingual corpora

This data, from the OpenSubtitles2018 corpus (Tiedemann, 2009), comes pre-tokenized. We pre-processed it by lower-casing, removing parentheses and their contents, and removing hyphens from the beginning of sentences.
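The three pre-processing steps above can be sketched with the standard `re` module. The exact rules the authors used are not given, so the regular expressions and the function name `preprocess_subtitle_line` below are assumptions; in particular, nested parentheses are not handled.

```python
import re

def preprocess_subtitle_line(line):
    """Lower-case, drop parenthesized spans, and strip leading hyphens."""
    line = line.lower()
    line = re.sub(r"\([^()]*\)", " ", line)   # remove parentheses and their contents
    line = re.sub(r"^\s*(-\s*)+", "", line)   # remove hyphens at the start of the line
    return " ".join(line.split())             # normalize whitespace

cleaned = preprocess_subtitle_line("- ( LAUGHS ) I don't know .")
```

Hyphens inside a sentence are left untouched, since only those at the beginning of a sentence are removed.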

We use 1M lines from each language, resulting in 7,501,714 tokens in English and 6,566,337 tokens in Spanish. We have 45,280 words in the English vocabulary and 50K words in the Spanish one (reduced from 83,615).

Appendix C Architecture and Training Details

The LSTM has a hidden layer of dimension 650. The input embeddings are of dimension 300. We use auto-batching with batches of size 20. We optimize with SGD and a learning rate of 10, reducing it by a factor of 2.5 at the end of each epoch with no improvement. We also use gradient clipping of 1 and weight decay of $10^{-5}$. We initialize the parameters of the LSTM to be in the range of [-0.05, 0.05]. We also use word dropout with a rate of 0.2. We set the dropout in our LSTM (Gal and Ghahramani, 2016) to 0.35. We train for 40 epochs and use the best model on the dev set.
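For reference, the hyperparameters above can be collected in one place, together with a sketch of the learning-rate schedule. The dictionary keys and the function name are illustrative. Two details are assumptions, since the text does not specify them: "no improvement" is measured against the best dev score seen so far, and a higher dev score is better.

```python
# Hyperparameters as listed above; the dictionary keys are illustrative names.
CONFIG = {
    "hidden_dim": 650, "embedding_dim": 300, "batch_size": 20,
    "initial_lr": 10.0, "lr_decay_factor": 2.5, "grad_clip": 1.0,
    "weight_decay": 1e-5, "init_range": 0.05, "word_dropout": 0.2,
    "lstm_dropout": 0.35, "epochs": 40,
}

def lr_schedule(dev_scores, initial_lr=10.0, factor=2.5):
    """Return the learning rate used in each epoch, given per-epoch dev scores
    (assumed higher is better). The rate is divided by `factor` after any epoch
    whose dev score does not beat the best score seen so far."""
    lr, best, rates = initial_lr, float("-inf"), []
    for score in dev_scores:
        rates.append(lr)
        if score > best:
            best = score
        else:
            lr /= factor
    return rates

# Made-up dev scores for five epochs; epochs 3 and 5 show no improvement.
rates = lr_schedule([60.0, 65.0, 64.0, 66.0, 66.0],
                    CONFIG["initial_lr"], CONFIG["lr_decay_factor"])
```

With these made-up scores the rate stays at 10 for the first three epochs and drops to 4 from the fourth epoch on.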