Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification
Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (caco) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
Mozhi Zhang (cs and umiacs, University of Maryland, College Park, MD, USA)
Yoshinari Fujinuma (Computer Science, University of Colorado, Boulder, CO, USA)
Jordan Boyd-Graber (cs, iSchool, lsc, and umiacs, University of Maryland, College Park, MD, USA; now at Google Research Zürich)
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 Introduction: Classifiers across Languages
Modern machine learning methods in natural language processing can learn highly accurate, context-based classifiers (?). Despite this revolution for high-resource languages such as English, some languages are left behind because text data in general, and labeled data in particular, are scarce. Often, the need for a text classifier in a low-resource language is acute, as text classifiers can provide situational awareness in emergent incidents (?). Cross-lingual document classification (?, cldc) attacks this problem by using annotated datasets from a source language to build classifiers for a target language.
cldc works when it can find a shared representation for documents from both languages: train a classifier on source language documents and apply it on target language documents. Previous work uses a bilingual lexicon (?; ?), machine translation (?; ?; ?, mt), topic models (?; ?), cross-lingual word embeddings (?, clwe), or multilingual contextualized embeddings (?) to extract cross-lingual features. But these methods may be impossible in low-resource languages, as they require some combination of large parallel or comparable text, high-coverage dictionaries, and monolingual corpora from a shared domain.
However, as anyone who has puzzled out a Portuguese menu from their high school Spanish knows, the task is not hopeless, as languages do not exist in isolation. Shared linguistic roots, geographic proximity, and history bind languages together; cognates abound, words sound the same, and there are often shared morphological patterns. These similarities are often found not at the word level but at the character level. Therefore, we investigate character-level knowledge transfer for cldc in truly low-resource settings, where unlabeled or parallel data in the target language is also limited or unavailable.
To study knowledge transfer at the character level, we propose a cldc framework, Classification Aided by Convergent Orthography (caco), that capitalizes on character-level similarities between related language pairs. Previous cldc methods treat words as atomic symbols and do not transfer character-level patterns across languages; caco instead uses a bi-level model with two components: a character-based embedder and a word-based classifier.
The embedder exploits shared patterns in related languages to create word representations from character sequences. The classifier then uses the shared representation across languages to label the document. The embedder learns morpho-semantic regularities, while the classifier connects lexical semantics to labels.
To allow cross-lingual transfer, we use a single model with shared character embeddings for both languages. We jointly train the embedder and the classifier on annotated source language documents. The embedder transfers knowledge about source language words to target language words with similar orthographic features.
While the model can be fairly accurate without any target language data, it can also benefit from a small amount of additional information when available. If we have a dictionary, pre-trained monolingual word embeddings, or parallel text, we can fine-tune the model with multi-task learning. We encourage the embedder to produce similar word embeddings for translation pairs from a dictionary, which captures patterns between cognates. We also teach the embedder to mimic pre-trained word embeddings in the source language (?), which exposes the model to more word types. When we have a good reference model in another high-resource language, we can train our model to make predictions similar to those of the reference model on parallel text (?).
We verify the effectiveness of character-level knowledge transfer on two cldc benchmarks. When we have enough data to learn high-quality clwe, training classifiers with clwe as input features is a strong cldc baseline. caco can match the accuracy of clwe-based models without using any target language data, and fine-tuning the embedder with a small amount of additional resources improves caco’s accuracy. Finally, caco is also useful when we have enough resources to train good clwe—using clwe as extra features, caco outperforms the baseline clwe-based models by a large margin.
2 caco: Classification Aided by Convergent Orthography
This section introduces our method, caco, which trains a multilingual document classifier using labeled datasets in a source language $\mathcal{S}$ and applies the classifier to a low-resource target language $\mathcal{T}$. We focus on the setting where $\mathcal{S}$ and $\mathcal{T}$ are related and have similar orthographic features.
2.1 Model Architecture
Let $\mathbf{w}$ be an input document with a sequence of words $\mathbf{w} = \langle w_1, w_2, \ldots, w_n \rangle$, where each word $w_i$ is a sequence of characters. Our model maps the document $\mathbf{w}$ to a distribution over possible labels $y$ in two steps (Figure 1). First, we generate a word embedding $\mathbf{v}_i$ for each input word $w_i$ using a character-based embedder $e$:
$$\mathbf{v}_i = e(w_i). \tag{1}$$
We then feed the word embeddings to a word-based classifier $f$ to compute the distribution over labels $y$:
$$p(y \mid \mathbf{w}) = f(\langle \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n \rangle). \tag{2}$$
We can use any sequence model for the embedder $e$ and the classifier $f$. For our experiments, we use a bidirectional lstm (?, bi-lstm) embedder and a deep averaging network (?, dan) classifier.
bi-lstm is a powerful sequence model that captures complex non-local dependencies. Character-based bi-lstm embedders are used in many natural language processing tasks (?; ?; ?). To embed a word $w_i$, we pass the character sequence $w_i$ to a left-to-right lstm and the reversed character sequence to a right-to-left lstm. We concatenate the final hidden states of the two lstms and apply a linear transformation:
$$e(w_i) = \mathbf{W}_e \left[ \overrightarrow{\mathrm{LSTM}}(w_i) ; \overleftarrow{\mathrm{LSTM}}(w_i) \right] + \mathbf{b}_e, \tag{3}$$
where the functions $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ compute the final hidden states of the two lstms.
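The embedding step can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it assumes a standard lstm cell with input, forget, and output gates, and the names `LSTM` and `CharEmbedder`, the random initialization, and the default sizes (read here as forty hidden units per direction) are hypothetical choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LSTM:
    """Single-direction LSTM that returns only its final hidden state."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix and bias per gate (i, f, o) and candidate (c),
        # each applied to the concatenation [x; h_prev].
        self.weights = {g: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
                        for g in "ifoc"}
        self.biases = {g: [0.0] * hidden_dim for g in "ifoc"}

    def final_state(self, inputs):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in inputs:
            xh = x + h  # list concatenation
            pre = {g: [a + b for a, b in zip(matvec(self.weights[g], xh),
                                             self.biases[g])]
                   for g in "ifoc"}
            i = [sigmoid(a) for a in pre["i"]]
            f = [sigmoid(a) for a in pre["f"]]
            o = [sigmoid(a) for a in pre["o"]]
            cand = [math.tanh(a) for a in pre["c"]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


class CharEmbedder:
    """Maps a word to a vector using only its character sequence."""

    def __init__(self, alphabet, char_dim=10, hidden_dim=40, out_dim=40, seed=0):
        rng = random.Random(seed)
        self.char_vectors = {ch: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)]
                             for ch in sorted(alphabet)}
        self.forward = LSTM(char_dim, hidden_dim, rng)   # left-to-right
        self.backward = LSTM(char_dim, hidden_dim, rng)  # right-to-left
        self.W = rand_matrix(out_dim, 2 * hidden_dim, rng)
        self.b = [0.0] * out_dim

    def embed(self, word):
        chars = [self.char_vectors[ch] for ch in word]
        left = self.forward.final_state(chars)
        right = self.backward.final_state(chars[::-1])
        # Linear transformation of the concatenated final states.
        return [a + b for a, b in zip(matvec(self.W, left + right), self.b)]
```

Because `embed` looks only at characters, two words with the same spelling receive the same vector regardless of the language they come from.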
A dan is an unordered model that passes the arithmetic mean of the input word embeddings through a multilayer perceptron and feeds the final layer’s representation to a softmax layer. dan ignores cross-lingual variations in word order (i.e., syntax) and thus generalizes well in cldc. Despite its simplicity, dan has near state-of-the-art accuracies on both monolingual (?) and cross-lingual document classification (?).
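Let $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$ be the word embeddings generated by the character-based embedder. dan uses the average of the word embeddings as the document representation $\mathbf{z}_0$:
$$\mathbf{z}_0 = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_i, \tag{4}$$
and $\mathbf{z}_0$ is passed through $k$ layers of non-linearity:
$$\mathbf{z}_j = g(\mathbf{W}_j \mathbf{z}_{j-1} + \mathbf{b}_j), \tag{5}$$
where $j$ ranges from 1 to $k$, and $g$ is a non-linear activation function. The final representation $\mathbf{z}_k$ is passed to a softmax layer to obtain a distribution over the label $y$,
$$p(y \mid \mathbf{w}) = \mathrm{softmax}(\mathbf{W}_{k+1} \mathbf{z}_k + \mathbf{b}_{k+1}). \tag{6}$$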
We use the same classifier parameters across languages. In other words, the dan classifier is language-independent. This is possible because the embedder generates consistent word representations across related languages, which we discuss in the next section.
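The dan forward pass described above (average, a stack of non-linear layers, softmax) can be sketched as follows. The function names and the list-of-lists weight format are illustrative assumptions, and ReLU is used as the activation, matching the layers reported in the training details.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, vector):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def dan_forward(word_vectors, hidden_layers, output_layer):
    """Return a label distribution for one document.

    word_vectors: list of equal-length vectors, one per word.
    hidden_layers: list of (weights, bias) pairs.
    output_layer: a single (weights, bias) pair feeding the softmax.
    """
    n = len(word_vectors)
    z = [sum(column) / n for column in zip(*word_vectors)]  # document average
    for weights, bias in hidden_layers:
        z = relu(affine(weights, bias, z))
    weights, bias = output_layer
    return softmax(affine(weights, bias, z))
```

Since the first step is an average, permuting the words of a document leaves the prediction unchanged, which is why the classifier is insensitive to word-order differences between languages.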
2.2 Character-Level Cross-Lingual Transfer
To transfer character-level information across languages, the embedder uses the same character embeddings for both languages. The character-level bi-lstm vocabulary is the union of the alphabets for the two languages, and the embedder does not differentiate identical characters from different languages. For example, a Spanish “a” has the same character embedding as a French “a”. Consequently, the embedder maps words with similar forms from both languages to similar vectors.
If the source language and the target language are orthographically similar, the embedder can generalize knowledge learned about source language words to target language words through shared orthographic features. As an example, if the model learns that the Spanish word “religioso” (religious) is predictive of label $y$, the model automatically infers that “religioso” in Italian is also predictive of $y$, even though the model never sees any Italian text.
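A small sketch of the shared character vocabulary makes the mechanism concrete. The helper names `build_shared_alphabet` and `encode` and the toy word lists are hypothetical; the point is only that a character receives one index no matter which language it appears in.

```python
def build_shared_alphabet(*corpora):
    """Map every character seen in any corpus to a single integer id."""
    characters = set()
    for corpus in corpora:
        for word in corpus:
            characters.update(word)
    return {ch: idx for idx, ch in enumerate(sorted(characters))}


def encode(word, char_to_id):
    """Turn a word into the id sequence the embedder would consume."""
    return [char_to_id[ch] for ch in word]


spanish_words = ["religioso", "tiempo", "concierto"]
italian_words = ["religioso", "tempo", "concerto"]
char_to_id = build_shared_alphabet(spanish_words, italian_words)

# The Spanish and Italian "religioso" are indistinguishable to the embedder.
spanish_ids = encode(spanish_words[0], char_to_id)
italian_ids = encode(italian_words[0], char_to_id)
```

Cognates that differ by a character or two, such as “tiempo” and “tempo”, share most of their id sequence, which is what lets a character-level model place them near each other.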
In our experiments, we focus on related language pairs that share the same script. For related languages with different scripts, we can apply caco to the output of a transliteration tool or a grapheme-to-phoneme transducer (?). We leave this to future work.
2.3 Training Objective
Our main objective is supervised document classification. We jointly train the classifier and the embedder to minimize average negative log-likelihood on labeled source language documents $S$:
$$L_s(\theta) = -\frac{1}{|S|} \sum_{\langle \mathbf{w}, y \rangle \in S} \log p(y \mid \mathbf{w}), \tag{7}$$
where $\theta$ is a vector representing all model parameters, and $S$ is a set of source language examples with words $\mathbf{w}$ and label $y$.
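As a worked illustration of the supervised objective, the average negative log-likelihood can be computed from predicted label distributions as below. The function name and the toy numbers are assumptions made for the example.

```python
import math


def average_nll(predictions, gold_labels):
    """Mean negative log-probability assigned to the correct labels.

    predictions: one probability list per document.
    gold_labels: the index of the correct label for each document.
    """
    total = 0.0
    for probs, label in zip(predictions, gold_labels):
        total -= math.log(probs[label])
    return total / len(gold_labels)


# Two documents, two labels: the model is unsure about the first
# and fairly confident (and correct) about the second.
loss = average_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```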
Sometimes we have additional resources for the source or target language. We use them to improve caco with multi-task learning (?) via three auxiliary tasks.
Word Translation (dict).
There are many patterns when translating cognate words between related languages. For example, Italian “e” often becomes “ie” in Spanish. “Tempo” (time) in Italian becomes “tiempo” in Spanish, and “concerto” (concert) in Italian becomes “concierto” in Spanish. The embedder can learn these word translation patterns from a bilingual dictionary.
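Let $D$ be a bilingual dictionary with a set of word pairs $\langle w, w' \rangle$, where $w$ and $w'$ are translations of each other. We add a term to our objective to minimize average squared Euclidean distances between the embeddings of translation pairs (?):
$$L_d(\theta) = \frac{1}{|D|} \sum_{\langle w, w' \rangle \in D} \lVert e(w) - e(w') \rVert_2^2. \tag{8}$$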
Mimicking Word Embeddings (mim).
Monolingual text classifiers often benefit from initializing embeddings with word vectors pre-trained on large unlabeled corpora (?). This semi-supervised learning strategy helps the model generalize to word types outside labeled training data. Similarly, our embedder can mimic (?) existing source language word embeddings to generalize better.
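Suppose we have a pre-trained source language word embedding matrix $\mathbf{X}$ with $V$ rows. The $i$-th row $\mathbf{x}_i$ is a vector for the $i$-th word type $w_i$. We add an objective to minimize the average squared Euclidean distances between the output of the embedder and $\mathbf{X}$:
$$L_e(\theta) = \frac{1}{V} \sum_{i=1}^{V} \lVert e(w_i) - \mathbf{x}_i \rVert_2^2. \tag{9}$$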
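Both the word translation term and the mimicking term are averages of squared Euclidean distances, so one helper can serve both. The sketch below is illustrative only: `toy_embed` is a hypothetical stand-in for the character-based embedder, and the dictionary and pre-trained vectors are made up.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_squared_distance(vector_pairs):
    pairs = list(vector_pairs)
    return sum(squared_distance(u, v) for u, v in pairs) / len(pairs)


def dictionary_loss(embed, dictionary):
    """Word translation term: pull translation pairs together."""
    return mean_squared_distance((embed(w), embed(t)) for w, t in dictionary)


def mimic_loss(embed, pretrained):
    """Mimicking term: pull embedder outputs toward pre-trained vectors."""
    return mean_squared_distance((embed(w), x) for w, x in pretrained.items())


def toy_embed(word):
    # Hypothetical embedder: counts of "e" and "i" plus the word length.
    return [float(word.count("e")), float(word.count("i")), float(len(word))]


pairs = [("tempo", "tiempo"), ("concerto", "concierto")]
d_loss = dictionary_loss(toy_embed, pairs)
m_loss = mimic_loss(toy_embed, {"tempo": [1.0, 0.0, 5.0]})
```

With this toy embedder each Italian and Spanish pair differs by one extra “i” and one extra character of length, so both terms are easy to verify by hand.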
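Table 1: Resource requirements and average rcv2 accuracy of caco variants and baselines.
|Model|src|dict|mim|all|clwe|sup|com|
|Source labeled data|✓|✓|✓|✓|✓|✓|✓|
|Pre-trained source embedding| | |✓|✓| | | |
|Target labeled data| | | | | |✓| |
|rcv2 average accuracy|50.0|55.7|51.5|54.7|51.6|51.9|64.5|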
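Sometimes we have a reliable reference classifier in another high-resource language $\mathcal{S}'$ (e.g., English). If we have parallel text between $\mathcal{S}$ and $\mathcal{S}'$, we can use knowledge distillation (?) to supply additional training signal. Let $P$ be a set of parallel documents $\langle \mathbf{w}, \mathbf{w}' \rangle$, where $\mathbf{w}$ is from source language $\mathcal{S}$, and $\mathbf{w}'$ is the translation of $\mathbf{w}$ in $\mathcal{S}'$. We add another objective term to minimize the average Kullback-Leibler divergence between the predictions of our model and the reference model:
$$L_p(\theta) = \frac{1}{|P|} \sum_{\langle \mathbf{w}, \mathbf{w}' \rangle \in P} \mathrm{KL}\left( p_h(y \mid \mathbf{w}') \,\Vert\, p(y \mid \mathbf{w}) \right), \tag{10}$$
where $p_h$ is the output of the reference classifier (in language $\mathcal{S}'$), and $p$ is the output of caco. In § 3, we mark models that use knowledge distillation with a superscript “p”.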
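The distillation term can be illustrated with a few lines of Python. The direction of the divergence (reference distribution first, model distribution second) is an assumption that follows common knowledge-distillation practice, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(reference_outputs, model_outputs):
    """Average KL divergence over a set of parallel documents.

    reference_outputs[i] is the reference classifier's distribution on the
    high-resource side of pair i; model_outputs[i] is the distribution the
    trained model assigns to the other side of the same pair.
    """
    pairs = list(zip(reference_outputs, model_outputs))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
```

The loss is zero exactly when the model reproduces the reference distribution on every parallel document, so no gold labels are needed for this term.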
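We train on the four tasks jointly. Our final objective is:
$$L(\theta) = L_s(\theta) + \lambda_d L_d(\theta) + \lambda_e L_e(\theta) + \lambda_p L_p(\theta), \tag{11}$$
where the hyperparameters $\lambda_d$, $\lambda_e$, and $\lambda_p$ trade off between the four tasks.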
3 Experiments
When the source language and the target language are related, we expect character-level knowledge transfer to be more data-efficient than word-level knowledge transfer because character-level transfer allows generalization across words with similar forms. We test this by comparing caco models trained in low-resource settings with clwe-based models trained in high-resource settings on two cldc datasets. We also compare caco with a supervised monolingual model. On both datasets, caco models have average accuracy similar to that of the baselines while requiring much less target language data. Finally, we train models that combine caco with clwe, which have significantly higher accuracy than models with only clwe as features. These results confirm that character-level similarities between related languages effectively transfer knowledge for cldc.
3.1 Classification Dataset
Our first dataset is the Reuters multilingual corpus (rcv2), a collection of news stories labeled with four topics (?): Corporate/Industrial (ccat), Economics (ecat), Government/Social (gcat), and Markets (mcat). Following ? (?), we remove documents with multiple topic labels. For each language, we sample 1,500 training documents and 200 test documents with balanced labels. We conduct cldc experiments between two North Germanic languages, Danish (da) and Swedish (sv), and three Romance languages, French (fr), Italian (it), and Spanish (es).
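One way to draw such a label-balanced sample is sketched below. It assumes that “balanced” means an equal number of documents per topic (for four topics, 375 training and 50 test documents each) and that multi-label documents have already been removed; the function name and the data format are hypothetical.

```python
import random
from collections import defaultdict


def sample_balanced(documents, per_label, seed=0):
    """Draw `per_label` documents for every label, without replacement.

    documents: list of (text, label) pairs with exactly one label each.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in documents:
        by_label[label].append((text, label))
    sample = []
    for label in sorted(by_label):
        # Raises ValueError if a label has fewer than per_label documents.
        sample.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(sample)
    return sample


topics = ["ccat", "ecat", "gcat", "mcat"]
corpus = [(f"doc {i}", topics[i % 4]) for i in range(4000)]
train = sample_balanced(corpus, per_label=1500 // len(topics))
```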
To test caco on truly low-resource languages, we build a second cldc dataset with famine-related documents sampled from Tigrinya (ti) and Amharic (am) lorelei language packs (?). We train binary classifiers to detect whether the document describes widespread crime or not. For Tigrinya documents, the labels are extracted from the situation frame annotation in the language pack. We mark all documents with a “widespread crime/violence” situation frame as positive. The Amharic language pack does not have annotations, so we label Amharic sentences based on English reference translations included in the language pack. Our dataset contains 394 Tigrinya and 370 Amharic documents with balanced labels.
3.2 Models
We compare caco trained under low-resource settings with word-based models that use more resources. Table 1 summarizes our models.
We experiment with several variants of caco that use different resources. The src model uses the fewest resources: it is trained only on labeled source language documents and does not use any unlabeled data. The dict model requires a dictionary and is trained with the word translation auxiliary task. The mim model requires a pre-trained source language embedding and uses the mimicking auxiliary task. The all model is the most expensive variant: it is trained with both the word translation and the mimicking auxiliary tasks. In lorelei experiments, we also use knowledge distillation to provide more classification signal for some models. We mark these models with a superscript “p”.
Our first word-based model is a dan with pre-trained multiCCA clwe features (?). The clwe are trained on large target language corpora with millions of tokens and high-coverage dictionaries with hundreds of thousands of word types. In contrast, we train caco models in a simulated low-resource setting with little or no target language data. Despite the resource gap, caco models have average test accuracy similar to that of clwe-based models, demonstrating the effectiveness of character-level transfer learning.
Next, we compare caco with a lightly-supervised monolingual model (sup), a word-based dan trained on fifty labeled target language documents. We only apply this baseline to rcv2, because the labeled document sets in lorelei are too small to split further. The supervised model requires labeled target language documents, which often do not exist for low-resource languages. Without using any target language supervision, caco models have test accuracies similar to (and sometimes higher than) those of sup, showing that caco effectively learns from a related language.
Finally, we experiment with a model that combines caco and clwe (com) by feeding pre-trained clwe as additional features for the classifier of a caco model (src variant). This model requires the same resources as the clwe-based model. The combined model on average has much higher accuracy than both the caco variants and the clwe-based model, showing that character-level knowledge transfer is useful even when we have enough unlabeled data to train high-quality clwe.
3.3 Auxiliary Task Data
Some of the caco models (dict and all) use a dictionary to learn word translation patterns. We train them on the same training dictionary used for pre-training the clwe. To simulate the low-resource setting, we sample only 100 translation pairs from the original dictionary for caco. Pilot experiments confirm that a larger dictionary can help, but we focus on the low-resource setting where only a small dictionary is available.
The Amharic labeled dataset is very small compared to other languages because each Amharic example only contains one sentence. As introduced in Section 2.3, one way to provide additional training signal is by knowledge distillation from a third high-resource language. For the Amharic to Tigrinya cldc experiment, we apply knowledge distillation using English-Amharic parallel text. We first train a reference English dan on a large collection of labeled English documents compiled from other lorelei language packs. We then use the knowledge distillation objective to train the caco models to match the output of the English model on 1,200 English-Amharic parallel documents sampled from the Amharic language pack. To avoid introducing extra label bias, we sample the parallel documents such that the English model output approximately follows a uniform distribution.
We do not use knowledge distillation on other language pairs. For rcv2, we already have enough labeled examples and therefore do not need knowledge distillation. For the Tigrinya to Amharic cldc experiment, we do not have enough unlabeled parallel text in the Tigrinya language pack to apply knowledge distillation.
3.4 Training Details
For clwe-based models, we use forty-dimensional multiCCA word embeddings (?). We use three ReLU layers with 100 hidden units and 0.1 dropout for the clwe-based dan models and the dan classifier of the caco models. The bi-lstm embedder uses ten-dimensional character embeddings and forty hidden states with no dropout. The outputs of the embedder are forty-dimensional word embeddings. We set $\lambda_d$ to 1, $\lambda_e$ to , and $\lambda_p$ to 1 in the multi-task objective (Equation 11). The hyperparameters are tuned in a pilot Italian-Spanish cldc experiment using held-out datasets.
All models are trained with Adam (?) with default settings. We run the optimizer for a hundred epochs with mini-batches of sixteen documents. For models that use additional resources, we also sample sixteen examples from each type of training data (translation pairs, pre-trained embeddings, or parallel text) to estimate the gradients of the auxiliary task objectives $L_d$, $L_e$, and $L_p$ (defined in Section 2.3) at each iteration.
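The batching scheme described above can be sketched as a generator that pairs each mini-batch of labeled documents with a fresh sample from every auxiliary resource. The function name, the dictionary keys, and the choice to cap a sample at the size of a small resource are assumptions for illustration; the gradient update itself is omitted.

```python
import random


def iterate_epoch(documents, auxiliary, rng, batch_size=16):
    """Yield one dict of mini-batches per optimizer step.

    documents: labeled source documents, visited once per epoch.
    auxiliary: mapping from resource name to a list of examples
        (e.g. translation pairs, pre-trained embeddings, parallel text).
    """
    order = documents[:]
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = {"documents": order[start:start + batch_size]}
        for name, data in auxiliary.items():
            # Sample independently at every step; cap at the resource size.
            batch[name] = rng.sample(data, min(batch_size, len(data)))
        yield batch


rng = random.Random(0)
labeled = list(range(40))
resources = {"translation_pairs": list(range(100)),
             "parallel_text": list(range(5))}
batches = list(iterate_epoch(labeled, resources, rng))
```

Each yielded batch would feed one evaluation of the multi-task objective, with the auxiliary losses estimated on their own small samples.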
3.5 Effectiveness of caco
We train each model using ten different random seeds and report their average test accuracy. For models that use dictionaries, we also re-sample the training dictionary for each run. Table 1 compares the resource requirements and average rcv2 accuracy of caco and the baselines. Tables 2 and 3 show test accuracies on nine related language pairs from rcv2 and lorelei.
Character-Level Knowledge Transfer.
Experiments confirm that character-level knowledge transfer is sample-efficient and complementary to word-level knowledge transfer. The low-resource character-based caco models have average test accuracy similar to that of the high-resource word-based models. The src variant does not use any target language data, and yet its average test accuracy on rcv2 (50.0%) is very close to the clwe model (51.6%) and the supervised model sup (51.6%). When we already have a good clwe, we can get the best of both worlds by combining them (com), which has a much higher average test accuracy (64.5%) than caco and the two baselines.
Training caco with multi-task learning further improves the accuracy. For almost all language pairs, the multi-task caco variants have higher test accuracies than src. On rcv2, word translation (dict) is particularly effective even with only 100 translation pairs. It increases average test accuracy from 50.0% to 55.7%, outperforming both word-based baseline models. Interestingly, the word translation and mimicking tasks together (all) do not consistently increase the accuracy over only using the dictionary (dict). On the lorelei dataset, where labeled documents are limited, knowledge distillation (src$^p$ and mim$^p$) also increases accuracies by around 1.5%.
We expect character-level knowledge transfer to be less effective when the source language and the target language are less closely related. For comparison, we experiment on rcv2 with transferring between more distantly related language pairs: a North Germanic language and a Romance language (Table 4). Indeed, caco models score consistently lower than the clwe-based models when transferring from a North Germanic source language to a Romance target language. However, caco models are surprisingly competitive with clwe-based models when transferring in the opposite direction. This asymmetry is likely due to morphological differences between the two language families. Unfortunately, our datasets only have a limited number of language families. We leave a more systematic study of how language proximity affects the effectiveness of caco to future work.
Languages can be similar along different dimensions, and therefore adding more source languages may be beneficial. On rcv2, we experiment with training caco models on two Romance languages and testing on a third Romance language. Moreover, using multiple source languages has a regularization effect and prevents the model from overfitting to a single source language. For fair comparison, we sample 750 training documents from each source language, so that the multi-source models are still trained on 1,500 training documents (like the single-source models). We use a similar strategy to sample the training dictionaries and pre-trained word embeddings. Multi-source models (Table 5) consistently have higher accuracies than single-source models (Table 2).
Learned Word Representation.
Word translation is a popular intrinsic evaluation task for cross-lingual word representations. Therefore, we evaluate the word representations learned by the bi-lstm embedder on a word translation benchmark. Specifically, we use the src embedder to generate embeddings for all French, Italian, and Spanish words that appear in multiCCA’s vocabulary and translate each word with nearest-neighbor search. Table 6 shows the top-1 word translation accuracy on the test dictionaries from muse (?). Although the src embedder is not exposed to any cross-lingual signal, it rivals clwe on the word translation task by exploiting character-level similarities between languages.
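The evaluation procedure can be sketched as nearest-neighbor retrieval followed by a top-1 check against the test dictionary. Cosine similarity is an assumption here, since the text does not name the similarity measure, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translate(word, source_vectors, target_vectors):
    """Return the target word whose vector is closest to the source word."""
    query = source_vectors[word]
    return max(target_vectors, key=lambda t: cosine(query, target_vectors[t]))


def top1_accuracy(test_dictionary, source_vectors, target_vectors):
    """Fraction of source words whose nearest neighbor is a gold translation.

    test_dictionary: mapping from a source word to a set of valid translations.
    """
    hits = sum(translate(s, source_vectors, target_vectors) in gold
               for s, gold in test_dictionary.items())
    return hits / len(test_dictionary)


spanish = {"tiempo": [1.0, 0.1], "concierto": [0.1, 1.0]}
italian = {"tempo": [1.0, 0.0], "concerto": [0.0, 1.0]}
gold = {"tiempo": {"tempo"}, "concierto": {"concerto"}}
accuracy = top1_accuracy(gold, spanish, italian)
```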
To understand how cross-lingual character-level similarity helps classification, we manually compare the output of a clwe-based model and a caco model (dict variant) from the Spanish to Italian cldc experiment. Sometimes caco avoids the mistakes of clwe-based models by correctly aligning word pairs that are misaligned in the pre-trained clwe. For example, in the clwe, “relevancia” (relevance) is the closest Spanish word for the Italian word “interesse” (interest), while the caco embedder maps both the Italian word “interesse” (interest) and the Spanish word “interesse” (interest) to the same point. Consequently, caco correctly classifies an Italian document about the interest rate with gcat (government), while the clwe-based model predicts mcat (market).
4 Related Work
Previous cldc methods are typically word-based and rely on one of the following cross-lingual signals to transfer knowledge: large bilingual lexicons (?; ?), mt systems (?; ?; ?), or clwe (?). One exception is the recently proposed multilingual BERT (?; ?), which uses a subword vocabulary. Unfortunately, some languages do not have these resources. caco can help bridge the resource gap. By exploiting character-level similarities between related languages, caco can work effectively with little or no target language data.
To adapt clwe to low-resource settings, recent unsupervised clwe methods (?; ?) do not use a dictionary or parallel text. These methods can be further improved with careful normalization (?) and interactive refinement (?). However, unsupervised clwe methods still require large monolingual corpora in the target language, and they might fail when the monolingual corpora of the two languages come from different domains (?; ?) and when the two languages have different morphology (?). In contrast, caco does not require any target language data.
Cross-lingual transfer at the character level has been used successfully in low-resource paradigm completion (?), morphological tagging (?), part-of-speech tagging (?), and named entity recognition (?; ?; ?; ?), where the authors train a character-level model jointly on a small labeled corpus in the target language and a large labeled corpus in the source language. Our method is similar in spirit, but we focus on cldc, where it is less obvious whether orthographic features are helpful. Moreover, we introduce a novel multi-task objective to use different types of monolingual and cross-lingual resources.
5 Conclusion
We investigate character-level knowledge transfer between related languages for cldc. Our transfer learning scheme, caco, exploits character-level similarities between related languages through shared character representations to generalize from source language data. Empirical evaluation on multiple related language pairs confirms that character-level knowledge transfer is highly effective.
Acknowledgments
We thank the members of UMD CLIP and the anonymous reviewers for their feedback. Zhang and Boyd-Graber are supported by DARPA award HR0011-15-C-0113 under subcontract to Raytheon BBN Technologies. Fujinuma and Boyd-Graber are supported by NSF grant IIS-1564275. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.
References
- [Ammar et al. 2016] Ammar, W.; Mulcaire, G.; Tsvetkov, Y.; Lample, G.; Dyer, C.; and Smith, N. A. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
- [Andrade et al. 2015] Andrade, D.; Sadamasa, K.; Tamura, A.; and Tsuchida, M. 2015. Cross-lingual text classification using topic-dependent word probabilities. In NAACL.
- [Artetxe, Labaka, and Agirre 2018] Artetxe, M.; Labaka, G.; and Agirre, E. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In ACL.
- [Ballesteros, Dyer, and Smith 2015] Ballesteros, M.; Dyer, C.; and Smith, N. A. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP.
- [Banea et al. 2008] Banea, C.; Mihalcea, R.; Wiebe, J.; and Hassan, S. 2008. Multilingual subjectivity analysis using machine translation. In EMNLP.
- [Bharadwaj et al. 2016] Bharadwaj, A.; Mortensen, D. R.; Dyer, C.; and Carbonell, J. G. 2016. Phonologically aware neural model for named entity recognition in low resource transfer settings. In EMNLP.
- [Chen et al. 2018] Chen, X.; Sun, Y.; Athiwaratkun, B.; Cardie, C.; and Weinberger, K. 2018. Adversarial deep averaging networks for cross-lingual sentiment classification. TACL 6:557–570.
- [Collobert et al. 2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural language processing (almost) from scratch. JMLR 12:2493–2537.
- [Conneau et al. 2018] Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2018. Word translation without parallel data. In ICLR.
- [Cotterell and Duh 2017] Cotterell, R., and Duh, K. 2017. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In IJCNLP.
- [Cotterell and Heigold 2017] Cotterell, R., and Heigold, G. 2017. Cross-lingual character-level neural morphological tagging. In EMNLP.
- [Czarnowska et al. 2019] Czarnowska, P.; Ruder, S.; Grave, E.; Cotterell, R.; and Copestake, A. 2019. Don’t forget the long tail! a comprehensive analysis of morphological generalization in bilingual lexicon induction. In EMNLP.
- [Devlin et al. 2019] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- [Fujinuma, Boyd-Graber, and Paul 2019] Fujinuma, Y.; Boyd-Graber, J.; and Paul, M. J. 2019. A resource-free evaluation metric for cross-lingual word embeddings based on graph modularity. In ACL.
- [Graves and Schmidhuber 2005] Graves, A., and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.
- [Iyyer et al. 2015] Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; and Daumé III, H. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL.
- [Kann, Cotterell, and Schütze 2017] Kann, K.; Cotterell, R.; and Schütze, H. 2017. One-shot neural cross-lingual transfer for paradigm completion. In ACL.
- [Kim et al. 2017] Kim, J.-K.; Kim, Y.-B.; Sarikaya, R.; and Fosler-Lussier, E. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In EMNLP.
- [Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
- [Klementiev, Titov, and Bhattarai 2012] Klementiev, A.; Titov, I.; and Bhattarai, B. 2012. Inducing crosslingual distributed representations of words. In COLING.
- [Lample et al. 2016] Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In NAACL.
- [Lewis et al. 2004] Lewis, D. D.; Yang, Y.; Rose, T. G.; and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. JMLR 5(Apr):361–397.
- [Lin et al. 2018] Lin, Y.; Yang, S.; Stoyanov, V.; and Ji, H. 2018. A multi-lingual multi-task architecture for low-resource sequence labeling. In ACL.
- [Ling et al. 2015] Ling, W.; Dyer, C.; Black, A. W.; Trancoso, I.; Fermandez, R.; Amir, S.; Marujo, L.; and Luís, T. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.
- [Mikolov, Le, and Sutskever 2013] Mikolov, T.; Le, Q. V.; and Sutskever, I. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
- [Mimno et al. 2009] Mimno, D.; Wallach, H.; Naradowsky, J.; Smith, D.; and McCallum, A. 2009. Polylingual topic models. In EMNLP.
- [Mortensen, Dalmia, and Littell 2018] Mortensen, D. R.; Dalmia, S.; and Littell, P. 2018. Epitran: Precision G2P for many languages. In LREC.
- [Pinter, Guthrie, and Eisenstein 2017] Pinter, Y.; Guthrie, R.; and Eisenstein, J. 2017. Mimicking word embeddings using subword RNNs. In EMNLP.
- [Rijhwani et al. 2019] Rijhwani, S.; Xie, J.; Neubig, G.; and Carbonell, J. G. 2019. Zero-shot neural transfer for cross-lingual entity linking. In AAAI.
- [Shi, Mihalcea, and Tian 2010] Shi, L.; Mihalcea, R.; and Tian, M. 2010. Cross language text classification by model translation and semi-supervised learning. In EMNLP.
- [Søgaard, Ruder, and Vulić 2018] Søgaard, A.; Ruder, S.; and Vulić, I. 2018. On the limitations of unsupervised bilingual dictionary induction. In ACL.
- [Strassel and Tracey 2016] Strassel, S., and Tracey, J. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In LREC.
- [Wan 2009] Wan, X. 2009. Co-training for cross-lingual sentiment classification. In ACL.
- [Wu and Dredze 2019] Wu, S., and Dredze, M. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In EMNLP.
- [Xu and Yang 2017] Xu, R., and Yang, Y. 2017. Cross-lingual distillation for text classification. In ACL.
- [Yuan et al. 2019] Yuan, M.; Zhang, M.; Van Durme, B.; Findlater, L.; and Boyd-Graber, J. 2019. Interactive refinement of cross-lingual word embeddings. arXiv preprint arXiv:1911.03070.
- [Yuan, Van Durme, and Boyd-Graber 2018] Yuan, M.; Van Durme, B.; and Boyd-Graber, J. 2018. Multilingual anchoring: Interactive topic modeling and alignment across languages. In NeurIPS.
- [Zhang et al. 2019] Zhang, M.; Xu, K.; Kawarabayashi, K.; Jegelka, S.; and Boyd-Graber, J. 2019. Are girls neko or shōjo? Cross-lingual alignment of non-isomorphic embeddings with Iterative Normalization. In ACL.
- [Zhou, Wan, and Xiao 2016] Zhou, X.; Wan, X.; and Xiao, J. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In ACL.