Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

  • 2019-11-20 18:57:21
  • Mozhi Zhang, Yoshinari Fujinuma, Jordan Boyd-Graber


Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.



Mozhi Zhang
cs and umiacs
University of Maryland
College Park, MD, USA
[email protected]

Yoshinari Fujinuma
Computer Science
University of Colorado
Boulder, CO, USA
[email protected]

Jordan Boyd-Graber (now at Google Research Zürich)
cs, iSchool, lsc, and umiacs
University of Maryland
College Park, MD, USA
[email protected]


Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Computation graph of caco on an example sentence (“cats eat fish”). Bottom: Each input word 𝐰i is mapped to a vector 𝐯i by passing its characters through a bi-lstm embedder. Top: Word vectors {𝐯𝐢} are then passed through a dan classifier to predict the label y. Specifically, dan transforms the average of the word vectors 𝐳0 with k layers of non-linearity and a final softmax layer.

1 Introduction: Classifiers across Languages

Modern machine learning methods in natural language processing can learn highly accurate, context-based classifiers (?). Despite this revolution for high-resource languages such as English, some languages are left behind because of the dearth of text data in general and of labeled data in particular. Often, the need for a text classifier in a low-resource language is acute, as text classifiers can provide situational awareness in emergent incidents (?). Cross-lingual document classification (?, cldc) attacks this problem by using annotated datasets from a source language to build classifiers for a target language.

cldc works when it can find a shared representation for documents from both languages: train a classifier on source language documents and apply it on target language documents. Previous work uses a bilingual lexicon (??), machine translation (???, mt), topic models (??), cross-lingual word embeddings (?, clwe), or multilingual contextualized embeddings (?) to extract cross-lingual features. But these methods may be impossible in low-resource languages, as they require some combination of large parallel or comparable text, high-coverage dictionaries, and monolingual corpora from a shared domain.

However, as anyone who has puzzled out a Portuguese menu from their high school Spanish knows, the task is not hopeless, as languages do not exist in isolation. Shared linguistic roots, geographic proximity, and history bind languages together; cognates abound, words sound the same, and there are often shared morphological patterns. These similarities are often found not at the word level but at the character level. Therefore, we investigate character-level knowledge transfer for cldc in truly low-resource settings, where unlabeled or parallel data in the target language is also limited or unavailable.

To study knowledge transfer at the character level, we propose a cldc framework, Classification Aided by Convergent Orthography (caco), that capitalizes on character-level similarities between related language pairs. Previous cldc methods treat words as atomic symbols and do not transfer character-level patterns across languages; caco instead uses a bi-level model with two components: a character-based embedder and a word-based classifier.

The embedder exploits shared patterns in related languages to create word representations from character sequences. The classifier then uses the shared representation across languages to label the document. The embedder learns morpho-semantic regularities, while the classifier connects lexical semantics to labels.

To allow cross-lingual transfer, we use a single model with shared character embeddings for both languages. We jointly train the embedder and the classifier on annotated source language documents. The embedder transfers knowledge about source language words to target language words with similar orthographic features.

While the model can be fairly accurate without any target language data, it can also benefit from a small amount of additional information when available. If we have a dictionary, pre-trained monolingual word embeddings, or parallel text, we can fine-tune the model with multi-task learning. We encourage the embedder to produce similar word embeddings for translation pairs from a dictionary, which captures patterns between cognates. We also teach the embedder to mimick pre-trained word embeddings in the source language (?), which exposes the model to more word types. When we have a good reference model in another high-resource language, we can train our model to make predictions similar to those of the reference model on parallel text (?).

We verify the effectiveness of character-level knowledge transfer on two cldc benchmarks. When we have enough data to learn high-quality clwe, training classifiers with clwe as input features is a strong cldc baseline. caco can match the accuracy of clwe-based models without using any target language data, and fine-tuning the embedder with a small amount of additional resources improves caco’s accuracy. Finally, caco is also useful when we have enough resources to train good clwe—using clwe as extra features, caco outperforms the baseline clwe-based models by a large margin.

2 caco: Classification Aided by Convergent Orthography

This section introduces our method, caco, which trains a multilingual document classifier using labeled datasets in a source language 𝒮 and applies the classifier to a low-resource target language 𝒯. We focus on the setting where 𝒮 and 𝒯 are related and have similar orthographic features.

2.1 Model Architecture

Let $\mathbf{x}$ be an input document with a sequence of words $\mathbf{x} = \mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n$, where each word $\mathbf{w}_i$ is a sequence of characters. Our model maps the document $\mathbf{x}$ to a distribution over possible labels $y$ in two steps (Figure 1). First, we generate a word embedding $\mathbf{v}_i$ for each input word $\mathbf{w}_i$ using a character-based embedder $e$:

$$ \mathbf{v}_i = e(\mathbf{w}_i). \tag{1} $$

We then feed the word embeddings to a word-based classifier $f$ to compute the distribution over labels $y$:

$$ p(y \mid \mathbf{x}) = f(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n). \tag{2} $$

We can use any sequence model for the embedder e and the classifier f. For our experiments, we use a bidirectional lstm (?, bi-lstm) embedder and a deep averaging network (?, dan) classifier.

bi-lstm Embedder.

bi-lstm is a powerful sequence model that captures complex non-local dependencies. Character-based bi-lstm embedders are used in many natural language processing tasks (???). To embed a word $\mathbf{w}$, we pass its character sequence to a left-to-right lstm and the reversed character sequence to a right-to-left lstm. We concatenate the final hidden states of the two lstms and apply a linear transformation:

$$ e(\mathbf{w}) = \mathbf{W}_e \left[ \overrightarrow{\text{lstm}}(\mathbf{w}); \overleftarrow{\text{lstm}}(\mathbf{w}) \right] + \mathbf{b}_e, \tag{3} $$

where the functions $\overrightarrow{\text{lstm}}$ and $\overleftarrow{\text{lstm}}$ compute the final hidden states of the left-to-right and right-to-left lstms, respectively.
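Equation 3 can be illustrated with a small standard-library sketch. It is not the authors' implementation: the class names `LSTM` and `CharEmbedder`, the random initialization, and the standard LSTM gate formulation are assumptions made for illustration, and the dimensions follow the values reported in the training details (ten-dimensional character embeddings, forty hidden states, forty-dimensional outputs), read here as forty hidden units per direction.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class LSTM:
    """Minimal LSTM (hypothetical helper) that returns only its final hidden state."""

    def __init__(self, rng, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: random_matrix(rng, hidden_dim, input_dim + hidden_dim) for g in "ifoc"}
        self.b = {g: [0.0] * hidden_dim for g in "ifoc"}

    def final_state(self, inputs):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in inputs:
            xh = x + h  # concatenate the input with the previous hidden state
            pre = {g: [a + b for a, b in zip(matvec(self.W[g], xh), self.b[g])] for g in "ifoc"}
            i = [sigmoid(a) for a in pre["i"]]
            f = [sigmoid(a) for a in pre["f"]]
            o = [sigmoid(a) for a in pre["o"]]
            cand = [math.tanh(a) for a in pre["c"]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h

class CharEmbedder:
    """Sketch of Equation 3: e(w) = W_e [fwd(w); bwd(w)] + b_e."""

    def __init__(self, alphabet, char_dim=10, hidden_dim=40, out_dim=40, seed=0):
        rng = random.Random(seed)
        # One embedding per character; unseen characters raise KeyError in this sketch.
        self.char_vecs = {ch: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)] for ch in alphabet}
        self.fwd = LSTM(rng, char_dim, hidden_dim)
        self.bwd = LSTM(rng, char_dim, hidden_dim)
        self.W_e = random_matrix(rng, out_dim, 2 * hidden_dim)
        self.b_e = [0.0] * out_dim

    def embed(self, word):
        chars = [self.char_vecs[ch] for ch in word]
        # Left-to-right pass over the characters, right-to-left pass over the reversal.
        states = self.fwd.final_state(chars) + self.bwd.final_state(chars[::-1])
        return [a + b for a, b in zip(matvec(self.W_e, states), self.b_e)]

embedder = CharEmbedder(alphabet="abcdefghijklmnopqrstuvwxyz")
v_es = embedder.embed("tiempo")  # Spanish
v_it = embedder.embed("tempo")   # Italian
```

The weights here are untrained, so the vectors carry no meaning yet; the sketch only shows how one set of character embeddings and two lstms turn any spelled word into a fixed-size vector.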

dan Classifier.

A dan is an unordered model that passes the arithmetic mean of the input word embeddings through a multilayer perceptron and feeds the final layer’s representation to a softmax layer. dan ignores cross-lingual variations in word order (i.e., syntax) and thus generalizes well in cldc. Despite its simplicity, dan has near state-of-the-art accuracies on both monolingual (?) and cross-lingual document classification (?).

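Let $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$ be the word embeddings generated by the character-based embedder. dan uses the average of the word embeddings as the document representation $\mathbf{z}_0$:

$$ \mathbf{z}_0 = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_i, \tag{4} $$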

and 𝐳0 is passed through k layers of non-linearity:

$$ \mathbf{z}_i = g(\mathbf{W}_i \mathbf{z}_{i-1} + \mathbf{b}_i), \tag{5} $$

where i ranges from 1 to k, and g is a non-linear activation function. The final representation 𝐳k is passed to a softmax layer to obtain a distribution over the label y,

$$ p(y \mid \mathbf{x}) = \mathrm{softmax}(\mathbf{W}_{k+1} \mathbf{z}_k + \mathbf{b}_{k+1}). \tag{6} $$
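The three dan steps (Equations 4 to 6) can be sketched as a forward pass in plain Python. The function names, the random weights, and the toy dimensions are assumptions for illustration; ReLU is used for g because the training details report ReLU layers.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def affine(W, b, z):
    # Compute W z + b for a matrix W given as a list of rows.
    return [sum(w * a for w, a in zip(row, z)) + bi for row, bi in zip(W, b)]

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dan_forward(word_vectors, hidden_layers, output_layer):
    n = len(word_vectors)
    dim = len(word_vectors[0])
    # Equation 4: z_0 is the arithmetic mean of the word vectors.
    z = [sum(v[j] for v in word_vectors) / n for j in range(dim)]
    # Equation 5: k layers of non-linearity.
    for W, b in hidden_layers:
        z = relu(affine(W, b, z))
    # Equation 6: softmax over the labels.
    W_out, b_out = output_layer
    return softmax(affine(W_out, b_out, z))

def random_layer(rng, n_out, n_in):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
hidden = [random_layer(rng, 5, 4), random_layer(rng, 5, 5)]  # k = 2 toy layers
output = random_layer(rng, 3, 5)                             # three toy labels
doc = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.0, 1.0, 0.0, 1.0]]
probs = dan_forward(doc, hidden, output)
```

Because the first step is an average, reversing or shuffling the word vectors leaves the prediction unchanged up to floating-point rounding, which is the word-order invariance the text relies on for cross-lingual transfer.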

We use the same classifier parameters 𝐖i across languages. In other words, the dan classifier is language-independent. This is possible because the embedder generates consistent word representations across related languages, which we discuss in the next section.

2.2 Character-Level Cross-Lingual Transfer

To transfer character-level information across languages, the embedder uses the same character embeddings for both languages. The character-level bi-lstm vocabulary is the union of the alphabets for the two languages, and the embedder does not differentiate identical characters from different languages. For example, a Spanish “a” has the same character embedding as a French “a”. Consequently, the embedder maps words with similar forms from both languages to similar vectors.

If the source language and the target language are orthographically similar, the embedder can generalize knowledge learned about source language words to target language words through shared orthographic features. As an example, if the model learns that the Spanish word “religioso” (religious) is predictive of label y, the model automatically infers that “religioso” in Italian is also predictive of y, even though the model never sees any Italian text.
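A minimal sketch of the shared character vocabulary follows. The helper names `build_char_vocab` and `encode` are hypothetical, and the word lists are just the examples mentioned in this paper.

```python
def build_char_vocab(*word_lists):
    """Map every character to a single id shared by all languages."""
    alphabet = sorted({ch for words in word_lists for word in words for ch in word})
    return {ch: idx for idx, ch in enumerate(alphabet)}

def encode(word, vocab):
    # The language of the word is never consulted: identical characters get identical ids.
    return [vocab[ch] for ch in word]

spanish = ["religioso", "tiempo", "concierto"]
italian = ["religioso", "tempo", "concerto"]
vocab = build_char_vocab(spanish, italian)

es_ids = encode("tiempo", vocab)
it_ids = encode("tempo", vocab)
```

Every character id in the Italian "tempo" also occurs in the Spanish "tiempo", so an embedder trained only on Spanish has already seen all the inputs it needs for the Italian word.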

In our experiments, we focus on related language pairs that share the same script. For related languages with different scripts, we can apply caco to the output of a transliteration tool or a grapheme-to-phoneme transducer (?). We leave this to future work.

2.3 Training Objective

Our main objective is supervised document classification. We jointly train the classifier and the embedder to minimize average negative log-likelihood on labeled source language documents S:

$$ L_s(\boldsymbol{\theta}) = -\frac{1}{|S|} \sum_{(\mathbf{x}, y) \in S} \log p(y \mid \mathbf{x}), \tag{7} $$

where 𝜽 is a vector representing all model parameters, and S is a set of source language examples with words 𝐱 and label y.
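As a worked instance of Equation 7, the sketch below computes the average negative log-likelihood from predicted label distributions. The function name and the toy predictions are assumptions; the label names are the four rcv2 topics.

```python
import math

def average_nll(predictions, labels):
    """predictions[j] maps each label to its predicted probability for document j."""
    return -sum(math.log(p[y]) for p, y in zip(predictions, labels)) / len(labels)

uniform = {"ccat": 0.25, "ecat": 0.25, "gcat": 0.25, "mcat": 0.25}
confident = {"ccat": 0.97, "ecat": 0.01, "gcat": 0.01, "mcat": 0.01}

loss_uniform = average_nll([uniform, uniform], ["ccat", "mcat"])
loss_confident = average_nll([confident, confident], ["ccat", "ccat"])
```

A model that spreads its probability evenly over four labels has a loss of log 4, and the loss falls toward zero as the model places more probability on the correct label.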

Sometimes we have additional resources for the source or target language. We use them to improve caco with multi-task learning (?) via three auxiliary tasks.

Word Translation (dict).

There are many patterns when translating cognate words between related languages. For example, Italian “e” often becomes “ie” in Spanish. “Tempo” (time) in Italian becomes “tiempo” in Spanish, and “concerto” (concert) in Italian becomes “concierto” in Spanish. The embedder can learn these word translation patterns from a bilingual dictionary.

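Let $D$ be a bilingual dictionary with a set of word pairs $(\mathbf{w}_s, \mathbf{w}_t)$, where $\mathbf{w}_s$ and $\mathbf{w}_t$ are translations of each other. We add a term to our objective to minimize average squared Euclidean distances between the embeddings of translation pairs (?):

$$ L_d(\boldsymbol{\theta}) = \frac{1}{|D|} \sum_{(\mathbf{w}_s, \mathbf{w}_t) \in D} \left\| e(\mathbf{w}_s) - e(\mathbf{w}_t) \right\|_2^2. \tag{8} $$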
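The dictionary term in Equation 8 can be sketched as follows. The embedder here is a deliberately simple stand-in (character counts over a tiny alphabet), not the bi-lstm embedder; `toy_embed` and `dictionary_loss` are hypothetical names.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dictionary_loss(embed, pairs):
    """Average squared Euclidean distance between embeddings of translation pairs."""
    return sum(squared_distance(embed(ws), embed(wt)) for ws, wt in pairs) / len(pairs)

def toy_embed(word, alphabet="ceimnoprt"):
    # Stand-in embedder: one count per character of a small alphabet.
    return [float(word.count(ch)) for ch in alphabet]

# Spanish-Italian cognate pairs taken from the examples in the text.
pairs = [("tiempo", "tempo"), ("concierto", "concerto")]
loss = dictionary_loss(toy_embed, pairs)
```

Each pair differs by a single "i", so each squared distance is 1 and the average loss is 1.0. A trainable embedder would reduce this value by learning that the "ie"/"e" alternation does not change meaning.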

Mimicking Word Embeddings (mim).

Monolingual text classifiers often benefit from initializing embeddings with word vectors pre-trained on a large unlabeled corpus (?). This semi-supervised learning strategy helps the model generalize to word types outside labeled training data. Similarly, our embedder can mimick (?) existing source language word embeddings to generalize better.

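Suppose we have a pre-trained source language word embedding matrix $\mathbf{E}$ with $V$ rows. The $i$-th row $\mathbf{E}_i$ is a vector for the $i$-th word type $\mathbf{w}_i$. We add an objective to minimize the average squared Euclidean distance between the output of the embedder and $\mathbf{E}$:

$$ L_e(\boldsymbol{\theta}) = \frac{1}{V} \sum_{i=1}^{V} \left\| e(\mathbf{w}_i) - \mathbf{E}_i \right\|_2^2. \tag{9} $$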

                              caco                      Baseline
                              src   dict  mim   all     clwe  sup   com
Source labeled data           ✓     ✓     ✓     ✓       ✓           ✓
Pre-trained source embedding              ✓     ✓
Small dictionary                    ✓           ✓
Pre-trained clwe                                        ✓           ✓
Target labeled data                                           ✓
rcv2 average accuracy         50.0  55.7  51.5  54.7    51.6  51.9  64.5

Table 1: Comparison of models used in our experiments (introduced in Section 3.2). For each model, we list its required resources and average accuracy on rcv2 over eight related language pairs (accuracy for each pair in Table 2). We compare caco variants with two high-resource models: a clwe-based model (clwe) and a lightly supervised target language model (sup). Both baselines require more target language resources than caco variants, and yet they have lower average accuracy than some caco variants, which confirms that character-level knowledge transfer is highly efficient. We also experiment with a model that combines clwe with caco (com). This combined model has the highest average accuracy, indicating that clwe and caco are complementary when both options are available.

Knowledge Distillation.

Sometimes we have a reliable reference classifier in another high-resource language $\mathcal{H}$ (e.g., English). If we have parallel text between $\mathcal{S}$ and $\mathcal{H}$, we can use knowledge distillation (?) to supply additional training signal. Let $P$ be a set of parallel documents $(\mathbf{x}_s, \mathbf{x}_h)$, where $\mathbf{x}_s$ is from the source language $\mathcal{S}$, and $\mathbf{x}_h$ is the translation of $\mathbf{x}_s$ in $\mathcal{H}$. We add another objective term to minimize the average Kullback-Leibler divergence between the predictions of our model and the reference model:

$$ L_p(\boldsymbol{\theta}) = \frac{1}{|P|} \sum_{(\mathbf{x}_s, \mathbf{x}_h) \in P} \mathrm{KL}\left( p_h(y \mid \mathbf{x}_h) \,\|\, p(y \mid \mathbf{x}_s) \right), \tag{10} $$

where $p_h$ is the output of the reference classifier (in language $\mathcal{H}$), and $p$ is the output of caco. In § 3, we mark models that use knowledge distillation with a superscript “p”.
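The distillation term in Equation 10 can be sketched with plain lists of probabilities. The function names and the toy distributions are assumptions; the model outputs are assumed to be strictly positive, as softmax outputs are.

```python
import math

def kl_divergence(p_ref, p_model):
    """KL(p_ref || p_model); terms with p_ref = 0 contribute nothing."""
    return sum(p * math.log(p / q) for p, q in zip(p_ref, p_model) if p > 0.0)

def distillation_loss(ref_outputs, model_outputs):
    """Average KL divergence over a set of parallel documents."""
    total = sum(kl_divergence(r, m) for r, m in zip(ref_outputs, model_outputs))
    return total / len(ref_outputs)

# Binary label distributions for two parallel documents (toy values).
reference = [[0.9, 0.1], [0.5, 0.5]]  # reference classifier on the translations
agreeing = [[0.9, 0.1], [0.5, 0.5]]   # a model that matches the reference exactly
differing = [[0.5, 0.5], [0.9, 0.1]]  # a model that disagrees on both documents

loss_agree = distillation_loss(reference, agreeing)
loss_differ = distillation_loss(reference, differing)
```

The loss is zero when the model reproduces the reference distribution on every parallel document and grows as the two distributions diverge. No gold labels are needed, only the reference model's predictions.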

We train on the four tasks jointly. Our final objective is:

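$$ L(\boldsymbol{\theta}) = L_s(\boldsymbol{\theta}) + \lambda_d L_d(\boldsymbol{\theta}) + \lambda_e L_e(\boldsymbol{\theta}) + \lambda_p L_p(\boldsymbol{\theta}), \tag{11} $$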

where the hyperparameters λd, λe, and λp trade off between the four tasks.
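A small sketch of how the four terms of Equation 11 combine. The default weights are the values reported in the training details (λd = 1, λe = 0.001, λp = 1); treating an unavailable resource as a zero loss term is an assumption of this sketch, as are the example loss values.

```python
def total_loss(l_s, l_d=0.0, l_e=0.0, l_p=0.0,
               lambda_d=1.0, lambda_e=0.001, lambda_p=1.0):
    """Weighted sum of the classification loss and the three auxiliary losses."""
    return l_s + lambda_d * l_d + lambda_e * l_e + lambda_p * l_p

# A model with labeled source documents only: just the classification term.
src_only = total_loss(l_s=1.2)

# A model with a dictionary and pre-trained embeddings but no parallel text.
with_dict_and_mimick = total_loss(l_s=1.2, l_d=0.5, l_e=100.0)
```

The small value of λe keeps the mimick term, which sums squared distances over forty dimensions, from dominating the classification loss.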

3 Experiments

When the source language and the target language are related, we expect character-level knowledge transfer to be more data-efficient than word-level knowledge transfer because character-level transfer allows generalization across words with similar forms. We test this by comparing caco models trained in low-resource settings with clwe-based models trained in high-resource settings on two cldc datasets. We also compare caco with a supervised monolingual model. On both datasets, caco models have average accuracy similar to that of the baselines while requiring much less target language data. Finally, we train models that combine caco with clwe, which have significantly higher accuracy than models with only clwe as features. These results confirm that character-level similarities between related languages effectively transfer knowledge for cldc.

3.1 Classification Dataset

Our first dataset is the Reuters multilingual corpus (rcv2), a collection of news stories labeled with four topics (?): Corporate/Industrial (ccat), Economics (ecat), Government/Social (gcat), and Markets (mcat). Following ? (?), we remove documents with multiple topic labels. For each language, we sample 1,500 training documents and 200 test documents with balanced labels. We conduct cldc experiments between two North Germanic languages, Danish (da) and Swedish (sv), and three Romance languages, French (fr), Italian (it), and Spanish (es).

To test caco on truly low-resource languages, we build a second cldc dataset with famine-related documents sampled from Tigrinya (ti) and Amharic (am) lorelei language packs (?). We train binary classifiers to detect whether a document describes widespread crime or not. For Tigrinya documents, the labels are extracted from the situation frame annotation in the language pack. We mark all documents with a “widespread crime/violence” situation frame as positive. The Amharic language pack does not have annotations, so we label Amharic sentences based on English reference translations included in the language pack. Our dataset contains 394 Tigrinya and 370 Amharic documents with balanced labels.

source  target  src   dict  mim   all   clwe  sup   com
da      sv      56.0  62.8  60.4  62.9  69.3  59.7  69.7
sv      da      56.7  60.2  58.4  62.2  51.4  40.8  67.5
fr      es      49.6  59.3  48.3  57.4  63.9  56.6  70.8
it      es      50.2  54.6  51.4  54.7  43.4  56.6  63.5
es      fr      48.5  49.7  49.2  48.8  63.1  48.9  61.3
it      fr      45.9  52.1  46.6  48.2  26.7  48.9  62.8
fr      it      43.3  53.2  44.3  51.2  43.6  44.9  60.2
es      it      49.7  53.5  53.4  52.5  51.3  44.9  59.7
average         50.0  55.7  51.5  54.7  51.6  51.9  64.5

Table 2: cldc experiments between eight related European language pairs on rcv2 topic identification. The average accuracy of caco models is competitive with word-based models that use more resources such as target language corpora or labeled data (Table 1). The combined model (com) has the highest average test accuracy.

source  target  src   mim   src^p  mim^p  clwe  com
am      ti      55.5  56.3  57.0   57.6   59.1  60.1
ti      am      56.8  55.1  *      *      58.1  59.5

Table 3: cldc experiments between Amharic and Tigrinya on lorelei disaster response dataset. caco models are only slightly worse than clwe-based models without using any target language data. For am-ti, knowledge distillation (src^p and mim^p) further improves caco models. We do not experiment with knowledge distillation on ti because we cannot find enough unlabeled parallel text in the language pack. Combining caco with pre-trained clwe gives the highest test accuracy.

3.2 Models

We compare caco trained under low-resource settings with word-based models that use more resources. Table 1 summarizes our models.

caco Variants.

We experiment with several variants of caco that use different resources. The src model uses the fewest resources. It is trained only on labeled source language documents and does not use any unlabeled data. The dict model requires a dictionary and is trained with the word translation auxiliary task. The mim model requires a pre-trained source language embedding and uses the mimick auxiliary task. The all model is the most expensive variant. It is trained with both the word translation and the mimick auxiliary tasks. In lorelei experiments, we also use knowledge distillation to provide more classification signals for some models. We mark these models with a superscript “p”.

clwe-Based Model.

Our first word-based model is a dan with pre-trained multiCCA clwe features (?). The clwe are trained on large target language corpora with millions of tokens and high-coverage dictionaries with hundreds of thousands of word types. In contrast, we train caco models in a simulated low-resource setting with little or no target language data. Despite the resource gap, caco models have average test accuracy similar to that of clwe-based models, demonstrating the effectiveness of character-level transfer learning.

Supervised Model.

Next, we compare caco with a lightly-supervised monolingual model (sup), a word-based dan trained on fifty labeled target language documents. We only apply this baseline to rcv2, because the labeled document sets in lorelei are too small to split further. The supervised model requires labeled target language documents, which often do not exist in low-resource languages. Without using any target language supervision, caco models have test accuracies similar to (and sometimes higher than) those of sup, showing that caco effectively learns from a related language.

Combined Model.

Finally, we experiment with a model that combines caco and clwe (com) by feeding pre-trained clwe as additional features for the classifier of a caco model (src variant). This model requires the same resources as the clwe-based model. The combined model on average has much higher accuracy than both the caco variants and the clwe-based model, showing that character-level knowledge transfer is useful even when we have enough unlabeled data to train high-quality clwe.

(a) North Germanic to Romance

source  target  src   dict  mim   all   clwe
da      es      32.5  34.8  30.6  38.2  65.7
da      fr      34.1  41.8  35.5  43.3  45.9
da      it      36.8  43.7  37.2  41.5  47.4
sv      es      35.2  42.5  34.6  46.8  48.5
sv      fr      27.4  29.9  29.1  28.3  49.0
sv      it      34.6  36.4  33.3  35.2  40.4
average         33.4  38.2  33.4  37.2  49.5

(b) Romance to North Germanic

source  target  src   dict  mim   all   clwe
es      da      47.7  48.3  46.1  52.0  56.7
es      sv      50.6  53.7  48.5  51.4  52.4
fr      da      46.7  44.2  44.7  48.6  45.3
fr      sv      52.9  53.2  53.6  52.8  57.2
it      da      36.6  43.6  34.8  43.0  48.2
it      sv      37.8  45.3  30.7  43.9  31.1
average         45.4  48.1  43.1  48.6  48.5

Table 4: cldc experiments between languages from different families on rcv2. When transferring from a North Germanic language to a Romance language, caco models score much lower than clwe-based models (a). Surprisingly, caco models are on par with clwe-based models when transferring from a Romance language to a North Germanic language (b).

3.3 Auxiliary Task Data

Some of the caco models (dict and all) use a dictionary to learn word translation patterns. We train them on the same training dictionary used for pre-training the clwe. To simulate the low-resource setting, we sample only 100 translation pairs from the original dictionary for caco. Pilot experiments confirm that a larger dictionary can help, but we focus on the low-resource setting where only a small dictionary is available.

The Amharic labeled dataset is very small compared to other languages because each Amharic example only contains one sentence. As introduced in Section 2.3, one way to provide additional training signal is by knowledge distillation from a third high-resource language. For the Amharic to Tigrinya cldc experiment, we apply knowledge distillation using English-Amharic parallel text. We first train a reference English dan on a large collection of labeled English documents compiled from other lorelei language packs. We then use the knowledge distillation objective to train the caco models to match the output of the English model on 1,200 English-Amharic parallel documents sampled from the Amharic language pack. To avoid introducing extra label bias, we sample the parallel documents such that the English model output approximately follows a uniform distribution.

We do not use knowledge distillation on other language pairs. For rcv2, we already have enough labeled examples and therefore do not need knowledge distillation. For the Tigrinya to Amharic cldc experiment, we do not have enough unlabeled parallel text in the Tigrinya language pack to apply knowledge distillation.

3.4 Training Details

For clwe-based models, we use forty-dimensional multiCCA word embeddings (?). We use three ReLU layers with 100 hidden units and 0.1 dropout for the clwe-based dan models and the dan classifier of the caco models. The bi-lstm embedder uses ten-dimensional character embeddings and forty hidden states with no dropout. The outputs of the embedder are forty-dimensional word embeddings. We set λd to 1, λe to 0.001, and λp to 1 in the multi-task objective (Equation 11). The hyperparameters are tuned in a pilot Italian-Spanish cldc experiment using held-out datasets.

All models are trained with Adam (?) with default settings. We run the optimizer for a hundred epochs with mini-batches of sixteen documents. For models that use additional resources, we also sample sixteen examples from each type of training data (translation pairs, pre-trained embeddings, or parallel text) to estimate the gradients of the auxiliary task objectives Ld, Le, and Lp (defined in Section 2.3) at each iteration.
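The batching scheme described above can be sketched as a generator. This is an illustration under stated assumptions: the function name, the dictionary-of-lists layout for auxiliary data, and sampling auxiliary examples without replacement within a batch are not specified in the text, and the optimizer step itself is omitted.

```python
import random

def training_batches(documents, auxiliary, batch_size=16, seed=0):
    """Yield (document_batch, auxiliary_batches) pairs for one epoch."""
    rng = random.Random(seed)
    order = list(documents)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        doc_batch = order[start:start + batch_size]
        # Draw a fresh sample from every auxiliary resource at each iteration.
        aux_batches = {
            name: rng.sample(examples, min(batch_size, len(examples)))
            for name, examples in auxiliary.items()
        }
        yield doc_batch, aux_batches

documents = list(range(1500))  # stand-ins for 1,500 labeled source documents
auxiliary = {"dict": [("w%d" % i, "t%d" % i) for i in range(100)]}  # 100 translation pairs
batches = list(training_batches(documents, auxiliary))
```

With 1,500 documents and batches of sixteen, one epoch has 94 iterations, the last of which holds the remaining twelve documents. Each iteration pairs its documents with sixteen translation pairs drawn from the 100-pair dictionary.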

3.5 Effectiveness of caco

We train each model using ten different random seeds and report their average test accuracy. For models that use dictionaries, we also re-sample the training dictionary for each run. Table 1 compares the resource requirements and average rcv2 accuracy of caco and baselines. Tables 2 and 3 show test accuracies on nine related language pairs from rcv2 and lorelei.

Character-Level Knowledge Transfer.

Experiments confirm that character-level knowledge transfer is sample-efficient and complementary to word-level knowledge transfer. The low-resource character-based caco models have average test accuracy similar to that of the high-resource word-based models. The src variant does not use any target language data, and yet its average test accuracy on rcv2 (50.0%) is very close to the clwe model (51.6%) and the supervised model sup (51.9%). When we already have a good clwe, we can get the best of both worlds by combining them (com), which has a much higher average test accuracy (64.5%) than caco and the two baselines.

Multi-Task Learning.

Training caco with multi-task learning further improves the accuracy. For almost all language pairs, the multi-task caco variants have higher test accuracies than src. On rcv2, word translation (dict) is particularly effective even with only 100 translation pairs. It increases average test accuracy from 50.0% to 55.7%, outperforming both word-based baseline models. Interestingly, word translation and mimick tasks together (all) do not consistently increase the accuracy over only using the dictionary (dict). On the lorelei dataset, where labeled documents are limited, knowledge distillation (src^p and mim^p) also increases accuracies by around 1.5 percentage points.

Language Relatedness.

We expect character-level knowledge transfer to be less effective when the source language and the target language are less closely related. For comparison, we experiment on rcv2 with transferring between more distantly related language pairs: a North Germanic language and a Romance language (Table 4). Indeed, caco models score consistently lower than the clwe-based models when transferring from a North Germanic source language to a Romance target language. However, caco models are surprisingly competitive with clwe-based models when transferring in the opposite direction. This asymmetry is likely due to morphological differences between the two language families. Unfortunately, our datasets only have a limited number of language families. We leave a more systematic study of how language proximity affects the effectiveness of caco to future work.

Multi-Source Transfer.

Languages can be similar along different dimensions, and therefore adding more source languages may be beneficial. On rcv2, we experiment with training caco models on two Romance languages and testing on a third Romance language. Moreover, using multiple source languages has a regularization effect and prevents the model from overfitting to a single source language. For fair comparison, we sample 750 training documents from each source language, so that the multi-source models are still trained on 1,500 training documents (like the single-source models). We use a similar strategy to sample the training dictionaries and pre-trained word embeddings. Multi-source models (Table 5) consistently have higher accuracies than single-source models (Table 2).

Learned Word Representation.

Word translation is a popular intrinsic evaluation task for cross-lingual word representations. Therefore, we evaluate the word representations learned by the bi-lstm embedder on a word translation benchmark. Specifically, we use the src embedder to generate embeddings for all French, Italian, and Spanish words that appear in multiCCA’s vocabulary and translate each word with nearest-neighbor search. Table 6 shows the top-1 word translation accuracy on the test dictionaries from muse (?). Although the src embedder is not exposed to any cross-lingual signal, it rivals clwe on the word translation task by exploiting character-level similarities between languages.
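The evaluation protocol can be sketched as nearest-neighbor search followed by a top-1 accuracy count. Cosine similarity is an assumption (the text only says nearest-neighbor search), and the vectors and dictionary below are toy values, not results from the experiments.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def translate(word, source_vecs, target_vecs):
    """Return the target word whose vector is closest to the source word's vector."""
    query = source_vecs[word]
    return max(target_vecs, key=lambda t: cosine(query, target_vecs[t]))

def precision_at_1(test_dictionary, source_vecs, target_vecs):
    """Fraction of source words whose nearest neighbor is a gold translation."""
    hits = sum(translate(s, source_vecs, target_vecs) in gold
               for s, gold in test_dictionary.items())
    return hits / len(test_dictionary)

# Toy two-dimensional embeddings for Spanish (source) and Italian (target) words.
source_vecs = {"tiempo": [1.0, 0.1], "concierto": [0.1, 1.0]}
target_vecs = {"tempo": [0.9, 0.2], "concerto": [0.2, 0.9], "interesse": [0.7, 0.7]}
test_dictionary = {"tiempo": {"tempo"}, "concierto": {"concerto"}}

score = precision_at_1(test_dictionary, source_vecs, target_vecs)
```

Each gold entry is a set so that a source word with several acceptable translations counts as correct when any of them is retrieved.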

Qualitative Analysis.

To understand how cross-lingual character-level similarity helps classification, we manually compare the output of a clwe-based model and a caco model (dict variant) from the Spanish to Italian cldc experiment. Sometimes caco avoids the mistakes of clwe-based models by correctly aligning word pairs that are misaligned in the pre-trained clwe. For example, in the clwe, “relevancia” (relevance) is the closest Spanish word for the Italian word “interesse” (interest), while the caco embedder maps both the Italian word “interesse” (interest) and the Spanish word “interesse” (interest) to the same point. Consequently, caco correctly classifies an Italian document about the interest rate with gcat (government), while the clwe-based model predicts mcat (market).

4 Related Work

Previous cldc methods are typically word-based and rely on one of the following cross-lingual signals to transfer knowledge: large bilingual lexicons (??), mt systems (???), or clwe (?). One exception is the recently proposed multilingual BERT (??), which uses a subword vocabulary. Unfortunately, some languages do not have these resources. caco can help bridge the resource gap. By exploiting character-level similarities between related languages, caco can work effectively with little or no target language data.

To adapt clwe to low-resource settings, recent unsupervised clwe methods (??) do not use a dictionary or parallel text. These methods can be further improved with careful normalization (?) and interactive refinement (?). However, unsupervised clwe methods still require large monolingual corpora in the target language, and they might fail when the monolingual corpora of the two languages come from different domains (??) and when the two languages have different morphology (?). In contrast, caco does not require any target language data.

Cross-lingual transfer at the character level is successfully used in low-resource paradigm completion (?), morphological tagging (?), part-of-speech tagging (?), and named entity recognition (????), where the authors train a character-level model jointly on a small labeled corpus in the target language and a large labeled corpus in the source language. Our method is similar in spirit, but we focus on cldc, where it is less obvious whether orthographic features are helpful. Moreover, we introduce a novel multi-task objective to use different types of monolingual and cross-lingual resources.

source   target   src    dict   mim    all
fr/it    es       58.8   67.0   55.8   65.3
es/it    fr       51.8   55.8   50.3   56.0
es/fr    it       53.2   56.1   55.9   56.5
average           54.6   59.6   54.0   59.3
Table 5: Results of cldc experiments using two source languages. Models trained on two source languages are generally better than models trained on only one source language (Table 2).

source   target   clwe   caco
es       fr       36.8   31.1
es       it       44.0   33.1
fr       es       34.0   30.9
fr       it       33.5   29.6
it       es       42.1   37.5
it       fr       35.6   36.4
average           37.7   33.1
Table 6: Word translation accuracies (P@1) for different embeddings. The caco embeddings are generated by the embedder of a src model trained on the source language. Without any cross-lingual signal, the caco embedder achieves word translation accuracy competitive with clwe pre-trained on large target language corpora and dictionaries.
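
The word translation accuracy in Table 6 is precision at 1: the fraction of source words whose top-ranked target word is a correct translation. A minimal sketch of that computation follows. The retrieval method behind Table 6 is not specified in this section, so cosine nearest-neighbor retrieval, the function name precision_at_1, and the toy vectors are all assumptions for illustration.

```python
import numpy as np


def precision_at_1(src_vecs, tgt_vecs, gold):
    """Fraction of query words whose nearest target word is a gold translation.

    gold maps a source row index to the set of acceptable target row indices.
    Nearest neighbors are found by cosine similarity (an assumption).
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    queries = sorted(gold)
    sims = src[queries] @ tgt.T        # shape: (num_queries, num_targets)
    nearest = sims.argmax(axis=1)      # top-1 target index per query
    hits = sum(int(n) in gold[q] for q, n in zip(queries, nearest))
    return hits / len(queries)


# Toy embeddings: three source words and three target words in 2-D.
src_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]])
tgt_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
gold = {0: {0}, 1: {1}, 2: {2}}       # third pair is deliberately misaligned
score = precision_at_1(src_vecs, tgt_vecs, gold)
print(f"P@1 = {score:.3f}")
```

In the toy data the first two source words retrieve their gold translations and the third does not, so two of three queries count as hits.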

5 Conclusion

We investigate character-level knowledge transfer between related languages for cldc. Our transfer learning scheme, caco, exploits character-level similarities between related languages through shared character representations to generalize from source language data. Empirical evaluation on multiple related language pairs confirms that character-level knowledge transfer is highly effective.


Acknowledgments

We thank the members of UMD CLIP and the anonymous reviewers for their feedback. Zhang and Boyd-Graber are supported by DARPA award HR0011-15-C-0113 under subcontract to Raytheon BBN Technologies. Fujinuma and Boyd-Graber are supported by NSF grant IIS-1564275. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.


References

  • [Ammar et al. 2016] Ammar, W.; Mulcaire, G.; Tsvetkov, Y.; Lample, G.; Dyer, C.; and Smith, N. A. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
  • [Andrade et al. 2015] Andrade, D.; Sadamasa, K.; Tamura, A.; and Tsuchida, M. 2015. Cross-lingual text classification using topic-dependent word probabilities. In NAACL.
  • [Artetxe, Labaka, and Agirre 2018] Artetxe, M.; Labaka, G.; and Agirre, E. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In ACL.
  • [Ballesteros, Dyer, and Smith 2015] Ballesteros, M.; Dyer, C.; and Smith, N. A. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP.
  • [Banea et al. 2008] Banea, C.; Mihalcea, R.; Wiebe, J.; and Hassan, S. 2008. Multilingual subjectivity analysis using machine translation. In EMNLP.
  • [Bharadwaj et al. 2016] Bharadwaj, A.; Mortensen, D. R.; Dyer, C.; and Carbonell, J. G. 2016. Phonologically aware neural model for named entity recognition in low resource transfer settings. In EMNLP.
  • [Chen et al. 2018] Chen, X.; Sun, Y.; Athiwaratkun, B.; Cardie, C.; and Weinberger, K. 2018. Adversarial deep averaging networks for cross-lingual sentiment classification. TACL 6:557–570.
  • [Collobert et al. 2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural language processing (almost) from scratch. JMLR 12:2493–2537.
  • [Conneau et al. 2018] Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2018. Word translation without parallel data. In ICLR.
  • [Cotterell and Duh 2017] Cotterell, R., and Duh, K. 2017. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In IJCNLP.
  • [Cotterell and Heigold 2017] Cotterell, R., and Heigold, G. 2017. Cross-lingual character-level neural morphological tagging. In EMNLP.
  • [Czarnowska et al. 2019] Czarnowska, P.; Ruder, S.; Grave, E.; Cotterell, R.; and Copestake, A. 2019. Don’t forget the long tail! A comprehensive analysis of morphological generalization in bilingual lexicon induction. In EMNLP.
  • [Devlin et al. 2019] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • [Fujinuma, Boyd-Graber, and Paul 2019] Fujinuma, Y.; Boyd-Graber, J.; and Paul, M. J. 2019. A resource-free evaluation metric for cross-lingual word embeddings based on graph modularity. In ACL.
  • [Graves and Schmidhuber 2005] Graves, A., and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.
  • [Iyyer et al. 2015] Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; and Daumé III, H. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL.
  • [Kann, Cotterell, and Schütze 2017] Kann, K.; Cotterell, R.; and Schütze, H. 2017. One-shot neural cross-lingual transfer for paradigm completion. In ACL.
  • [Kim et al. 2017] Kim, J.-K.; Kim, Y.-B.; Sarikaya, R.; and Fosler-Lussier, E. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In EMNLP.
  • [Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • [Klementiev, Titov, and Bhattarai 2012] Klementiev, A.; Titov, I.; and Bhattarai, B. 2012. Inducing crosslingual distributed representations of words. In COLING.
  • [Lample et al. 2016] Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In NAACL.
  • [Lewis et al. 2004] Lewis, D. D.; Yang, Y.; Rose, T. G.; and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. JMLR 5(Apr):361–397.
  • [Lin et al. 2018] Lin, Y.; Yang, S.; Stoyanov, V.; and Ji, H. 2018. A multi-lingual multi-task architecture for low-resource sequence labeling. In ACL.
  • [Ling et al. 2015] Ling, W.; Dyer, C.; Black, A. W.; Trancoso, I.; Fermandez, R.; Amir, S.; Marujo, L.; and Luís, T. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.
  • [Mikolov, Le, and Sutskever 2013] Mikolov, T.; Le, Q. V.; and Sutskever, I. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • [Mimno et al. 2009] Mimno, D.; Wallach, H.; Naradowsky, J.; Smith, D.; and McCallum, A. 2009. Polylingual topic models. In EMNLP.
  • [Mortensen, Dalmia, and Littell 2018] Mortensen, D. R.; Dalmia, S.; and Littell, P. 2018. Epitran: Precision G2P for many languages. In LREC.
  • [Pinter, Guthrie, and Eisenstein 2017] Pinter, Y.; Guthrie, R.; and Eisenstein, J. 2017. Mimicking word embeddings using subword RNNs. In EMNLP.
  • [Rijhwani et al. 2019] Rijhwani, S.; Xie, J.; Neubig, G.; and Carbonell, J. G. 2019. Zero-shot neural transfer for cross-lingual entity linking. In AAAI.
  • [Shi, Mihalcea, and Tian 2010] Shi, L.; Mihalcea, R.; and Tian, M. 2010. Cross language text classification by model translation and semi-supervised learning. In EMNLP.
  • [Søgaard, Ruder, and Vulić 2018] Søgaard, A.; Ruder, S.; and Vulić, I. 2018. On the limitations of unsupervised bilingual dictionary induction. In ACL.
  • [Strassel and Tracey 2016] Strassel, S., and Tracey, J. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In LREC.
  • [Wan 2009] Wan, X. 2009. Co-training for cross-lingual sentiment classification. In ACL.
  • [Wu and Dredze 2019] Wu, S., and Dredze, M. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In EMNLP.
  • [Xu and Yang 2017] Xu, R., and Yang, Y. 2017. Cross-lingual distillation for text classification. In ACL.
  • [Yuan et al. 2019] Yuan, M.; Zhang, M.; Van Durme, B.; Findlater, L.; and Boyd-Graber, J. 2019. Interactive refinement of cross-lingual word embeddings. arXiv preprint arXiv:1911.03070.
  • [Yuan, Van Durme, and Boyd-Graber 2018] Yuan, M.; Van Durme, B.; and Boyd-Graber, J. 2018. Multilingual anchoring: Interactive topic modeling and alignment across languages. In NeurIPS.
  • [Zhang et al. 2019] Zhang, M.; Xu, K.; Kawarabayashi, K.; Jegelka, S.; and Boyd-Graber, J. 2019. Are girls neko or shōjo? Cross-lingual alignment of non-isomorphic embeddings with Iterative Normalization. In ACL.
  • [Zhou, Wan, and Xiao 2016] Zhou, X.; Wan, X.; and Xiao, J. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In ACL.