Neural Cross-Lingual Relation Extraction Based on Bilingual Word Embedding Mapping

  • 2019-10-31 19:30:54
  • Jian Ni, Radu Florian

Abstract

Relation extraction (RE) seeks to detect and classify semantic relationships between entities, which provides useful information for many NLP applications. Since the state-of-the-art RE models require large amounts of manually annotated data and language-specific resources to achieve high accuracy, it is very challenging to transfer an RE model of a resource-rich language to a resource-poor language. In this paper, we propose a new approach for cross-lingual RE model transfer based on bilingual word embedding mapping. It projects word embeddings from a target language to a source language, so that a well-trained source-language neural network RE model can be directly applied to the target language. Experiment results show that the proposed approach achieves very good performance for a number of target languages on both in-house and open datasets, using a small bilingual dictionary with only 1K word pairs.

 

Jian Ni    Radu Florian
IBM Research AI
1101 Kitchawan Road, Yorktown Heights, NY 10598, USA
{nij, raduf}@us.ibm.com

1 Introduction

Relation extraction (RE) is an important information extraction task that seeks to detect and classify semantic relationships between entities like persons, organizations, geo-political entities, locations, and events. It provides useful information for many NLP applications such as knowledge base construction, text mining and question answering. For example, the entity Washington, D.C. and the entity United States have a CapitalOf relationship, and extraction of such relationships can help answer questions like “What is the capital city of the United States?”

Traditional RE models (e.g., Zelenko et al. (2003); Kambhatla (2004); Li and Ji (2014)) require careful feature engineering to derive and combine various lexical, syntactic and semantic features. Recently, neural network RE models (e.g., Zeng et al. (2014); dos Santos et al. (2015); Miwa and Bansal (2016); Nguyen and Grishman (2016)) have become very successful. These models employ a certain level of automatic feature learning by using word embeddings, which significantly simplifies the feature engineering task while considerably improving the accuracy, achieving the state-of-the-art performance for relation extraction.

All the above RE models are supervised machine learning models that need to be trained with large amounts of manually annotated RE data to achieve high accuracy. However, manual annotation of RE data is expensive and time-consuming, and can be quite difficult for a new language. Moreover, most RE models require language-specific resources such as dependency parsers and part-of-speech (POS) taggers, which also makes it very challenging to transfer an RE model of a resource-rich language to a resource-poor language.

There are a few existing weakly supervised cross-lingual RE approaches that require no human annotation in the target languages, e.g., Kim et al. (2010); Kim and Lee (2012); Faruqui and Kumar (2015); Zou et al. (2018). However, the existing approaches require aligned parallel corpora or machine translation systems, which may not be readily available in practice.

In this paper, we make the following contributions to cross-lingual RE:

  • We propose a new approach for direct cross-lingual RE model transfer based on bilingual word embedding mapping. It projects word embeddings from a target language to a source language (e.g., English), so that a well-trained source-language RE model can be directly applied to the target language, with no manually annotated RE data needed for the target language.

  • We design a deep neural network architecture for the source-language (English) RE model that uses word embeddings and generic language-independent features as the input. The English RE model achieves state-of-the-art performance without using language-specific resources.

  • We conduct extensive experiments which show that the proposed approach achieves very good performance (up to 79% of the accuracy of the supervised target-language RE model) for a number of target languages on both in-house and the ACE05 datasets (Walker et al., 2006), using a small bilingual dictionary with only 1K word pairs. To the best of our knowledge, this is the first work that includes empirical studies for cross-lingual RE on several languages across a variety of language families, without using aligned parallel corpora or machine translation systems.

We organize the paper as follows. In Section 2 we provide an overview of our approach. In Section 3 we describe how to build monolingual word embeddings and learn a linear mapping between two languages. In Section 4 we present a neural network architecture for the source-language (English) RE model. In Section 5 we evaluate the performance of the proposed approach for a number of target languages. We discuss related work in Section 6 and conclude the paper in Section 7.

Figure 1: Neural cross-lingual relation extraction based on bilingual word embedding mapping - target language: Portuguese, source language: English.

2 Overview of the Approach

We summarize the main steps of our neural cross-lingual RE model transfer approach as follows.

  1. Build word embeddings for the source language and the target language separately using monolingual data.

  2. Learn a linear mapping that projects the target-language word embeddings into the source-language embedding space using a small bilingual dictionary.

  3. Build a neural network source-language RE model that uses word embeddings and generic language-independent features as the input.

  4. For a target-language sentence and any two entities in it, project the word embeddings of the words in the sentence to the source-language word embeddings using the linear mapping, and then apply the source-language RE model on the projected word embeddings to classify the relationship between the two entities. An example is shown in Figure 1, where the target language is Portuguese and the source language is English.

We will describe each component of our approach in the subsequent sections.

3 Cross-Lingual Word Embeddings

In recent years, vector representations of words, known as word embeddings, have become ubiquitous in many NLP applications (Collobert et al., 2011; Mikolov et al., 2013a; Pennington et al., 2014).

A monolingual word embedding model maps words in the vocabulary $\mathcal{V}$ of a language to real-valued vectors in $\mathbb{R}^{d\times 1}$. The dimension of the vector space $d$ is normally much smaller than the size of the vocabulary $V=|\mathcal{V}|$ for efficient representation. It also aims to capture semantic similarities between the words based on their distributional properties in large samples of monolingual data.

Cross-lingual word embedding models try to build word embeddings across multiple languages (Upadhyay et al., 2016; Ruder et al., 2017). One approach builds monolingual word embeddings separately and then maps them to the same vector space using a bilingual dictionary (Mikolov et al., 2013b; Faruqui and Dyer, 2014). Another approach builds multilingual word embeddings in a shared vector space simultaneously, by generating mixed language corpora using aligned sentences (Luong et al., 2015; Gouws et al., 2015).

In this paper, we adopt the technique in (Mikolov et al., 2013b) because it only requires a small bilingual dictionary of aligned word pairs, and does not require parallel corpora of aligned sentences which could be more difficult to obtain.

3.1 Monolingual Word Embeddings

To build monolingual word embeddings for the source and target languages, we use a variant of the Continuous Bag-of-Words (CBOW) word2vec model (Mikolov et al., 2013a).

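The standard CBOW model has two matrices, the input word matrix $\tilde{\mathbf{X}}\in\mathbb{R}^{d\times V}$ and the output word matrix $\mathbf{X}\in\mathbb{R}^{d\times V}$. For the $i$th word $w_i$ in $\mathcal{V}$, let $\mathbf{e}(w_i)\in\mathbb{R}^{V\times 1}$ be a one-hot vector with 1 at index $i$ and 0s at other indexes, so that $\tilde{\mathbf{x}}_i=\tilde{\mathbf{X}}\mathbf{e}(w_i)$ (the $i$th column of $\tilde{\mathbf{X}}$) is the input vector representation of word $w_i$, and $\mathbf{x}_i=\mathbf{X}\mathbf{e}(w_i)$ (the $i$th column of $\mathbf{X}$) is the output vector representation (i.e., word embedding) of word $w_i$.

Given a sequence of training words $w_1,w_2,\ldots,w_N$, the CBOW model seeks to predict a target word $w_t$ using a window of $2c$ context words surrounding $w_t$, by maximizing the following objective function:

$$\mathcal{L}=\frac{1}{N}\sum_{t=1}^{N}\log P(w_t\mid w_{t-c},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+c})$$

The conditional probability is calculated using a softmax function:

$$P(w_t\mid w_{t-c},\ldots,w_{t+c})=\frac{\exp(\mathbf{x}_t^{T}\tilde{\mathbf{x}}_{c(t)})}{\sum_{i=1}^{V}\exp(\mathbf{x}_i^{T}\tilde{\mathbf{x}}_{c(t)})}\tag{1}$$

where $\mathbf{x}_t=\mathbf{X}\mathbf{e}(w_t)$ is the output vector representation of word $w_t$, and

$$\tilde{\mathbf{x}}_{c(t)}=\sum_{-c\le j\le c,\,j\ne 0}\tilde{\mathbf{X}}\mathbf{e}(w_{t+j})\tag{2}$$

is the sum of the input vector representations of the context words.

In our variant of the CBOW model, we use a separate input word matrix $\tilde{\mathbf{X}}_j$ for a context word at position $j$, $-c\le j\le c$, $j\ne 0$. In addition, we employ weights that decay with the distances of the context words to the target word. Under these modifications, we have

$$\tilde{\mathbf{x}}_{c(t)}^{new}=\sum_{-c\le j\le c,\,j\ne 0}\frac{1}{|j|}\tilde{\mathbf{X}}_j\mathbf{e}(w_{t+j})\tag{3}$$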

We use the variant to build monolingual word embeddings because experiments on named entity recognition and word similarity tasks showed this variant leads to small improvements over the standard CBOW model (Ni et al., 2017).
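The difference between the context vectors in (2) and (3) can be illustrated with a minimal NumPy sketch. The function names, the toy dimensions, and the choice to skip offsets that fall outside the sentence are assumptions made for illustration; this is not the training code used in the paper.

```python
import numpy as np

def context_vector_standard(X_in, context_ids):
    """Eq. (2): sum of the input vectors of the context words (one shared matrix)."""
    return X_in[:, context_ids].sum(axis=1)

def context_vector_variant(X_in_by_pos, words, t, c):
    """Eq. (3): position-specific input matrices with 1/|j| distance weights.

    X_in_by_pos maps each nonzero offset j in [-c, c] to its own d x V matrix.
    Offsets that fall outside the sentence are skipped (an assumption).
    """
    d = next(iter(X_in_by_pos.values())).shape[0]
    ctx = np.zeros(d)
    for j in range(-c, c + 1):
        if j == 0 or not (0 <= t + j < len(words)):
            continue
        ctx += X_in_by_pos[j][:, words[t + j]] / abs(j)
    return ctx

def cbow_probability(X_out, ctx, target_id):
    """Eq. (1): softmax over all output vectors, computed in a numerically stable way."""
    scores = X_out.T @ ctx
    scores = scores - scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[target_id]

rng = np.random.default_rng(0)
d, V, c = 4, 10, 2                      # toy sizes, not the paper's settings
X_out = rng.normal(size=(d, V))
X_in_by_pos = {j: rng.normal(size=(d, V)) for j in range(-c, c + 1) if j != 0}
words = [3, 1, 4, 1, 5]                 # word ids of a toy sentence
ctx = context_vector_variant(X_in_by_pos, words, t=2, c=c)
p = cbow_probability(X_out, ctx, target_id=words[2])
```

In this sketch the immediate neighbours of the target word contribute with weight 1 and the words two positions away with weight 1/2, which is the decay described above.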

3.2 Bilingual Word Embedding Mapping

Mikolov et al. (2013b) observed that word embeddings of different languages often have similar geometric arrangements, and suggested learning a linear mapping between the vector spaces.

Let $\mathcal{D}$ be a bilingual dictionary with aligned word pairs $(w_i,v_i)_{i=1,\ldots,D}$ between a source language $s$ and a target language $t$, where $w_i$ is a source-language word and $v_i$ is the translation of $w_i$ in the target language. Let $\mathbf{x}_i\in\mathbb{R}^{d\times 1}$ be the word embedding of the source-language word $w_i$, and $\mathbf{y}_i\in\mathbb{R}^{d\times 1}$ be the word embedding of the target-language word $v_i$.

We find a linear mapping (matrix) $\mathbf{M}_{t\to s}$ such that $\mathbf{M}_{t\to s}\mathbf{y}_i$ approximates $\mathbf{x}_i$, by solving the following least squares problem using the dictionary as the training set:

$$\mathbf{M}_{t\to s}=\arg\min_{\mathbf{M}\in\mathbb{R}^{d\times d}}\sum_{i=1}^{D}\|\mathbf{x}_i-\mathbf{M}\mathbf{y}_i\|^2\tag{4}$$

Using $\mathbf{M}_{t\to s}$, for any target-language word $v$ with word embedding $\mathbf{y}$, we can project it into the source-language embedding space as $\mathbf{M}_{t\to s}\mathbf{y}$.
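As an illustration, the least squares problem in (4) can be solved in closed form with a standard solver. The sketch below uses NumPy on synthetic data; the function name `learn_mapping`, the toy dimension, and the synthetic "true" mapping are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def learn_mapping(X_src, Y_tgt):
    """Solve Eq. (4): find M minimizing sum_i ||x_i - M y_i||^2.

    X_src, Y_tgt: arrays of shape (D, d); row i holds x_i and y_i respectively.
    In row form the problem reads Y_tgt @ M.T ~= X_src, which lstsq solves.
    """
    M_T, *_ = np.linalg.lstsq(Y_tgt, X_src, rcond=None)
    return M_T.T

# Synthetic dictionary: D = 1000 word pairs, toy dimension d = 8.
rng = np.random.default_rng(0)
D, d = 1000, 8
M_true = rng.normal(size=(d, d))                     # hypothetical ground-truth mapping
Y = rng.normal(size=(D, d))                          # target-language embeddings
X = Y @ M_true.T + 0.01 * rng.normal(size=(D, d))    # noisy source-language embeddings
M = learn_mapping(X, Y)
projected = Y @ M.T                                  # rows are M y_i
```

With 1K pairs and a $d\times d$ matrix, the system is heavily over-determined for small $d$; for $d=300$ the mapping has 90K parameters, so the quality of the fit depends on how well a single linear map relates the two embedding spaces.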

3.2.1 Length Normalization and Orthogonal Transformation

To ensure that all the training instances in the dictionary 𝒟 contribute equally to the optimization objective in (4) and to preserve vector norms after projection, we have tried length normalization and orthogonal transformation for learning the bilingual mapping as in (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017).

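First, we normalize the source-language and target-language word embeddings to be unit vectors: $\mathbf{x}'=\frac{\mathbf{x}}{\|\mathbf{x}\|}$ for each source-language word embedding $\mathbf{x}$, and $\mathbf{y}'=\frac{\mathbf{y}}{\|\mathbf{y}\|}$ for each target-language word embedding $\mathbf{y}$.

Next, we add an orthogonality constraint to (4) such that $\mathbf{M}$ is an orthogonal matrix, i.e., $\mathbf{M}^{T}\mathbf{M}=\mathbf{I}$ where $\mathbf{I}$ denotes the identity matrix:

$$\mathbf{M}^{O}_{t\to s}=\arg\min_{\mathbf{M}\in\mathbb{R}^{d\times d},\,\mathbf{M}^{T}\mathbf{M}=\mathbf{I}}\sum_{i=1}^{D}\|\mathbf{x}'_i-\mathbf{M}\mathbf{y}'_i\|^2\tag{5}$$

$\mathbf{M}^{O}_{t\to s}$ can be computed using singular-value decomposition (SVD).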
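For concreteness, the SVD-based solution is the standard orthogonal Procrustes solution: if $\sum_i \mathbf{x}'_i\mathbf{y}'^{T}_i=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$, then the minimizer of (5) is $\mathbf{U}\mathbf{V}^{T}$. A minimal NumPy sketch follows; the function names and the synthetic rotation are assumptions made for illustration.

```python
import numpy as np

def normalize_rows(E):
    """Length-normalize each embedding (row) to a unit vector."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def learn_orthogonal_mapping(X_src, Y_tgt):
    """Solve Eq. (5) on length-normalized embeddings via SVD.

    X_src, Y_tgt: arrays of shape (D, d); row i holds x_i and y_i.
    """
    Xn, Yn = normalize_rows(X_src), normalize_rows(Y_tgt)
    # Xn.T @ Yn equals sum_i x'_i y'_i^T, a d x d matrix
    U, _, Vt = np.linalg.svd(Xn.T @ Yn)
    return U @ Vt

# Synthetic check: the source embeddings are an exact rotation of the target ones.
rng = np.random.default_rng(0)
D, d = 200, 6
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hypothetical orthogonal ground truth
Y = rng.normal(size=(D, d))
X = Y @ Q.T
M_orth = learn_orthogonal_mapping(X, Y)
```

Because an orthogonal matrix preserves vector norms, projected embeddings of unit length remain of unit length under this mapping.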

3.2.2 Semi-Supervised and Unsupervised Mappings

The mapping learned in (4) or (5) requires a seed dictionary. To relax this requirement, Artetxe et al. (2017) proposed a self-learning procedure that can be combined with a dictionary-based mapping technique. Starting with a small seed dictionary, the procedure iteratively 1) learns a mapping using the current dictionary; and 2) computes a new dictionary using the learned mapping.
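The two alternating steps can be sketched as follows. This is a deliberately simplified illustration, not the procedure implemented by Artetxe et al. (2017): it assumes a least-squares mapping as in (4), nearest-neighbour retrieval under Euclidean distance, a fixed number of iterations, and noise-free synthetic embeddings. All function names are hypothetical.

```python
import numpy as np

def fit_mapping(X_src, Y_tgt, pairs):
    """Least-squares mapping from a list of (source_index, target_index) pairs."""
    src_idx, tgt_idx = zip(*pairs)
    M_T, *_ = np.linalg.lstsq(Y_tgt[list(tgt_idx)], X_src[list(src_idx)], rcond=None)
    return M_T.T

def induce_dictionary(X_src, Y_tgt, M):
    """Pair every target word with its nearest source word after projection."""
    projected = Y_tgt @ M.T
    # squared Euclidean distance between each projected target and each source vector
    dists = ((projected[:, None, :] - X_src[None, :, :]) ** 2).sum(axis=2)
    return [(int(s), t) for t, s in enumerate(dists.argmin(axis=1))]

def self_learn(X_src, Y_tgt, seed_pairs, n_iters=5):
    """Alternate between 1) learning a mapping and 2) recomputing the dictionary."""
    pairs = list(seed_pairs)
    for _ in range(n_iters):
        M = fit_mapping(X_src, Y_tgt, pairs)
        pairs = induce_dictionary(X_src, Y_tgt, M)
    return M, pairs

rng = np.random.default_rng(0)
n_words, d = 50, 5
R = rng.normal(size=(d, d))                  # hypothetical "true" mapping
Y = rng.normal(size=(n_words, d))            # target-language embeddings
X = Y @ R.T                                  # source-language embeddings, x_i = R y_i
seed_pairs = [(i, i) for i in range(10)]     # small seed dictionary
M, pairs = self_learn(X, Y, seed_pairs)
```

On this idealized data the seed dictionary already determines the mapping, so the induced dictionary covers the whole vocabulary after one iteration; on real embeddings the loop is what allows a small seed to grow into a usable dictionary.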

Artetxe et al. (2018) proposed an unsupervised method to learn the bilingual mapping without using a seed dictionary. The method first uses a heuristic to build an initial dictionary that aligns the vocabularies of two languages, and then applies a robust self-learning procedure to iteratively improve the mapping. Another unsupervised method based on adversarial training was proposed in Conneau et al. (2018).

We compare the performance of different mappings for cross-lingual RE model transfer in Section 5.3.2.

4 Neural Network RE Models

For any two entities in a sentence, an RE model determines whether these two entities have a relationship, and if yes, classifies the relationship into one of the pre-defined relation types. We focus on neural network RE models since these models achieve the state-of-the-art performance for relation extraction. Most importantly, neural network RE models use word embeddings as the input, which are amenable to cross-lingual model transfer via cross-lingual word embeddings. In this paper, we use English as the source language.

Our neural network architecture has four layers. The first layer is the embedding layer which maps input words in a sentence to word embeddings. The second layer is a context layer which transforms the word embeddings to context-aware vector representations using a recurrent or convolutional neural network layer. The third layer is a summarization layer which summarizes the vectors in a sentence by grouping and pooling. The final layer is the output layer which returns the classification label for the relation type.

4.1 Embedding Layer

For an English sentence with $n$ words $\mathbf{s}=(w_1,w_2,\ldots,w_n)$, the embedding layer maps each word $w_t$ to a real-valued vector (word embedding) $\mathbf{x}_t\in\mathbb{R}^{d\times 1}$ using the English word embedding model (Section 3.1). In addition, for each entity $m$ in the sentence, the embedding layer maps its entity type to a real-valued vector (entity label embedding) $\mathbf{l}_m\in\mathbb{R}^{d_m\times 1}$ (initialized randomly). In our experiments we use $d=300$ and $d_m=50$.

4.2 Context Layer

Given the word embeddings $\mathbf{x}_t$ of the words in the sentence, the context layer tries to build a sentence-context-aware vector representation for each word. We consider two types of neural network layers that aim to achieve this.

4.2.1 Bi-LSTM Context Layer

The first type of context layer is based on Long Short-Term Memory (LSTM) recurrent neural networks (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). Recurrent neural networks (RNNs) are a class of neural networks that operate on sequential data such as sequences of words. LSTM networks are a type of RNN designed to better capture long-range dependencies in sequential data.

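We pass the word embeddings $\mathbf{x}_t$ to a forward and a backward LSTM layer. A forward or backward LSTM layer consists of a set of recurrently connected blocks known as memory blocks. The memory block at the $t$-th word in the forward LSTM layer contains a memory cell $\overrightarrow{\mathbf{c}}_t$ and three gates: an input gate $\overrightarrow{\mathbf{i}}_t$, a forget gate $\overrightarrow{\mathbf{f}}_t$ and an output gate $\overrightarrow{\mathbf{o}}_t$ ($\overrightarrow{\cdot}$ indicates the forward direction), which are updated as follows:

$$\begin{aligned}
\overrightarrow{\mathbf{i}}_t &= \sigma(W_i\mathbf{x}_t+U_i\overrightarrow{\mathbf{h}}_{t-1}+\mathbf{b}_i)\\
\overrightarrow{\mathbf{f}}_t &= \sigma(W_f\mathbf{x}_t+U_f\overrightarrow{\mathbf{h}}_{t-1}+\mathbf{b}_f)\\
\overrightarrow{\mathbf{o}}_t &= \sigma(W_o\mathbf{x}_t+U_o\overrightarrow{\mathbf{h}}_{t-1}+\mathbf{b}_o)\\
\overrightarrow{\mathbf{c}}_t &= \overrightarrow{\mathbf{f}}_t\odot\overrightarrow{\mathbf{c}}_{t-1}+\overrightarrow{\mathbf{i}}_t\odot\tanh(W_c\mathbf{x}_t+U_c\overrightarrow{\mathbf{h}}_{t-1}+\mathbf{b}_c)\\
\overrightarrow{\mathbf{h}}_t &= \overrightarrow{\mathbf{o}}_t\odot\tanh(\overrightarrow{\mathbf{c}}_t)
\end{aligned}\tag{6}$$

where $\sigma$ is the element-wise sigmoid function and $\odot$ is the element-wise multiplication.

The hidden state vector $\overrightarrow{\mathbf{h}}_t$ in the forward LSTM layer incorporates information from the left (past) tokens of $w_t$ in the sentence. Similarly, we can compute the hidden state vector $\overleftarrow{\mathbf{h}}_t$ in the backward LSTM layer, which incorporates information from the right (future) tokens of $w_t$ in the sentence. The concatenation of the two vectors $\mathbf{h}_t=[\overrightarrow{\mathbf{h}}_t,\overleftarrow{\mathbf{h}}_t]$ is a good representation of the word $w_t$ with both left and right contextual information in the sentence.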
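A forward pass through the recurrences in (6) can be written directly in NumPy. The sketch below is illustrative only: the parameter layout, the random initialization, the zero initial states, and the toy dimensions are assumptions, and the forward and backward layers are given separate parameters, as is usual for bidirectional LSTMs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d, dh, rng):
    """Random W (dh x d), U (dh x dh), b (dh) for the gates i, f, o and the cell c."""
    return {g: (rng.normal(scale=0.1, size=(dh, d)),
                rng.normal(scale=0.1, size=(dh, dh)),
                np.zeros(dh)) for g in "ifoc"}

def lstm_step(x, h_prev, c_prev, params):
    """One application of Eq. (6)."""
    pre = {g: W @ x + U @ h_prev + b for g, (W, U, b) in params.items()}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c = f * c_prev + i * np.tanh(pre["c"])
    h = o * np.tanh(c)
    return h, c

def lstm_layer(xs, params):
    """Run the recurrence over a sequence of embeddings; returns one h_t per word."""
    dh = params["i"][2].shape[0]
    h, c = np.zeros(dh), np.zeros(dh)      # zero initial states (an assumption)
    hs = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        hs.append(h)
    return np.stack(hs)

def bilstm_layer(xs, fwd_params, bwd_params):
    """Concatenate forward states with backward states (computed right to left)."""
    h_fwd = lstm_layer(xs, fwd_params)
    h_bwd = lstm_layer(xs[::-1], bwd_params)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
n, d, dh = 6, 8, 4                          # toy sizes, not the paper's settings
xs = rng.normal(size=(n, d))                # word embeddings of a toy sentence
fwd, bwd = init_params(d, dh, rng), init_params(d, dh, rng)
H = bilstm_layer(xs, fwd, bwd)              # shape (n, 2 * dh)
```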

4.2.2 CNN Context Layer

The second type of context layer is based on Convolutional Neural Networks (CNNs) (Zeng et al., 2014; dos Santos et al., 2015), which apply a convolution-like operation on successive windows of size $k$ around each word in the sentence. Let $\mathbf{z}_t=[\mathbf{x}_{t-(k-1)/2},\ldots,\mathbf{x}_{t+(k-1)/2}]$ be the concatenation of $k$ word embeddings around $w_t$. The convolutional layer computes a hidden state vector

$$\mathbf{h}_t=\tanh(\mathbf{W}\mathbf{z}_t+\mathbf{b})\tag{7}$$

for each word $w_t$, where $\mathbf{W}$ is a weight matrix, $\mathbf{b}$ is a bias vector, and $\tanh(\cdot)$ is the element-wise hyperbolic tangent function.
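Equation (7) amounts to sliding a window of $k$ embeddings over the sentence and applying one affine map followed by tanh. A minimal NumPy sketch is given below; zero padding at the sentence boundaries is an assumption, since the treatment of the first and last words is not specified here, and the function name and toy sizes are hypothetical.

```python
import numpy as np

def cnn_context_layer(xs, W, b, k=3):
    """Eq. (7): h_t = tanh(W z_t + b), with z_t the k embeddings centred on w_t.

    xs: (n, d) word embeddings; W: (dh, k * d); b: (dh,). k is assumed odd.
    Positions outside the sentence are filled with zero vectors (an assumption).
    """
    n, d = xs.shape
    pad = (k - 1) // 2
    padded = np.vstack([np.zeros((pad, d)), xs, np.zeros((pad, d))])
    # row t of `windows` is z_t, the concatenation of padded[t], ..., padded[t + k - 1]
    windows = np.stack([padded[t:t + k].reshape(-1) for t in range(n)])
    return np.tanh(windows @ W.T + b)

rng = np.random.default_rng(0)
n, d, dh, k = 5, 4, 3, 3                    # toy sizes; k = 3 matches the CNN model in Section 5.2
xs = rng.normal(size=(n, d))
W = rng.normal(scale=0.1, size=(dh, k * d))
b = np.zeros(dh)
H_cnn = cnn_context_layer(xs, W, b, k)      # shape (n, dh)
```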

4.3 Summarization Layer

After the context layer, the sentence $(w_1,w_2,\ldots,w_n)$ is represented by $(\mathbf{h}_1,\ldots,\mathbf{h}_n)$. Suppose $m_1=(w_{b_1},\ldots,w_{e_1})$ and $m_2=(w_{b_2},\ldots,w_{e_2})$ are two entities in the sentence where $m_1$ is on the left of $m_2$ (i.e., $e_1<b_2$). As different sentences and entities may have various lengths, the summarization layer tries to build a fixed-length vector that best summarizes the representations of the sentence and the two entities for relation type classification.

We divide the hidden state vectors $\mathbf{h}_t$ into 5 groups:

  • $G_1=\{\mathbf{h}_1,\ldots,\mathbf{h}_{b_1-1}\}$ includes vectors that are to the left of the first entity $m_1$.

  • $G_2=\{\mathbf{h}_{b_1},\ldots,\mathbf{h}_{e_1}\}$ includes vectors that are in the first entity $m_1$.

  • $G_3=\{\mathbf{h}_{e_1+1},\ldots,\mathbf{h}_{b_2-1}\}$ includes vectors that are between the two entities.

  • $G_4=\{\mathbf{h}_{b_2},\ldots,\mathbf{h}_{e_2}\}$ includes vectors that are in the second entity $m_2$.

  • $G_5=\{\mathbf{h}_{e_2+1},\ldots,\mathbf{h}_n\}$ includes vectors that are to the right of the second entity $m_2$.

We perform element-wise max pooling among the vectors in each group:

$$\mathbf{h}_{G_i}(j)=\max_{\mathbf{h}\in G_i}\mathbf{h}(j),\quad 1\le j\le d_h,\ 1\le i\le 5\tag{8}$$

where $d_h$ is the dimension of the hidden state vectors. Concatenating the $\mathbf{h}_{G_i}$ we get a fixed-length vector $\mathbf{h}_s=[\mathbf{h}_{G_1},\ldots,\mathbf{h}_{G_5}]$.
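The grouping and pooling in (8) can be sketched in a few lines of NumPy. The sketch uses 0-based, inclusive entity spans, and it represents an empty group (for example, $G_1$ when the first entity starts the sentence) by a zero vector; that fallback is an assumption, since the handling of empty groups is not described here.

```python
import numpy as np

def summarize(H, b1, e1, b2, e2):
    """Build h_s by max pooling over the five groups G1..G5 of Eq. (8).

    H: (n, dh) hidden state vectors; (b1, e1) and (b2, e2) are 0-based inclusive
    spans of the two entities with e1 < b2.
    """
    dh = H.shape[1]
    groups = [H[:b1], H[b1:e1 + 1], H[e1 + 1:b2], H[b2:e2 + 1], H[e2 + 1:]]
    # an empty group is represented by a zero vector (an assumption)
    pooled = [g.max(axis=0) if len(g) else np.zeros(dh) for g in groups]
    return np.concatenate(pooled)

# Toy example: 7 words, dh = 2, entities at positions 1..2 and 4..4.
H_toy = np.arange(14, dtype=float).reshape(7, 2)
h_s = summarize(H_toy, b1=1, e1=2, b2=4, e2=4)   # length 5 * dh = 10
```

Whatever the sentence length, the result has dimension $5d_h$, which is what allows a single output layer to be applied to every entity pair.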

4.4 Output Layer

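The output layer receives inputs from the previous layers (the summarization vector $\mathbf{h}_s$, the entity label embeddings $\mathbf{l}_{m_1}$ and $\mathbf{l}_{m_2}$ for the two entities under consideration) and returns a probability distribution over the relation type labels:

$$\mathbf{p}=\mathrm{softmax}(\mathbf{W}_s\mathbf{h}_s+\mathbf{W}_{m_1}\mathbf{l}_{m_1}+\mathbf{W}_{m_2}\mathbf{l}_{m_2}+\mathbf{b}_o)\tag{9}$$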
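Equation (9) is a single affine layer over three inputs followed by a softmax. The following NumPy sketch shows the shapes involved; the hidden size and the random weights are placeholders, and the label count of 7 assumes the 6 ACE05 relation types plus the "O" class described in Section 5.1.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def output_layer(h_s, l_m1, l_m2, W_s, W_m1, W_m2, b_o):
    """Eq. (9): probability distribution over the relation type labels."""
    return softmax(W_s @ h_s + W_m1 @ l_m1 + W_m2 @ l_m2 + b_o)

rng = np.random.default_rng(0)
n_labels, dh, dm = 7, 4, 50                  # dh is a toy value; dm = 50 as in Section 4.1
h_s = rng.normal(size=5 * dh)                # summarization vector
l_m1, l_m2 = rng.normal(size=dm), rng.normal(size=dm)
W_s = rng.normal(scale=0.1, size=(n_labels, 5 * dh))
W_m1 = rng.normal(scale=0.1, size=(n_labels, dm))
W_m2 = rng.normal(scale=0.1, size=(n_labels, dm))
b_o = np.zeros(n_labels)
p_rel = output_layer(h_s, l_m1, l_m2, W_s, W_m1, W_m2, b_o)
predicted_label = int(p_rel.argmax())
```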

4.5 Cross-Lingual RE Model Transfer

Given the word embeddings of a sequence of words in a target language $t$, $(\mathbf{y}_1,\ldots,\mathbf{y}_n)$, we project them into the English embedding space by applying the linear mapping $\mathbf{M}_{t\to s}$ learned in Section 3.2: $(\mathbf{M}_{t\to s}\mathbf{y}_1,\mathbf{M}_{t\to s}\mathbf{y}_2,\ldots,\mathbf{M}_{t\to s}\mathbf{y}_n)$. The neural network English RE model is then applied on the projected word embeddings and the entity label embeddings (which are language independent) to perform relationship classification.
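In code, the transfer step is a single matrix product placed in front of an unchanged source-language model. The sketch below is schematic: `source_model` stands for any callable implementing the English RE model of this section, and its signature, like the other names, is a hypothetical choice made for the example.

```python
import numpy as np

def project_sentence(Y_sentence, M_ts):
    """Rows of Y_sentence are target-language embeddings y_t; returns rows M_ts y_t."""
    return Y_sentence @ M_ts.T

def classify_cross_lingual(Y_sentence, M_ts, source_model, l_m1, l_m2, span1, span2):
    """Apply a source-language RE model to a target-language sentence.

    source_model is assumed to take (embeddings, l_m1, l_m2, span1, span2) and
    return a probability distribution over relation labels.
    """
    X_projected = project_sentence(Y_sentence, M_ts)
    return source_model(X_projected, l_m1, l_m2, span1, span2)

# Toy check with a stand-in model that ignores its inputs and returns a uniform distribution.
def dummy_model(X, l_m1, l_m2, span1, span2):
    return np.full(3, 1.0 / 3.0)

Y_sentence = np.arange(12, dtype=float).reshape(4, 3)   # 4 words, toy dimension 3
M_ts = 2.0 * np.eye(3)                                  # toy mapping
X_projected = project_sentence(Y_sentence, M_ts)
probs = classify_cross_lingual(Y_sentence, M_ts, dummy_model, None, None, (0, 0), (2, 3))
```

The entity spans and entity label embeddings are passed through untouched, reflecting the fact that only the word embeddings depend on the language.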

Note that our models do not use language-specific resources such as dependency parsers or POS taggers because these resources might not be readily available for a target language. Also our models do not use precise word position features since word positions in sentences can vary a lot across languages.

5 Experiments

In this section, we evaluate the performance of the proposed cross-lingual RE approach on both our in-house dataset and the ACE (Automatic Content Extraction) 2005 multilingual dataset (Walker et al., 2006).

5.1 Datasets

Our in-house dataset includes manually annotated RE data for 6 languages: English, German, Spanish, Italian, Japanese and Portuguese. It defines 56 entity types (e.g., Person, Organization, Geo-Political Entity, Location, Facility, Time, Event_Violence, etc.) and 53 relation types between the entities (e.g., AgentOf, LocatedAt, PartOf, TimeOf, AffectedBy, etc.).

The ACE05 dataset includes manually annotated RE data for 3 languages: English, Arabic and Chinese. It defines 7 entity types (Person, Organization, Geo-Political Entity, Location, Facility, Weapon, Vehicle) and 6 relation types between the entities (Agent-Artifact, General-Affiliation, ORG-Affiliation, Part-Whole, Personal-Social, Physical).

For both datasets, we create a class label “O” to denote that the two entities under consideration do not have a relationship belonging to one of the relation types of interest.

Figure 2: Cross-lingual RE performance (F1 score) vs. dictionary size (number of bilingual word pairs for learning the mapping (4)) under the Bi-LSTM English RE model on the target-language development data.

5.2 Source (English) RE Model Performance

We build 3 neural network English RE models under the architecture described in Section 4:

  • The first neural network RE model does not have a context layer and the word embeddings are directly passed to the summarization layer. We call it Pass-Through for short.

  • The second neural network RE model has a Bi-LSTM context layer. We call it Bi-LSTM for short.

  • The third neural network model has a CNN context layer with a window size 3. We call it CNN for short.


| Model | F1 |
|---|---|
| FCM (S) (Gormley et al., 2015) | 55.06 |
| Hybrid FCM (E) (Gormley et al., 2015) | 58.26 |
| BIDIRECT (S) (Nguyen and Grishman, 2016) | 57.73 |
| VOTE-BW (E) (Nguyen and Grishman, 2016) | 60.60 |
| Pass-Through (S) | 54.99 |
| Bi-LSTM (S) | 58.92 |
| CNN (S) | 57.91 |

Table 1: Comparison with the state-of-the-art RE models on the ACE05 English data (S: Single Model; E: Ensemble Model).

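| Dataset | Language | Training | Dev | Test |
|---|---|---|---|---|
| In-House | English (Source) | 1137 | 140 | 140 |
| In-House | German (Target) | 280 | 35 | 35 |
| In-House | Spanish (Target) | 451 | 55 | 55 |
| In-House | Italian (Target) | 322 | 40 | 40 |
| In-House | Japanese (Target) | 396 | 50 | 50 |
| In-House | Portuguese (Target) | 390 | 50 | 50 |
| ACE05 | English (Source) | 479 | 60 | 60 |
| ACE05 | Arabic (Target) | 323 | 40 | 40 |
| ACE05 | Chinese (Target) | 507 | 63 | 63 |

Table 2: Number of documents in the training/dev/test sets of the in-house and ACE05 datasets.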

First we compare our neural network English RE models with the state-of-the-art RE models on the ACE05 English data. The ACE05 English data can be divided into 6 different domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl). We apply the same data split as in (Plank and Moschitti, 2013; Gormley et al., 2015; Nguyen and Grishman, 2016), which uses news (the union of bn and nw) as the training set, one half of bc as the development set and the remaining data as the test set.

We learn the model parameters using Adam (Kingma and Ba, 2015). We apply dropout (Srivastava et al., 2014) to the hidden layers to reduce overfitting. The development set is used for tuning the model hyperparameters and for early stopping.

In Table 1 we compare our models with the best models in (Gormley et al., 2015) and (Nguyen and Grishman, 2016). Our Bi-LSTM model outperforms the best model (single or ensemble) in (Gormley et al., 2015) and the best single model in (Nguyen and Grishman, 2016), without using any language-specific resources such as dependency parsers.

While the data split in the previous works was motivated by domain adaptation, the focus of this paper is on cross-lingual model transfer, and hence we apply a random data split as follows. For the source language English and each target language, we randomly select 80% of the data as the training set, 10% as the development set, and keep the remaining 10% as the test set. The sizes of the sets are summarized in Table 2.
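A document-level random split of this kind can be sketched with the Python standard library. The rounding rule and the fixed seed below are assumptions for illustration, so the sketch is not expected to reproduce the exact counts in Table 2.

```python
import random

def split_documents(doc_ids, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle documents and split them into training/dev/test sets (80/10/10 by default)."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)          # fixed seed for reproducibility (an assumption)
    n_train = int(round(train_frac * len(docs)))
    n_dev = int(round(dev_frac * len(docs)))
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]              # the remaining documents
    return train, dev, test

train_docs, dev_docs, test_docs = split_documents(range(100))
```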

We report the Precision, Recall and F1 score of the 3 neural network English RE models in Table 3. Note that adding an additional context layer with either Bi-LSTM or CNN significantly improves the performance of our English RE model, compared with the simple Pass-Through model. Therefore, we will focus on the Bi-LSTM model and the CNN model in the subsequent experiments.


| Model | In-House P | In-House R | In-House F1 | ACE05 P | ACE05 R | ACE05 F1 |
|---|---|---|---|---|---|---|
| Pass-Through | 60.8 | 63.1 | 61.9 | 59.4 | 55.1 | 57.2 |
| Bi-LSTM | 66.5 | 67.7 | 67.1 | 65.1 | 65.8 | 65.5 |
| CNN | 63.4 | 68.8 | 66.0 | 61.7 | 67.1 | 64.3 |

Table 3: Performance of the supervised English RE models on the in-house and ACE05 English test data.

5.3 Cross-Lingual RE Performance

We apply the English RE models to the 7 target languages across a variety of language families.

5.3.1 Dictionary Size

The bilingual dictionary includes the most frequent target-language words and their translations in English. To determine how many word pairs are needed to learn an effective bilingual word embedding mapping for cross-lingual RE, we first evaluate the performance (F1 score) of our cross-lingual RE approach on the target-language development sets with an increasing dictionary size, as plotted in Figure 2.

We found that for most target languages, once the dictionary size reaches 1K, further increasing the dictionary size may not improve the transfer performance. Therefore, we select the dictionary size to be 1K.

5.3.2 Comparison of Different Mappings


| Mapping | German | Spanish | Italian | Japanese | Portuguese | Arabic | Chinese | Average |
|---|---|---|---|---|---|---|---|---|
| Regular-1K | 41.4 | 53.0 | 38.7 | 30.0 | 46.4 | 34.4 | 49.0 | 41.8 |
| Orthogonal-1K | 41.0 | 49.2 | 35.7 | 27.1 | 44.0 | 32.1 | 47.2 | 39.5 |
| Semi-Supervised-1K | 38.1 | 46.7 | 31.2 | 25.3 | 43.1 | 37.3 | 47.1 | 38.4 |
| Unsupervised | 37.8 | 45.5 | 28.9 | 21.2 | 40.8 | 37.5 | 49.1 | 37.3 |

Table 4: Comparison of the performance (F1 score) using different mappings on the target-language development data under the Bi-LSTM model.

We compare the performance of cross-lingual RE model transfer under the following bilingual word embedding mappings:

  • Regular-1K: the regular mapping learned in (4) using 1K word pairs;

  • Orthogonal-1K: the orthogonal mapping with length normalization learned in (5) using 1K word pairs (in this case we train the English RE models with the normalized English word embeddings);

  • Semi-Supervised-1K: the mapping learned with 1K word pairs and improved by the self-learning method in (Artetxe et al., 2017);

  • Unsupervised: the mapping learned by the unsupervised method in (Artetxe et al., 2018).

The results are summarized in Table 4. The regular mapping outperforms the orthogonal mapping consistently across the target languages. While the orthogonal mapping was shown to work better than the regular mapping for the word translation task (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017), our cross-lingual RE approach directly maps target-language word embeddings to the English embedding space without conducting word translations. Moreover, the orthogonal mapping requires length normalization, but we observed that length normalization adversely affects the performance of the English RE models (about 2.0 F1 points drop).

We apply the vecmap toolkit (https://github.com/artetxem/vecmap) to obtain the semi-supervised and unsupervised mappings. The unsupervised mapping has the lowest average F1 score over the target languages, but it does not require a seed dictionary. Among all the mappings, the regular mapping achieves the best average F1 score over the target languages using a dictionary with only 1K word pairs, and hence we adopt it for the cross-lingual RE task.


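| Model | German P | German R | German F1 | Spanish P | Spanish R | Spanish F1 | Italian P | Italian R | Italian F1 | Japanese P | Japanese R | Japanese F1 | Portuguese P | Portuguese R | Portuguese F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bi-LSTM | 39.6 | 48.9 | 43.8 | 54.5 | 47.6 | 50.8 | 41.8 | 34.2 | 37.6 | 33.9 | 25.1 | 28.9 | 52.9 | 44.5 | 48.4 |
| CNN | 32.5 | 50.5 | 39.5 | 49.3 | 48.3 | 48.8 | 36.6 | 34.9 | 35.7 | 27.3 | 31.5 | 29.3 | 49.0 | 44.0 | 46.3 |
| Ensemble | 39.6 | 50.5 | 44.4 | 56.9 | 49.1 | 52.7 | 42.6 | 35.3 | 38.6 | 35.3 | 26.4 | 30.2 | 54.9 | 45.2 | 49.6 |
| Supervised | 59.3 | 56.4 | 57.8 | 68.4 | 65.4 | 66.8 | 51.4 | 48.3 | 49.8 | 52.7 | 52.0 | 52.4 | 64.0 | 61.3 | 62.6 |

Table 5: Performance of the cross-lingual RE approach on the in-house target-language test data.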

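| Model | Arabic P | Arabic R | Arabic F1 | Chinese P | Chinese R | Chinese F1 |
|---|---|---|---|---|---|---|
| Bi-LSTM | 30.3 | 45.7 | 36.4 | 61.7 | 37.8 | 46.8 |
| CNN | 24.0 | 39.7 | 29.9 | 56.4 | 33.8 | 42.3 |
| Ensemble | 27.5 | 48.7 | 35.2 | 61.0 | 40.4 | 48.6 |
| Supervised | 70.0 | 69.1 | 69.5 | 66.9 | 69.4 | 68.1 |

Table 6: Performance of the cross-lingual RE approach on the ACE05 target-language test data.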

5.3.3 Performance on Test Data

The cross-lingual RE model transfer results for the in-house test data are summarized in Table 5 and the results for the ACE05 test data are summarized in Table 6, using the regular mapping learned with a bilingual dictionary of size 1K. In the tables, we also provide the performance of the supervised RE model (Bi-LSTM) for each target language, which is trained with a few hundred thousand tokens of manually annotated RE data in the target-language, and may serve as an upper bound for the cross-lingual model transfer performance.

Among the 2 neural network models, the Bi-LSTM model achieves a better cross-lingual RE performance than the CNN model for 6 out of the 7 target languages (the exception is Japanese). In terms of absolute performance, the Bi-LSTM model achieves an F1 score above 40.0 for German, Spanish, Portuguese and Chinese. In terms of relative performance, measured as the ratio of the transfer F1 score to the supervised F1 score, it reaches over 75% of the accuracy of the supervised target-language RE model for German, Spanish, Italian and Portuguese. While Japanese and Arabic appear to be more difficult to transfer, it still achieves 55% and 52% of the accuracy of the supervised Japanese and Arabic RE model, respectively, without using any manually annotated RE data in Japanese/Arabic.
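The relative figures can be recomputed from the F1 columns of Tables 5 and 6. The short script below assumes that "relative performance" means the ratio of the Bi-LSTM transfer F1 score to the supervised F1 score for the same language; the variable names are illustrative.

```python
# F1 scores copied from Tables 5 and 6 (Bi-LSTM transfer vs. supervised Bi-LSTM).
transfer_f1 = {"German": 43.8, "Spanish": 50.8, "Italian": 37.6, "Japanese": 28.9,
               "Portuguese": 48.4, "Arabic": 36.4, "Chinese": 46.8}
supervised_f1 = {"German": 57.8, "Spanish": 66.8, "Italian": 49.8, "Japanese": 52.4,
                 "Portuguese": 62.6, "Arabic": 69.5, "Chinese": 68.1}

# Relative performance in percent: transfer F1 divided by supervised F1.
relative = {lang: 100.0 * transfer_f1[lang] / supervised_f1[lang] for lang in transfer_f1}

for lang, pct in sorted(relative.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {pct:.1f}% of the supervised F1 score")
```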

We apply model ensembling to further improve the accuracy of the Bi-LSTM model. We train 5 Bi-LSTM English RE models initialized with different random seeds, apply the 5 models to the target languages, and combine the outputs by selecting the relation type labels with the highest probabilities among the 5 models. This Ensemble approach improves on the single model by 0.6-1.9 F1 points, except for Arabic.
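One possible reading of this combination rule is that, for each entity pair, the label receiving the single highest probability from any of the models is selected. The NumPy sketch below implements that reading; it is an interpretation for illustration and may differ from the exact rule used in the experiments.

```python
import numpy as np

def ensemble_predict(probs):
    """Pick, per instance, the label with the highest probability across all models.

    probs: array of shape (n_models, n_instances, n_labels).
    """
    best_per_label = probs.max(axis=0)       # (n_instances, n_labels)
    return best_per_label.argmax(axis=1)     # (n_instances,)

# Toy example: 3 models, 2 instances, 3 labels.
probs = np.array([
    [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]],
    [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]],
    [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]],
])
labels = ensemble_predict(probs)
```

In the toy example, the second model's 0.7 for label 1 decides the first instance even though the other two models prefer label 0, which shows how this rule differs from majority voting.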

5.3.4 Discussion

Since our approach projects the target-language word embeddings to the source-language embedding space while preserving the word order, it is expected to work better for a target language whose word order is more similar to that of the source language. This is borne out by our experiments. The source language, English, belongs to the SVO (Subject, Verb, Object) language family where in a sentence the subject comes first, the verb second, and the object third. Spanish, Italian, Portuguese, German (in conventional typology) and Chinese also belong to the SVO language family, and our approach achieves over 70% relative accuracy for these languages. On the other hand, Japanese belongs to the SOV (Subject, Object, Verb) language family and Arabic belongs to the VSO (Verb, Subject, Object) language family, and our approach achieves lower relative accuracy for these two languages.

6 Related Work

There are a few weakly supervised cross-lingual RE approaches. Kim et al. (2010) and Kim and Lee (2012) project annotated English RE data to Korean to create weakly labeled training data via aligned parallel corpora. Faruqui and Kumar (2015) translate a target-language sentence into English, perform RE in English, and then project the relation phrases back to the target-language sentence. Zou et al. (2018) propose an adversarial feature adaptation approach for cross-lingual relation classification, which uses a machine translation system to translate source-language sentences into target-language sentences. Unlike the existing approaches, our approach does not require aligned parallel corpora or machine translation systems. There are also several multilingual RE approaches, e.g., Verga et al. (2016); Min et al. (2017); Lin et al. (2017), where the focus is to improve monolingual RE by jointly modeling texts in multiple languages.

Many cross-lingual word embedding models have been developed recently (Upadhyay et al., 2016; Ruder et al., 2017). An important application of cross-lingual word embeddings is to enable cross-lingual model transfer. In this paper, we apply the bilingual word embedding mapping technique in (Mikolov et al., 2013b) to cross-lingual RE model transfer. Similar approaches have been applied to other NLP tasks such as dependency parsing (Guo et al., 2015), POS tagging (Gouws and Søgaard, 2015) and named entity recognition (Ni et al., 2017; Xie et al., 2018).

7 Conclusion

In this paper, we developed a simple yet effective neural cross-lingual RE model transfer approach, which has very low resource requirements (a small bilingual dictionary with 1K word pairs) and can be easily extended to a new language. Extensive experiments for 7 target languages across a variety of language families on both in-house and open datasets show that the proposed approach achieves very good performance (up to 79% of the accuracy of the supervised target-language RE model), which provides a strong baseline for building cross-lingual RE models with minimal resources.

Acknowledgments

We thank Mo Yu for sharing their ACE05 English data split and the anonymous reviewers for their valuable comments.

References

  • M. Artetxe, G. Labaka, and E. Agirre (2016) Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2289–2294.
  • M. Artetxe, G. Labaka, and E. Agirre (2017) Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 451–462.
  • M. Artetxe, G. Labaka, and E. Agirre (2018) A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 789–798.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537. ISSN 1532-4435.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2018) Word translation without parallel data. In International Conference on Learning Representations.
  • C. dos Santos, B. Xiang, and B. Zhou (2015) Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 626–634.
  • M. Faruqui and C. Dyer (2014) Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp. 462–471.
  • M. Faruqui and S. Kumar (2015) Multilingual open relation extraction using cross-lingual projection. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1351–1356.
  • M. R. Gormley, M. Yu, and M. Dredze (2015) Improved relation extraction with feature-rich compositional embedding models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1774–1784.
  • S. Gouws, Y. Bengio, and G. Corrado (2015) BilBOWA: fast bilingual distributed representations without word alignments. In Proceedings of the 32nd International Conference on Machine Learning, pp. 748–756.
  • S. Gouws and A. Søgaard (2015) Simple task-specific bilingual word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1386–1390.
  • A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610.
  • J. Guo, W. Che, D. Yarowsky, H. Wang, and T. Liu (2015) Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, pp. 1234–1244.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. ISSN 0899-7667.
  • N. Kambhatla (2004) Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, Stroudsburg, PA, USA.
  • S. Kim, M. Jeong, J. Lee, and G. G. Lee (2010) A cross-lingual annotation projection approach for relation detection. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, Stroudsburg, PA, USA, pp. 564–571.
  • S. Kim and G. G. Lee (2012) A graph-based cross-lingual projection approach for weakly supervised relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, Stroudsburg, PA, USA, pp. 48–53.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), ICLR ’15.
  • Q. Li and H. Ji (2014) Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 402–412.
  • Y. Lin, Z. Liu, and M. Sun (2017) Neural relation extraction with multi-lingual attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 34–43.
  • T. Luong, H. Pham, and C. D. Manning (2015) Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 151–159.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
  • T. Mikolov, Q. V. Le, and I. Sutskever (2013b) Exploiting similarities among languages for machine translation. CoRR abs/1309.4168.
  • B. Min, Z. Jiang, M. Freedman, and R. Weischedel (2017) Learning transferable representation for bilingual relation extraction via convolutional neural networks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 674–684.
  • M. Miwa and M. Bansal (2016) End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1105–1116.
  • T. H. Nguyen and R. Grishman (2016) Combining neural networks and log-linear models to improve relation extraction. In Proceedings of IJCAI Workshop on Deep Learning for Artificial Intelligence (DLAI).
  • J. Ni, G. Dinu, and R. Florian (2017) Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1470–1480.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • B. Plank and A. Moschitti (2013) Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 1498–1507.
  • S. Ruder, I. Vulić, and A. Søgaard (2017) A survey of cross-lingual embedding models. CoRR abs/1706.04902.
  • S. L. Smith, D. H. P. Turban, S. Hamblin, and N. Y. Hammerla (2017) Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. ISSN 1532-4435.
  • S. Upadhyay, M. Faruqui, C. Dyer, and D. Roth (2016) Cross-lingual models of word embeddings: an empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1661–1670.
  • P. Verga, D. Belanger, E. Strubell, B. Roth, and A. McCallum (2016) Multilingual relation extraction using compositional universal schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 886–896.
  • C. Walker, S. Strassel, J. Medero, and K. Maeda (2006) ACE 2005 multilingual training corpus.
  • J. Xie, Z. Yang, G. Neubig, N. A. Smith, and J. Carbonell (2018) Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 369–379.
  • C. Xing, D. Wang, C. Liu, and Y. Lin (2015) Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1006–1011.
  • D. Zelenko, C. Aone, and A. Richardella (2003) Kernel methods for relation extraction. The Journal of Machine Learning Research 3, pp. 1083–1106. ISSN 1532-4435.
  • D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao (2014) Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 2335–2344.
  • B. Zou, Z. Xu, Y. Hong, and G. Zhou (2018) Adversarial feature adaptation for cross-lingual relation classification. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 437–448.