Contrastive Language Adaptation for Cross-Lingual Stance Detection

  • 2019-10-04 16:01:23
  • Mitra Mohtarami, James Glass, Preslav Nakov

Abstract

We study cross-lingual stance detection, which aims to leverage labeled data in one language to identify the relative perspective (or stance) of a given document with respect to a claim in a different target language. In particular, we introduce a novel contrastive language adaptation approach applied to memory networks, which ensures accurate alignment of stances in the source and target languages, and can effectively deal with the challenge of limited labeled data in the target language. The evaluation results on public benchmark datasets and comparison against current state-of-the-art approaches demonstrate the effectiveness of our approach.

 


Mitra Mohtarami (1), James Glass (1), Preslav Nakov (2)
(1) MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
(2) Qatar Computing Research Institute, HBKU, Doha, Qatar
{mitra,glass}@csail.mit.edu; [email protected]

1 Introduction

The rise of social media has enabled the phenomenon of “fake news,” which could target specific individuals and can be used for deceptive purposes (Lazer et al., 2018; Vosoughi et al., 2018). As manual fact-checking is a time-consuming and tedious process, computational approaches have been proposed as a possible alternative (Popat et al., 2017; Wang, 2017; Mihaylova et al., 2018), based on information sources such as social media (Ma et al., 2017), Wikipedia (Thorne et al., 2018), and knowledge bases (Huynh and Papotti, 2018). Fact-checking is a multi-step process (Vlachos and Riedel, 2014): (i) checking the reliability of media sources, (ii) retrieving potentially relevant documents from reliable sources as evidence for each target claim, (iii) predicting the stance of each document with respect to the target claim, and finally (iv) making a decision based on the stances from (iii) for all documents from (ii).

Here, we focus on stance detection, which aims to identify the relative perspective of a document with respect to a claim, typically modeled using labels such as agree, disagree, discuss, and unrelated.

Current approaches to stance detection (Bar-Haim et al., 2017; Dungs et al., 2018; Kochkina et al., 2018; Inkpen et al., 2017; Mohtarami et al., 2018) are well-studied in mono-lingual settings, in particular for English, but less attention has been paid to other languages and cross-lingual settings. This is partially due to domain differences and to the lack of training data in other languages.

We aim to bridge this gap by proposing a cross-lingual model for stance detection. Our model leverages resources of a source language (e.g., English) to train a model for a target language (e.g., Arabic). Furthermore, we propose a novel contrastive language adaptation approach that effectively aligns samples with similar or dissimilar stances across source and target languages using task-specific loss functions. We apply our language adaptation approach to memory networks (Sukhbaatar et al., 2015), which have been found effective for mono-lingual stance detection (Mohtarami et al., 2018).

Our model can explain its predictions about the stances of documents with respect to claims in a different (target) language by extracting relevant text snippets from the target-language documents as evidence. We use evidence extraction as a measure to evaluate the transferability of our model. This is because more accurate evidence extraction indicates that the model can better learn semantic relations between claims and pieces of evidence, and consequently can better transfer knowledge to the target language.

The contributions of this paper are summarized as follows:

  • We propose a novel language adaptation approach based on contrastive stance alignment that aligns the class labels between source and target languages for effective cross-lingual stance detection.

  • Our model is able to extract accurate text snippets as evidence to explain its predictions in the target language (results are in Section 4.2).

  • To the best of our knowledge, this is the first work on cross-lingual stance detection.

We conducted our experiments on English (as source language) and Arabic (as target language). In particular, we used the Fake News Challenge dataset (Hanselowski et al., 2018) as source data and an Arabic benchmark dataset (Baly et al., 2018) as target data. The evaluation results show 2.7 and 4.0 points of absolute improvement in terms of macro-F1 and weighted accuracy, respectively, for stance detection over the current state-of-the-art mono-lingual baseline, and 11.4, 14.9, 16.1, 12.9, and 13.1 points of absolute improvement in terms of precision at ranks 1–5, respectively, for extracting evidence snippets. Furthermore, a key finding in our investigation is that, in contrast to other tasks (Devlin et al., 2018; Peters et al., 2018), pre-training with large amounts of source data is less effective for cross-lingual stance detection. We show that this is because pre-training can considerably bias the model toward the source language.

Figure 1: The architecture of our cross-lingual memory network for stance detection.

2 Method

Assume that we are given a training dataset for a source language, $\mathcal{D}^s$, which contains a set of triplets as follows: $\mathcal{D}^s=\{((c_i^s,d_i^s),y_i^s)\}_{i=1}^{N}$, where $N$ is the number of source samples, $(c_i^s,d_i^s)$ is a pair of claim $c_i^s$ and document $d_i^s$, and $y_i^s \in Y$, $Y=\{\mathit{agree},\mathit{disagree},\mathit{discuss},\mathit{unrelated}\}$, is the corresponding label indicating the stance of the document with respect to the claim. In addition, we are given a very small training dataset for the target language, $\mathcal{D}^t=\{((c_i^t,d_i^t),y_i^t)\}_{i=1}^{M}$, where $M$ ($M \ll N$) is the number of target samples, and $(c_i^t,d_i^t)$ is a pair of claim and document in the target language with stance label $y_i^t \in Y$. In reality, (i) the size of the target dataset is very small, (ii) claims and documents in the source and target languages are from different domains, and (iii) the only commonality between the source and target datasets is in their stance labels, i.e., $y_i^s, y_i^t \in Y$.
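The setup above can be pictured with a small data structure. The following is only a sketch: the class name `StanceExample` and the toy English and Arabic strings are hypothetical and are not taken from either dataset.

```python
from dataclasses import dataclass

# Shared label space Y: the only commonality between source and target data.
STANCES = ("agree", "disagree", "discuss", "unrelated")

@dataclass(frozen=True)
class StanceExample:
    """One triplet ((claim, document), stance label)."""
    claim: str
    document: str
    stance: str

    def __post_init__(self):
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance label: {self.stance}")

# Hypothetical toy data: a larger source set and a much smaller target set.
source_data = [
    StanceExample("The bridge collapsed.", "Officials confirmed the collapse.", "agree"),
    StanceExample("The bridge collapsed.", "Stock markets rose sharply.", "unrelated"),
    StanceExample("The bridge collapsed.", "Engineers say the bridge is intact.", "disagree"),
]
target_data = [
    StanceExample("انهار الجسر.", "أكد المسؤولون انهيار الجسر.", "agree"),
]
```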

We develop a language adaptation approach to effectively use the commonality between the source and the target datasets in their label space and to deal with the limited size of the target training data. We apply our language adaptation approach to end-to-end memory networks (Sukhbaatar et al., 2015) for cross-lingual stance detection.

We use memory networks as they have achieved state-of-the-art performance for mono-lingual stance detection (Mohtarami et al., 2018). However, our language adaptation approach can be applied to any other type of neural network. The architecture of our cross-lingual stance detection model is shown in Figure 1. It has two main components: (i) Memory Networks, indicated with two dashed boxes for the source and the target languages, and (ii) a Contrastive Language Adaptation component. In what follows, we first explain our memory network model for cross-lingual stance detection (Section 2.1) and then present our contrastive language adaptation approach (Section 2.2).

2.1 Memory Networks

Memory networks are designed to remember past information (Sukhbaatar et al., 2015) and have been successfully applied to NLP tasks ranging from dialog (Bordes and Weston, 2016) to question answering (Xiong et al., 2016) and mono-lingual stance detection (Mohtarami et al., 2018). They include components that can potentially use different learning models and inference strategies. Our source and target memory networks follow the same architecture as depicted in Figure 1:

A memory network consists of six components. The network takes as input a document d and a claim c and encodes them into the input space I. These representations are stored in the memory component M for future processing. The relevant parts of the input are identified in the inference component F, and used by the generalization component G to update the memory M. Finally, the output component O generates an output from the updated memory, and encodes it to a desired format in the response component R using a prediction function, e.g., softmax for classification tasks. We elaborate on these components below.

Input representation component I:

It encodes documents and claims into corresponding representations. Each document $d$ is divided into a sequence of paragraphs $X=(x_1,\dots,x_l)$, where each $x_j$ is encoded as $\mathbf{m}_j$ using an LSTM network, and as $\mathbf{n}_j$ using a CNN; these representations are stored in the memory component M. Note that while LSTMs are designed to capture and memorize their inputs (Tan et al., 2016), CNNs emphasize the local interaction between individual words in sequences, which is important for obtaining a good representation (Kim, 2014).

Thus, our I component uses both LSTM and CNN representations. It also uses a separate LSTM and a separate CNN with their own parameters to represent each input claim $c$ as $\mathbf{c}_{lstm}$ and $\mathbf{c}_{cnn}$, respectively.

We consider each paragraph as a single piece of evidence because a paragraph usually represents a coherent argument, unified under one or more inter-related topics. We thus use the terms paragraph and evidence interchangeably.

Inference component F:

Our inference component computes LSTM- and CNN-based similarity between each claim c and evidence xj as follows:

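$P_j^{lstm} = \mathbf{c}_{lstm}^{\top} \times \mathbf{M} \times \mathbf{m}_j, \quad \forall j$
$P_j^{cnn} = \mathbf{c}_{cnn}^{\top} \times \mathbf{M}' \times \mathbf{n}_j, \quad \forall j$

where $P_{\cdot}^{lstm}$ and $P_{\cdot}^{cnn}$ indicate claim-evidence similarity based on LSTM and CNN, respectively, $\mathbf{c}_{lstm} \in \mathbb{R}^{q}$ and $\mathbf{m}_j \in \mathbb{R}^{d}$ are the LSTM representations of $c$ and $x_j$, respectively, $\mathbf{c}_{cnn} \in \mathbb{R}^{q}$ and $\mathbf{n}_j \in \mathbb{R}^{d}$ are the corresponding CNN representations, and $\mathbf{M} \in \mathbb{R}^{q \times d}$ and $\mathbf{M}' \in \mathbb{R}^{q \times d}$ are similarity matrices trained to map claims and paragraphs into the same space with respect to their LSTM and CNN representations. The rationale behind using these similarity matrices is that, in the memory network, we seek a transformation of the input claim, i.e., $\mathbf{c}^{\top} \times \mathbf{M}$, in order to obtain the closest evidence to the claim.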
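As an illustration of the bilinear claim-evidence similarity, the sketch below scores each paragraph vector against a claim vector through a similarity matrix. The function name `bilinear_similarity` and all numeric values are hypothetical; in the model the matrix is learned and the vectors come from the LSTM or the CNN.

```python
def bilinear_similarity(c, M, m):
    """Return the scalar c^T M m for c of length q, M of shape q x d, m of length d."""
    q, d = len(M), len(M[0])
    if len(c) != q or len(m) != d:
        raise ValueError("shape mismatch")
    # Transform the claim into the evidence space: (c^T M) has length d.
    projected = [sum(c[i] * M[i][k] for i in range(q)) for k in range(d)]
    # Dot product with the paragraph representation.
    return sum(projected[k] * m[k] for k in range(d))

# Toy example with q = 2 and d = 3 (values are made up).
claim_vec = [1.0, 0.0]
sim_matrix = [[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]]
paragraph_vecs = [[1.0, 1.0, 1.0],
                  [0.0, 5.0, 0.0]]

scores = [bilinear_similarity(claim_vec, sim_matrix, m) for m in paragraph_vecs]
```

Here the first paragraph receives the higher score because it overlaps with the transformed claim, while the second one is orthogonal to it.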

Additionally, we compute another semantic similarity vector, $P_j^{tfidf}$, by applying cosine similarity between the TF.IDF (Spärck Jones, 2004) representations of $x_j$ and $c$. This is particularly useful for stance detection as it can help filter out unrelated pieces of evidence.
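A minimal sketch of such a TF.IDF cosine similarity follows. The whitespace tokenization, the smoothed IDF formula, and the toy sentences are assumptions made for illustration; the exact weighting scheme is not specified in the text.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Map each text to a sparse {token: tf * idf} dictionary."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: cnt * idf[tok] for tok, cnt in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

claim = "the bridge collapsed yesterday"
paragraphs = ["officials said the bridge collapsed",
              "stock markets rose sharply"]

vectors = tfidf_vectors([claim] + paragraphs)
p_tfidf = [cosine(vectors[0], v) for v in vectors[1:]]
```

The second paragraph shares no tokens with the claim, so its similarity is zero and it would be suppressed as unrelated evidence.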

Memory M and Generalization G components:

Our memory component stores representations and the generalization component improves their quality by filtering out unrelated evidence. For example, the LSTM representations of paragraphs, $\mathbf{m}_j, \forall j$, are updated using the claim-evidence similarity $P_j^{tfidf}$ as follows: $\mathbf{m}_j=\mathbf{m}_j \odot P_j^{tfidf}, \forall j$. This transformation helps filter out unrelated evidence with respect to claims. The updated $\mathbf{m}_j$, in conjunction with $\mathbf{c}_{lstm}$, are used by the inference component F to compute $P_j^{lstm}, \forall j$, as explained above. Then, $P_j^{lstm}$ are in turn used to update the CNN representations in memory as follows: $\mathbf{n}_j=\mathbf{n}_j \odot P_j^{lstm}, \forall j$. Finally, the updated $\mathbf{n}_j$ and $\mathbf{c}_{cnn}$ are used to compute $P_j^{cnn}$.
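The order of these updates can be sketched as follows, assuming that each update simply scales a paragraph vector by its scalar similarity score. The helper names, the identity matrices, and the toy vectors are hypothetical.

```python
def bilinear(c, M, v):
    """Scalar c^T M v."""
    return sum(c[i] * M[i][k] * v[k] for i in range(len(c)) for k in range(len(v)))

def scale(vec, s):
    return [s * x for x in vec]

def memory_pass(c_lstm, c_cnn, M_lstm, M_cnn, m, n, p_tfidf):
    # 1) Update LSTM paragraph vectors with the TF.IDF similarity.
    m_upd = [scale(mj, pj) for mj, pj in zip(m, p_tfidf)]
    # 2) Inference on the updated LSTM vectors.
    p_lstm = [bilinear(c_lstm, M_lstm, mj) for mj in m_upd]
    # 3) Update CNN paragraph vectors with the LSTM similarity.
    n_upd = [scale(nj, pj) for nj, pj in zip(n, p_lstm)]
    # 4) Inference on the updated CNN vectors.
    p_cnn = [bilinear(c_cnn, M_cnn, nj) for nj in n_upd]
    return m_upd, n_upd, p_lstm, p_cnn

identity = [[1.0, 0.0], [0.0, 1.0]]
m_upd, n_upd, p_lstm, p_cnn = memory_pass(
    c_lstm=[1.0, 1.0], c_cnn=[1.0, 0.0],
    M_lstm=identity, M_cnn=identity,
    m=[[1.0, 2.0], [3.0, 4.0]],
    n=[[1.0, 1.0], [2.0, 2.0]],
    p_tfidf=[0.5, 0.0],
)
```

In this toy run the second paragraph has zero TF.IDF similarity, so it is zeroed out in memory and contributes nothing to the later LSTM- and CNN-based similarities.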

Output representation component O:

This component computes the output of the memory M by concatenating the average vector of the updated $\mathbf{n}_j$ with the maximum and the average of the claim-evidence similarity vectors $P_j^{tfidf}$, $P_j^{lstm}$, and $P_j^{cnn}$. The maximum helps to identify parts of documents that are most similar to claims, while the average estimates the overall document-claim similarity.
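A small sketch of this concatenation is given below. The ordering of the concatenated pieces and the function name `output_vector` are assumptions; only the ingredients (average of the updated CNN vectors, plus maximum and average of each similarity vector) come from the description above.

```python
def mean(values):
    return sum(values) / len(values)

def output_vector(n_upd, p_tfidf, p_lstm, p_cnn):
    dim = len(n_upd[0])
    # Average of the updated CNN paragraph vectors, dimension by dimension.
    avg_n = [mean([nj[k] for nj in n_upd]) for k in range(dim)]
    # Maximum and average of each claim-evidence similarity vector.
    stats = []
    for p in (p_tfidf, p_lstm, p_cnn):
        stats.extend([max(p), mean(p)])
    return avg_n + stats

o = output_vector(
    n_upd=[[1.0, 3.0], [3.0, 5.0]],
    p_tfidf=[0.25, 0.75],
    p_lstm=[1.0, 3.0],
    p_cnn=[0.0, 2.0],
)
```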

Response generation component R:

This component computes the final stance of a document with respect to a claim. For this, the output of component O is concatenated with $\mathbf{c}_{lstm}$ and $\mathbf{c}_{cnn}$ and fed into a softmax to predict the stance of the document with respect to the claim.
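The prediction step can be sketched as a softmax over four stance scores. The single linear layer in front of the softmax, the weights, and the feature values are assumptions for illustration only.

```python
import math

STANCES = ["agree", "disagree", "discuss", "unrelated"]

def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_stance(features, weights, bias):
    """features: concatenation of the O output with the claim vectors."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return STANCES[probs.index(max(probs))], probs

# Hypothetical 2-dimensional feature vector and a 4 x 2 weight matrix.
label, probs = predict_stance(
    features=[1.0, 2.0],
    weights=[[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
    bias=[0.0, 0.0, 0.0, 0.0],
)
```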

All the memory network parameters, including those of the CNN and the LSTM in the I component, the similarity matrices $\mathbf{M}$ and $\mathbf{M}'$ in F, and the classifier parameters in R, are jointly learned during the training process with our language adaptation.

2.2 Contrastive Language Adaptation

Figure 2: Illustration of stance equal alignment (SEA), stance separation alignment (SSA), and classification alignment (CA) constraints. Different shapes indicate different stance labels and colors specify source (blue) and target (green) languages.
Algorithm 1. Cross-Lingual Stance Detection Model
Inputs:
   $\{(d_i^s,c_i^s),y_i^s\}_{1}^{l_s}$: set of pairs of documents $d_i^s$ and claims $c_i^s$ in the source language, where $y_i^s \in Y$, $Y=\{\mathit{agree},\mathit{disagree},\mathit{discuss},\mathit{unrelated}\}$
   $\{(d_i^t,c_i^t),y_i^t\}_{1}^{l_t}$: small set of pairs of documents $d_i^t$ and claims $c_i^t$ in the target language with $y_i^t \in Y$
Output:
   $\{(d_i^t,c_i^t) \rightarrow y_i^t\}_{l_t+1}^{M}$: assign stance labels $y_i^t \in Y$ to given unlabeled target pairs
Cross-Lingual model:
1 Create the sets $\{((d_i^s,c_i^s),(d_j^t,c_j^t)),(y_i^s,y_i)\}$ and $\{((d_j^t,c_j^t),(d_i^s,c_i^s)),(y_j^t,y_j)\}$, where $y=1$ if the source and the target have the same label, and $y=0$ otherwise.
2 Loop for e epochs:
3    pass $(d_i^s,c_i^s)$ to the source memory network to create its representation $r_i^s$.
4    pass $(d_j^t,c_j^t)$ to the target memory network to create its representation $r_j^t$.
5    pass $(r_i^s,y_i^s)$ to the classification to compute its classification loss $L_{CA_i^s}$.
6    pass $(r_i^s,r_i^t,y_i)$ to the language adaptation to compute the stance alignment loss $L_{CSA_i}$.
7    compute total loss $L_i^s=(1-\alpha)L_{CA_i^s}+\alpha L_{CSA_i}$.
8 repeat steps 2-7 with a change in step 5 by passing the target sample $(r_j^t,y_j^t)$ to the classification instead of the source sample, and compute its $L_j^t$ in step 7.
9 jointly optimize all parameters of the model using the average loss $L=\mathrm{mean}(\{L_i^s\}+\{L_j^t\})$.
Table 1: Cross-lingual stance detection model.
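Step 1 of the algorithm, which builds source-target pairs with a binary same-stance indicator, might look like the sketch below. Pairing every source example with every target example is an assumption (the text does not say how pairs are sampled), and the example identifiers are placeholders.

```python
from itertools import product

def make_pairs(source, target):
    """source, target: lists of (example, stance) tuples.

    Returns ((source_example, target_example), (source_stance, y)) tuples,
    where y is 1 if both examples carry the same stance label and 0 otherwise.
    """
    return [((s_ex, t_ex), (s_y, int(s_y == t_y)))
            for (s_ex, s_y), (t_ex, t_y) in product(source, target)]

# Placeholder identifiers stand in for (document, claim) pairs.
source = [("src-1", "agree"), ("src-2", "unrelated")]
target = [("tgt-1", "agree")]

pairs = make_pairs(source, target)
```

The mirrored set used in step 8 can be obtained by calling `make_pairs(target, source)`, so that the target stance label is the one passed to the classifier.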

Memory networks are effective for stance detection in mono-lingual settings (Mohtarami et al., 2018) when there is sufficient training data. However, we show that these models have limited transferability to target languages with limited data. This could be due to discrepancy between the underlying data distributions in the source and target languages. We show that the performance of these networks can be trivially increased when the model, pre-trained on source data, is fine-tuned using small amounts of labeled target data. We further develop contrastive language adaptation that can exploit the labeled source data to perform well on target data.

Our contrastive adaptation approach:

  • encourages pairs $(d_i^s,c_i^s)$ from the source language and $(d_i^t,c_i^t)$ from the target language with the same stance labels (i.e., $y_i^s=y_i^t$) to be nearby in the embedding space. We call this mapping Stance Equal Alignment (SEA), illustrated with dotted lines in Figure 2. Note that documents and claims in the two languages are often semantically different and are not corresponding translations of each other.

  • encourages pairs $(d_i^s,c_i^s)$ from the source language and $(d_i^t,c_i^t)$ from the target language with different stance labels (i.e., $y_i^s \neq y_i^t$) to be far apart in the embedding space. We call this mapping Stance Separation Alignment (SSA), shown with dashed lines in Figure 2.

  • encourages pairs $(d_i^s,c_i^s)$ from the source language and $(d_i^t,c_i^t)$ from the target language to be correctly classified as $(d_i^s,c_i^s) \rightarrow y_i^s$ and $(d_i^t,c_i^t) \rightarrow y_i^t$. We call this Classification Alignment (CA), shown with solid lines in Figure 2.

We make complete use of the stance labels in the cross-lingual setting by parameterizing our model according to the distance between the source and the target samples in the embedding space.

For stance equal alignment (SEA) constraint, the objective is to minimize the distance between pairs of source and target data with the same stance labels. We achieve this using the following loss:

$$L_{SEA} = \begin{cases} D\big(g(d_i^s,c_i^s) - g(d_i^t,c_i^t)\big)^2, & y_i^s = y_i^t \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

where $g$ maps its input pair to an embedding space using our memory network or any mono-lingual model, and $D$ computes the Euclidean distance.

For stance separation alignment (SSA), the goal is to maximize the distance between pairs with different stance labels. We use the following loss:

$$L_{SSA} = \begin{cases} \max\big(0,\; m - D\big(g(d_i^s,c_i^s) - g(d_i^t,c_i^t)\big)^2\big), & y_i^s \neq y_i^t \\ 0, & \text{otherwise,} \end{cases} \qquad (2)$$

where we maximize the distance between pairs with different stance labels up to the margin m.

The margin parameter m specifies the extent of separability in the embedding space.
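The two alignment losses in Equations (1) and (2) can be sketched as below for a single source-target pair. The embeddings are toy vectors standing in for the outputs of $g$, the margin values are arbitrary, and the square is placed on the distance inside the max, following the equation as printed.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sea_loss(g_s, g_t, y_s, y_t):
    """Stance equal alignment: pull same-label pairs together."""
    return euclidean(g_s, g_t) ** 2 if y_s == y_t else 0.0

def ssa_loss(g_s, g_t, y_s, y_t, margin):
    """Stance separation alignment: push different-label pairs apart up to the margin."""
    if y_s == y_t:
        return 0.0
    return max(0.0, margin - euclidean(g_s, g_t) ** 2)

def csa_loss(g_s, g_t, y_s, y_t, margin):
    """Contrastive stance alignment: sum of the two terms."""
    return sea_loss(g_s, g_t, y_s, y_t) + ssa_loss(g_s, g_t, y_s, y_t, margin)

g_source, g_target = [0.0, 0.0], [3.0, 4.0]  # Euclidean distance 5

same = csa_loss(g_source, g_target, "agree", "agree", margin=30.0)
near = csa_loss(g_source, g_target, "agree", "discuss", margin=30.0)
far = csa_loss(g_source, g_target, "agree", "discuss", margin=10.0)
```

For any given pair only one of the two terms is active: same-label pairs are penalized by their squared distance, and different-label pairs are penalized only while they are closer than the margin allows.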

We can further use any classification loss to enforce classification alignment (CA). We use categorical cross-entropy and call it the Classification Alignment loss, $L_{CA}$.

We develop our overall language adaptation loss, named Contrastive Stance Alignment loss, $L_{CSA}$, by combining $L_{SEA}$ and $L_{SSA}$ as follows:

$$L_{CSA} = L_{SEA} + L_{SSA}. \qquad (3)$$

Finally, the total loss of our cross-lingual stance detection model is defined as follows:

$$L = (1-\alpha)\,L_{CA} + \alpha\,L_{CSA}, \qquad (4)$$

where the α parameter controls the balance between the classification and the language adaptation losses, which we optimize on the validation dataset.
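To make the weighting concrete, the sketch below combines a categorical cross-entropy term with an alignment term as in Equation (4). The probability vector and the alignment loss value are made-up numbers; only the combination rule comes from the text.

```python
import math

def cross_entropy(probs, gold_index):
    """Categorical cross-entropy for one example: -log p(gold class)."""
    return -math.log(probs[gold_index])

def total_loss(l_ca, l_csa, alpha):
    """Equation (4): (1 - alpha) * L_CA + alpha * L_CSA."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * l_ca + alpha * l_csa

# Hypothetical classifier output over (agree, disagree, discuss, unrelated).
l_ca = cross_entropy([0.5, 0.1, 0.3, 0.1], gold_index=0)
l_csa = 4.0  # hypothetical alignment loss for the same pair

loss_classification_only = total_loss(l_ca, l_csa, alpha=0.0)
loss_balanced = total_loss(l_ca, l_csa, alpha=0.5)
loss_alignment_only = total_loss(l_ca, l_csa, alpha=1.0)
```

With α = 0 the model reduces to plain classification, and with α = 1 only the alignment term is optimized; intermediate values trade the two off.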

Information Flow:

Our overall cross-lingual model for stance detection is shown in Figure 1, and a summary of the algorithm is presented in Table 1. As Figure 1 shows, each source and target pair is passed to the source and to the target memory network, respectively, to obtain the corresponding representations (Lines 3-4 in Table 1). The source representation and its gold stance label are passed to the classifier to compute the classification loss (Line 5). In addition, the source and the target representations, in conjunction with a binary parameter ($y$, which is 1 if the source and the target have the same stance label, and 0 otherwise), are passed to the language adaptation component to compute the contrastive stance alignment loss $L_{CSA}$ (Line 6). Finally, the total loss is computed based on Equation (4) (Line 7).

The classifier also uses labeled target samples to create a shared embedding space and to fine-tune itself with respect to the target language. For this purpose, we repeat the above steps by switching the target and the source pipelines (Line 8). Finally, we compute the average of all losses and we use it to optimize the parameters of our model (Line 9).

Pre-training for Language Adaptation:

Pre-training has been found effective in many language adaptation settings (Tzeng et al., 2017). To investigate the effect of pre-training, we first pre-train the source memory network and the classifier using $\mathcal{D}^s$ (only the top pipeline in Figure 1), and then we apply language adaptation with the full model.

#    Method                                      Weigh. Acc.   Acc.   Macro-F1   F1 (agree / disagree / discuss / unrelated)
1    All-unrelated                               34.8          68.1   20.3       0 / 0 / 0 / 81.0
2    All-agree                                   40.2          15.6   6.7        27.0 / 0 / 0 / 0
3    Gradient Boosting (Baly et al., 2018)       55.6          72.4   41.0       60.4 / 9.0 / 10.4 / 84.0
4    TFMLP (Riedel et al., 2017)                 49.3          66.0   37.1       47.0 / 7.8 / 13.4 / 80.0
5    EnrichedMLP (Hanselowski et al., 2018)      55.1          70.5   41.3       59.1 / 9.2 / 14.1 / 82.3
6    Ensemble (Baird et al., 2017)               53.6          71.6   37.2       57.5 / 2.1 / 6.2 / 83.2
7    MN target→target (Mohtarami et al., 2018)   55.3          70.9   41.7       60.0 / 15.0 / 8.5 / 83.1
8    MN source→target                            53.2          64.2   36.0       40.4 / 2.0 / 19.1 / 82.4
9    MN (source,target)→target                   57.3          65.0   42.5       58.0 / 12.2 / 20.9 / 79.0
10   ADMN (adversarial)                          58.6          58.4   43.4       60.2 / 16.1 / 24.2 / 72.9
11   CLMN (contrastive)                          61.3          71.6   45.2       65.1 / 11.6 / 20.5 / 83.7
Table 2: Evaluation results on the target Arabic test dataset.

3 Experiments

Data and Settings.

As source data, we use the Fake News Challenge dataset (available at www.fakenewschallenge.org), which contains 75.4K claim-document pairs in English with {agree, disagree, discuss, unrelated} as stance labels. As target data, we use 3K Arabic claim-document pairs developed by Baly et al. (2018) (available at http://groups.csail.mit.edu/sls/downloads/).

We perform 5-fold cross-validation on the Arabic dataset, using each fold in turn for testing, and keeping 80% of the remaining data for training and 20% for development. We use 300-dimensional pretrained cross-lingual Wikipedia word embeddings from MUSE (Conneau et al., 2017). We use 300-dimensional units for the LSTM and 100 feature maps with filter width of 5 for the CNN. We consider the first 9 paragraphs per document, which is the median number of paragraphs in source documents. We optimize all hyper-parameters on validation data using Adam (Kingma and Ba, 2014).
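The cross-validation protocol can be sketched as an index-splitting routine. The shuffling, the random seed, and the function name `five_fold_splits` are assumptions; the text only fixes the five folds and the 80%/20% split of the remaining data.

```python
import random

def five_fold_splits(n_examples, n_folds=5, dev_fraction=0.2, seed=0):
    """Yield (train, dev, test) index lists, using each fold once for testing."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_dev = int(round(dev_fraction * len(rest)))
        # 20% of the remaining data for development, 80% for training.
        yield rest[n_dev:], rest[:n_dev], test

splits = list(five_fold_splits(100))
```

With 100 examples, each split has 20 test, 16 development, and 64 training examples, and every example is used for testing exactly once.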

Evaluation Measures.

We use the following evaluation measures:

  • Accuracy: The fraction of correctly classified examples. For multi-class classification, accuracy is equivalent to micro-average F1 (Manning et al., 2008).

  • Macro-F1: The average of F1 scores that were calculated for each class separately.

  • Weighted Accuracy: This is a hierarchical metric, which first awards 0.25 points if the model correctly predicts a document-claim pair as related (i.e., agree, disagree, or discuss) or unrelated. If the pair is related, 0.75 additional points are assigned if the model correctly predicts it as agree, disagree, or discuss. The goal of this weighting scheme is to balance out the large number of unrelated examples (Hanselowski et al., 2018).

Baselines.

We consider the following baselines:

  • Heuristic: Given the imbalanced nature of our data, we use two heuristic baselines where all test examples are labeled as unrelated or agree. The former is a majority class baseline favoring accuracy and macro-F1, while the latter is better for weighted accuracy.

  • Gradient Boosting (Baly et al., 2018): This is a Gradient Boosting classifier with n-gram features as well as indicators for refutation and polarity.

  • TFMLP (Riedel et al., 2017): This is an MLP with normalized bag-of-words features enriched with a single TF.IDF-based similarity feature for each claim-document pair.

  • EnrichedMLP (Hanselowski et al., 2018): This model combines five MLPs, each with six hidden layers and advanced features from topic models, latent semantic analysis, etc.

  • Ensemble (Baird et al., 2017): This is an ensemble based on a weighted average of a deep convolutional neural network and a gradient-boosted decision tree; it was the best model at the Fake News Challenge.

  • Mono-lingual Memory Network (Mohtarami et al., 2018): This model is the current state-of-the-art for stance detection on our source dataset. It is an end-to-end memory network which incorporates both CNN and LSTM for prediction.

  • Adversarial Memory Network: We use adversarial domain adaptation (Ganin et al., 2016) instead of contrastive language adaptation in our cross-lingual memory network.

Results.

Table 2 shows the performance of all models on the target Arabic test set. The All-unrelated and All-agree baselines perform poorly across evaluation measures; All-unrelated performs better than All-agree because unrelated is the dominant class (68% of examples).

Rows 3–6 show that Gradient Boosting and EnrichedMLP yield similar results, while TFMLP performs the worst. We attribute this to the advanced features used in the two former models. Gradient Boosting has better accuracy due to its better performance on the dominant class. Note that Ensemble performs poorly because of the limited labeled data, which is insufficient to train a good CNN model.

Rows 7–9 show the results for the mono-lingual memory network (MN) from Mohtarami et al. (2018). The performance of this model when trained on Arabic data only (row 7) is comparable to that of the previous baselines (rows 3–6). However, it performs poorly when trained on the source English data and tested on the Arabic test data (row 8). The model performs best (in terms of weighted accuracy and F1) when first pretrained on the source data and then fine-tuned on the target training data (row 9).

Row 10 in Table 2 shows the results for adversarial memory network (ADMN). It improves the performance of mono-lingual MN on weighted accuracy and F1, but its accuracy significantly drops. This is because adversarial approaches give higher weights to samples of the majority class (i.e., unrelated) which makes classification more challenging for the discriminator (Montahaei et al., 2018).

Row 11 shows the results for our cross-lingual memory network (CLMN) with α = 0.7, where α controls the balance between the classification and the language adaptation losses (tuned using validation data). CLMN outperforms the other baselines in terms of weighted accuracy and F1, while showing comparable accuracy. We show that the improvement is due to language adaptation being able to effectively transfer knowledge from the source to the target language (see Section 4.2).

The last column in Table 2 shows that unrelated examples are the easiest ones. Also, although the agree and the discuss classes have roughly the same size, i.e., 474 and 409 examples, respectively, the results for agree are notably higher. This is mainly because the documents that discuss a claim often share the same topic with the claim, but they do not take a stance. In addition, the disagree examples are the most difficult ones; this class is by far the smallest one, with only 87 examples.

Figure 3: Impact of pretraining on the macro-F1 across α values. The Y axes show average results on the validation datasets with 5-fold cross-validation.
Method                         Weigh. Acc.   Acc.   Macro-F1
CLMN (with pretraining)        60.2          69.8   43.2
CLMN (without pretraining)     61.3          71.6   45.2
Table 3: CLMN results on the target test dataset.

4 Discussion

4.1 Effect of Pretraining

Table 3 shows that CLMN without pretraining (α = 0.7) performs better on the target test data than CLMN with pretraining (α = 0.3). Recall that α controls the balance between the classification and the language adaptation losses. Our further analysis shows that pretraining biases the model toward the source language. Figure 3 shows the impact of pretraining on the macro-average F1 score for CLMN across different values of α on validation data. While the model without pretraining achieves its best performance with a large α (α = 0.7), the model with pretraining performs well with a smaller α (α = 0.3). This suggests that our model can capture the characteristics of the source dataset via pretraining when using small supervision from language adaptation (i.e., small α). However, pretraining introduces bias toward the source space, and the performance drops when larger weights are given to language adaptation; see the results with pretraining in Figure 3.

4.2 Assessment of Model Transferability

The improvements of the CLMN model over the mono-lingual MN models that use the target only, the source only, or both the source and the target (rows 7–9 in Table 2, respectively) indicate its transferability. We further estimate transferability by measuring the accuracy of the models in extracting evidence that supports their predictions. A more accurate model should better transfer knowledge to the target language by accurately learning the relations between claims and pieces of evidence.

Our target data has annotations (in terms of binary labels) for each piece of evidence (here, a paragraph) that indicate whether it is a rationale for the agree or for the disagree class. Moreover, our inference component (F) has a claim-evidence similarity vector, $P_j^{cnn}$, which can be used to rank pieces of evidence from the target document against the target claim.

We use the gold data and the rankings produced by our model in order to measure its precision in extracting evidence that supports its predictions. Figure 4 shows that our CLMN model achieves precision of 40.2, 55.9, 66.0, 72.7, and 79.2 at ranks 1–5, respectively, and outperforms the mono-lingual MN models. This indicates that CLMN can better generalize and transfer knowledge to the target language through learning relations between pieces of evidence and claims.
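One way to compute such a rank-based score is sketched below. It assumes that a document counts as a hit at rank k when at least one gold rationale paragraph appears among its k highest-scoring paragraphs; this reading is an assumption, chosen because it yields scores that cannot decrease as k grows, which matches the reported trend. The similarity scores and gold labels are toy values.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring paragraphs."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

def hit_at_k(scores, gold, k):
    """True if any of the top-k paragraphs is a gold rationale."""
    return any(gold[j] for j in top_k(scores, k))

def evidence_precision_at_k(examples, k):
    """examples: list of (similarity scores, binary gold labels) per document."""
    return sum(hit_at_k(s, g, k) for s, g in examples) / len(examples)

# Two toy documents with per-paragraph similarity scores and gold labels.
examples = [
    ([0.9, 0.1, 0.5], [0, 0, 1]),
    ([0.2, 0.8], [0, 1]),
]

p_at_1 = evidence_precision_at_k(examples, 1)
p_at_2 = evidence_precision_at_k(examples, 2)
```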

Figure 4: Transferability of our cross-lingual model.

4.3 Effect of Language Adaptation

Figures 5(a) and 5(b) show the classification ($L_{CA}$) versus contrastive stance ($L_{CSA}$) losses obtained from our best language adaptation model (i.e., without pretraining) across training epochs and α values. The results are averaged on validation data when performing 5-fold cross-validation. As Figure 5(a) shows, there is a greater reduction in the classification loss for smaller values of α, i.e., when the classification loss contributes more to the overall loss; see Equation (4). On the other hand, Figure 5(b) shows that the CSA loss decreases with larger values of α as the model pays more attention to the CSA loss; see the red and green lines in Figure 5(b). These results indicate that our language adaptation model can find a good balance between the classification loss and the CSA loss, with α = 0.7 yielding the best performance.

(a) Classification loss
(b) CSA loss
Figure 5: Classification vs. contrastive stance alignment (CSA) losses across training epochs and α values.
(a) without pretraining
(b) with pretraining
Figure 6: Classification loss vs. contrastive stance alignment loss (CSA) vs. total loss during training.

Figure 6 compares the classification ($L_{CA}$), contrastive stance ($L_{CSA}$), and total ($L$) losses obtained by our CLMN model on the validation dataset across training epochs when the loss weight parameter (α) is set to its best value. Figures 6(a) and 6(b) show the results without and with pretraining, for α = 0.7 and α = 0.3, respectively. Without pretraining (Figure 6(a)), the classification (light-blue line) and CSA (dark-blue line) losses both decrease up to epoch 20, after which the classification loss keeps decreasing, but the CSA loss starts increasing. With pretraining (Figure 6(b)), the CSA loss rapidly decreases for the first 10 epochs (even though it has a small effect as α = 0.3), and then continues with a smooth trend. This is because, during the initial training epochs, the model is biased toward the source embedding space due to pretraining, and therefore the source and the target examples are far from each other. Then, our language adaptation model aligns the source and the target examples to form a much better shared embedding space, and this alignment strategy yields a rapid decrease of the CSA loss in the first few epochs. Yet, in contrast to the CSA loss, the classification loss increases in the first few epochs. This is because the model enforces alignment between the source and the target samples due to the large distances. Finally, the total loss (orange line) indicates a good balance between the classification and the language adaptation losses, and it consistently decreases during training.

5 Related Work

Domain Adaptation.

Previous work has presented several domain adaptation techniques. Unsupervised domain adaptation approaches (Ganin and Lempitsky, 2015; Long et al., 2016; Muandet et al., 2013; Gong et al., 2012) attempt to align the distribution of features in the embedding space mapped from the source and the target domains. A limitation of such approaches is that, even with perfect alignment, there is no guarantee that same-label examples from different domains would map nearby in the embedding space. Supervised domain adaptation (Daumé III and Marcu, 2006; Becker et al., 2013; Bergamo and Torresani, 2010) attempts to encourage same-label examples from different domains to map nearby in the embedding space. While supervised approaches perform better than unsupervised ones, recent work (Motiian et al., 2017) has demonstrated superior performance by additionally encouraging class separation, meaning that examples from different domains and with different labels should be projected as far apart as possible in the embedding space. Here, we combine both types of alignment for stance detection.

Stance Detection.

Mohammad et al. (2016) and Zarrella and Marsh (2016) worked on stances regarding target propositions, e.g., entities or events, labeled as in-favor, against, or neither. Most commonly, stance detection has been defined with respect to a claim, with the labels agree, disagree, discuss, or unrelated. Previous work mostly relied on models with rich hand-crafted features such as words, word embeddings, and sentiment lexicons (Riedel et al., 2017; Baird et al., 2017; Hanselowski et al., 2018). More recently, Mohtarami et al. (2018) presented a monolingual and feature-light memory network for stance detection. In this paper, we build on this work to extend previous efforts in stance detection to a cross-lingual setting.

6 Conclusion and Future Work

We proposed an effective language adaptation approach to align class labels in source and target languages for accurate cross-lingual stance detection. Moreover, we investigated the behavior of our model in detail and showed that it offers sizable performance gains over a number of competing approaches. In future work, we will extend our language adaptation model to document retrieval and check-worthy claim detection tasks.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. This research was supported in part by the Qatar Computing Research Institute, HBKU, and DSTA of Singapore. This research is part of the Tanbih project (available at http://tanbih.qcri.org/), which aims to limit the effect of “fake news”, propaganda and media bias by making users aware of what they are reading.

References

  • S. Baird, D. Sibley, and Y. Pan (2017) Talos targets disinformation with fake news challenge victory. Note: https://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html Cited by: Table 2, 5th item, §5.
  • R. Baly, M. Mohtarami, J. Glass, L. Màrquez, A. Moschitti, and P. Nakov (2018) Integrating stance detection and fact checking in a unified corpus. In Proceedings of NAACL-HLT, New Orleans, LA, USA. Cited by: §1, Table 2, 2nd item, §3.
  • R. Bar-Haim, I. Bhattacharya, F. Dinuzzo, A. Saha, and N. Slonim (2017) Stance classification of context-dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’17, pp. 251–261. Cited by: §1.
  • C. J. Becker, C. M. Christoudias, and P. Fua (2013) Non-linear domain adaptation with boosting. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 485–493. Cited by: §5.
  • A. Bergamo and L. Torresani (2010) Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems 23, pp. 181–189. Cited by: §5.
  • A. Bordes and J. Weston (2016) Learning end-to-end goal-oriented dialog. CoRR abs/1605.07683. Cited by: §2.1.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087. Cited by: §3.
  • H. Daumé III and D. Marcu (2006) Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research 26 (1), pp. 101–126. External Links: ISSN 1076-9757 Cited by: §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • S. Dungs, A. Aker, N. Fuhr, and K. Bontcheva (2018) Can rumour stance alone predict veracity? In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, pp. 3360–3370. Cited by: §1.
  • Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 1180–1189. Cited by: §5.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17 (1), pp. 2096–2030. External Links: ISSN 1532-4435 Cited by: 7th item.
  • B. Gong, Y. Shi, F. Sha, and K. Grauman (2012) Geodesic flow kernel for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’12, pp. 2066–2073. External Links: ISBN 978-1-4673-1226-4 Cited by: §5.
  • A. Hanselowski, A. PVS, B. Schiller, F. Caspelherr, D. Chaudhuri, C. M. Meyer, and I. Gurevych (2018) A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the International Conference on Computational Linguistics, COLING ’18, Santa Fe, NM, USA, pp. 1859–1874. Cited by: §1, Table 2, 3rd item, 4th item, §5.
  • V. Huynh and P. Papotti (2018) Towards a benchmark for fact checking with knowledge bases. In Companion Proceedings of the The Web Conference 2018, WWW ’18, Lyon, France, pp. 1595–1598. Cited by: §1.
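  • D. Inkpen, X. Zhu, and P. Sobhani (2017) A dataset for multi-target stance detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, EACL ’17, pp. 551–557. Cited by: §1.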
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’14, Doha, Qatar, pp. 1746–1751. Cited by: §2.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §3.
  • E. Kochkina, M. Liakata, and A. Zubiaga (2018) All-in-one: multi-task learning for rumour verification. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, pp. 3402–3413. Cited by: §1.
  • D. M. J. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein, E. A. Thorson, D. J. Watts, and J. L. Zittrain (2018) The science of fake news. Science 359 (6380), pp. 1094–1096. External Links: ISSN 0036-8075 Cited by: §1.
  • M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 136–144. External Links: ISBN 978-1-5108-3881-9 Cited by: §5.
  • J. Ma, W. Gao, and K. Wong (2017) Detect rumors in microblog posts using propagation structure via kernel learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL ’17, Vancouver, Canada, pp. 708–717. Cited by: §1.
  • C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press, New York, NY, USA. External Links: ISBN 0521865719, 9780521865715 Cited by: 1st item.
  • T. Mihaylova, P. Nakov, L. Màrquez, A. Barrón-Cedeño, M. Mohtarami, G. Karadzhov, and J. Glass (2018) Fact checking in community forums. In Proceedings of AAAI, New Orleans, LA, USA. Cited by: §1.
  • S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry (2016) SemEval-2016 task 6: detecting stance in tweets. In Proceedings of SemEval, Berlin, Germany, pp. 31–41. Cited by: §5.
  • M. Mohtarami, R. Baly, J. Glass, P. Nakov, L. Màrquez, and A. Moschitti (2018) Automatic stance detection using end-to-end memory networks. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’18, New Orleans, LA, USA. Cited by: §1, §1, §2.1, §2.2, Table 2, §2, 6th item, §3, §5.
  • E. Montahaei, M. Ghorbani, M. S. Baghshah, and H. R. Rabiee (2018) Adversarial classifier for imbalanced problems. CoRR abs/1811.08812. Cited by: §3.
  • S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017) Unified deep supervised domain adaptation and generalization. In IEEE International Conference on Computer Vision. Cited by: §5.
  • K. Muandet, D. Balduzzi, and B. Schölkopf (2013) Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, S. Dasgupta and D. McAllester (Eds.), Proceedings of Machine Learning Research, Vol. 28, Atlanta, GA, USA, pp. 10–18. Cited by: §5.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, New Orleans, LA, USA, pp. 2227–2237. Cited by: §1.
  • K. Popat, S. Mukherjee, J. Strötgen, and G. Weikum (2017) Where the truth lies: explaining the credibility of emerging claims on the Web and social media. In Proceedings of the Conference on World Wide Web, WWW ’17, Perth, Australia, pp. 1003–1012. External Links: ISBN 978-1-4503-4914-7 Cited by: §1.
  • B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel (2017) A simple but tough-to-beat baseline for the Fake News Challenge stance detection task. ArXiv:1707.03264. Cited by: Table 2, 3rd item, §5.
  • K. Spärck Jones (2004) IDF term weighting and IR research lessons. Journal of Documentation 60 (5), pp. 521–523. Cited by: §2.1.
  • S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus (2015) End-to-end memory networks. In Proceedings of NIPS, Montreal, Canada, pp. 2440–2448. Cited by: §1, §2.1, §2.
  • M. Tan, C. dos Santos, B. Xiang, and B. Zhou (2016) Improved representation learning for question answer matching. In Proceedings of ACL, Berlin, Germany, pp. 464–473. Cited by: §2.1.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, New Orleans, LA, USA, pp. 809–819. Cited by: §1.
  • E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2017) Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2.
  • A. Vlachos and S. Riedel (2014) Fact checking: task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Baltimore, MD, USA, pp. 18–22. Cited by: §1.
  • S. Vosoughi, D. Roy, and S. Aral (2018) The spread of true and false news online. Science 359 (6380), pp. 1146–1151. External Links: ISSN 0036-8075 Cited by: §1.
  • W. Y. Wang (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL ’17, Vancouver, Canada, pp. 422–426. Cited by: §1.
  • C. Xiong, S. Merity, and R. Socher (2016) Dynamic memory networks for visual and textual question answering. In Proceedings of the 33rd International Conference on Machine Learning, ICML ’16, New York, NY, USA, pp. 2397–2406. Cited by: §2.1.
  • G. Zarrella and A. Marsh (2016) MITRE at SemEval-2016 Task 6: transfer learning for stance detection. In Proceedings of SemEval, San Diego, CA, USA, pp. 458–463. Cited by: §5.