Toward Dimensional Emotion Detection from Categorical Emotion Annotations

  • 2019-11-06 17:16:26
  • Sungjoon Park, Jiseon Kim, Jaeyeol Jeon, Heeyoung Park, Alice Oh


We propose a framework that trains a model to predict fine-grained dimensional emotions (valence-arousal-dominance, VAD) from a corpus annotated with coarse-grained categorical emotions. We train the model by minimizing the Earth Mover's Distance (EMD) between the predicted VAD score distributions and categorical emotion distributions sorted in terms of VAD, which serve as a proxy for the target VAD score distributions. With our model, we can simultaneously classify a given sentence into categorical emotions and predict VAD scores. We fine-tune pre-trained BERT-Large on the SemEval dataset (11 categorical emotions) and evaluate on EmoBank (VAD dimensional emotions) to show that our approach reaches performance comparable to that of state-of-the-art classifiers on the categorical emotion classification task and yields significant positive correlations with ground-truth VAD scores. Moreover, if one continues training our model with supervision from VAD labels, it outperforms state-of-the-art VAD regression models. We further present examples showing that our model can annotate a given text with suitable emotional words even when those words are not seen as categorical labels during training.



Sungjoon Park¹, Jiseon Kim¹, Jaeyeol Jeon, Heeyoung Park², Alice Oh¹
¹ School of Computing, KAIST, Republic of Korea
² Department of Psychology, Seoul National University, Republic of Korea

1 Introduction

Figure 1: Overview of our approach. (a) Our model predicts VAD distributions conditioned on an input sentence through supervised training with categorical emotion annotations. (b) Specifically, one-hot categorical labels are sorted by their V, A, and D scores, respectively, to serve as (sparse) target VAD distributions during training. (c) At inference, a categorical emotion class is predicted by picking the label with the maximum probability under the product of the three distributions. (d) Continuous VAD scores are predicted by computing the expectation of each distribution.

Humans feel and express complex emotions beyond the basic emotions (ekman1992argument; plutchik2001nature) on a daily basis. To represent these various emotions systematically, a dimensional emotion model such as the Valence-Arousal-Dominance (VAD) model is commonly used (russell1977evidence). This model maps emotional states to an orthogonal three-dimensional VAD space, so that various emotions can be projected into the space with measurable distances from one another. Since dimensional models represent an emotion as a real-valued vector in this space, they are more likely to account for subtle emotional expressions than categorical models, which employ a finite number of basic emotions. Capturing fine-grained emotions with dimensional VAD models could benefit clinical natural language processing (NLP) research (Desmet:2013:EDS:2506578.2506869; sahana2015automatic), research on emotion regulation in psychotherapy (doi:10.1177/1754073917742706), and other work in computational social science that deals with subtle emotion recognition (buechel2016emotion).

Therefore, building a dimensional emotion detection model from an annotated corpus would be highly useful. However, such annotated resources are surprisingly scarce. Few corpora have full VAD annotations (buechel2017emobank), and some have only VA annotations (preotiuc-pietro-etal-2016-modelling; yu-etal-2016-building). One could build such a resource by labeling a corpus with best-worst scaling (kiritchenko-mohammad-2017-best). Instead, we examine a novel way to predict dimensional emotion (VAD) scores from relatively common resources: corpora annotated with coarse-grained basic categorical emotions (scherer1994isear; alm2005tales; aman2007blogs; mohammad2012emotional; sintsova2013olymplex; li2017dailydialog; schuff2017ssec; shahraki2017cbet; SemEval2018Task1).

In this paper, we propose a framework to learn dimensional VAD scores from a corpus with categorical emotion labels. We demonstrate our idea by fine-tuning the pre-trained language model BERT (BERT) with our approach. In detail, our model learns conditional VAD distributions through supervision from categorical emotion labels, and uses them to compute both VAD scores and categorical emotion labels for a given sentence.

In summary, our contributions are as follows:

  • We propose a framework that enables learning to predict VAD scores from a corpus with categorical emotion annotations.

  • Our model, trained only with categorical emotion labels, predicts VAD scores that show significant positive correlations with the corresponding ground-truth VAD scores.

  • Our model can be fine-tuned further with supervision from VAD scores to outperform state-of-the-art dimensional emotion detection models.

2 Approach

Here we describe how we predict VAD scores for a given text from a model trained on a dataset with categorical emotion annotations.

Overview. The key idea is to train an emotion detection model to predict each of the V, A, and D distributions conditioned on a given text, rather than directly predicting categorical emotion labels as conventional emotion classifiers do. This is possible even when only categorical emotion labels are available, because those labels themselves have VAD scores. Thus one can sort the labels along each VAD dimension to obtain (sparse) ground-truth conditional VAD distributions for a given text (Fig. 1a, 1b). A model can then be trained to predict VAD distributions by minimizing the distance between the predicted and ground-truth distributions. This allows the model not only to predict VAD scores for regression (expectations of the predicted distributions, Fig. 1d) but also to pick an emotion label from a given set of categorical labels for classification (argmax over emotion labels, Fig. 1c).

Model Architecture (Fig. 1a). Formally, an emotion detection model is $P(e \mid X)$, where $e$ is an emotion drawn from a set of pre-defined categorical emotions, $e \in E = \{\text{joy}, \text{happy}, \text{anger}, \text{sad}, \ldots\}$, and $X = \{x_1, x_2, \ldots, x_n\}$ is a sequence of symbols $x_i$ representing an input text. Usually, $e$ is represented as a one-hot vector in emotion classification tasks.

Unlike classification models that directly learn $P(e \mid X)$, we aim to learn the distributions of V, A, and D from pairs of input text $X$ and categorical labels. To this end, we map categorical emotion labels to the three-dimensional VAD space, $e = (v, a, d)$, using the NRC-VAD Lexicon (mohammad-2018-obtaining). For example, the emotion label "joy" is mapped to (0.980, 0.824, 0.794) and "sad" to (0.225, 0.333, 0.149) in the VAD space. Using these coordinates, our model predicts the following distribution:

$$P(e \mid X) = P(v, a, d \mid X) \tag{1}$$

Furthermore, since the dimensions of the VAD space are nearly independent (russell1977evidence), we assume that they are mutually independent given the input. The joint distribution can then be decomposed into a product of three conditional distributions:

$$P(v, a, d \mid X) = P(v \mid X)\, P(a \mid X)\, P(d \mid X) \tag{2}$$

For each of the decomposed conditional distributions, any trainable function with sufficient capacity to capture linguistic patterns in the input could be used. As a demonstration, we use the pre-trained bidirectional language model BERT (BERT), which shows state-of-the-art performance on natural language understanding tasks when fine-tuned on task-specific datasets. For each conditional distribution, we stack a softmax or sigmoid activation layer on the hidden state corresponding to the [CLS] token in BERT.

Model Training (Fig. 1b). To train our model, we need target conditionals for each of $P(v \mid X)$, $P(a \mid X)$, and $P(d \mid X)$, derived from categorical emotion labels. We simply sort the categorical emotions in $E$ by their V, A, and D scores, respectively, based on the mapped VAD coordinates. For example, if the categorical labels are $E = \{\text{joy}, \text{sad}, \text{happy}, \text{anger}\}$ with corresponding valence scores (0.980, 0.225, 1.000, 0.167) in NRC-VAD (mohammad-2018-obtaining), then we reorder the labels to (anger, sad, joy, happy) and rearrange the corresponding one-hot labels accordingly to obtain the target conditional $P(v \mid X)$. In other words, by rearranging the label positions in ascending order of valence score, the sorted one-hot labels can be treated as a proxy for the target conditional. We sort the labels by A and D in the same way to obtain the other conditionals. Note that these conditionals are sparse, because we only have $|E|$ points along each VAD dimension.
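
The sorting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the valence scores are the four quoted in the example above, while the function name `sorted_target` and the use of a Python set for the annotated labels are assumptions made here.

```python
# Valence scores quoted in the text for the four-emotion example (NRC-VAD).
VALENCE = {"joy": 0.980, "sad": 0.225, "happy": 1.000, "anger": 0.167}


def sorted_target(active_labels, scores):
    """Return (label order, target vector), with positions in ascending score order.

    `sorted_target` is a hypothetical helper name used only for this sketch.
    """
    order = sorted(scores, key=scores.get)
    target = [1.0 if label in active_labels else 0.0 for label in order]
    return order, target


# A sentence annotated with the single label "joy".
order, target_v = sorted_target({"joy"}, VALENCE)
print(order)     # ['anger', 'sad', 'joy', 'happy']
print(target_v)  # [0.0, 0.0, 1.0, 0.0]
```

The same helper would be called with arousal and dominance scores to build the other two targets; passing more than one active label yields a multi-hot target for the multi-label case.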

Next, we minimize the distances between the true and predicted $P(\cdot \mid X)$. Since the labels are sorted, there is an order among the classes. This order should be taken into account during optimization, so we minimize the squared Earth Mover's Distance (EMD) loss (hou2017squared) between the true and predicted $P(\cdot \mid X)$. The EMD loss is as follows:

$$\mathrm{EMD}(\mathbf{p}, \hat{\mathbf{p}}) = \sum_{i=1}^{C} \left( \mathrm{CDF}_i(\mathbf{p}) - \mathrm{CDF}_i(\hat{\mathbf{p}}) \right)^2 \tag{3}$$

where $\mathbf{p}$ is a true conditional, $\hat{\mathbf{p}}$ is a predicted conditional, $C$ is the number of classes, and $\mathrm{CDF}_i(\cdot)$ is the $i$-th element of the cumulative distribution function. This loss is designed for ordered classification problems: it penalizes a model more heavily when it places probability mass on a class far from the correct one. It sums the squared differences between the cumulative distribution functions of $\mathbf{p}$ and $\hat{\mathbf{p}}$.
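
A small sketch may help show why this loss respects the label order. It assumes plain Python lists as distributions; the function name `squared_emd` and the example vectors are chosen here for illustration only.

```python
from itertools import accumulate


def squared_emd(p, p_hat):
    """Sum of squared differences between the two cumulative distributions (Eq. 3)."""
    return sum((a - b) ** 2 for a, b in zip(accumulate(p), accumulate(p_hat)))


target = [0.0, 0.0, 1.0, 0.0]  # true class at sorted position 2
near = [0.0, 0.0, 0.0, 1.0]    # all predicted mass one position away
far = [1.0, 0.0, 0.0, 0.0]     # all predicted mass two positions away

print(squared_emd(target, target))  # 0.0
print(squared_emd(target, near))    # 1.0
print(squared_emd(target, far))     # 2.0
```

A cross-entropy loss would treat `near` and `far` as equally wrong, whereas the cumulative form grows with the number of sorted positions separating the predicted mass from the target.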

Note that Eq. 3 assumes that $\mathbf{p}$ and $\hat{\mathbf{p}}$ have the same total probability mass. In the single-label case, i.e., when exactly one categorical emotion label is annotated for each text, the assumption is satisfied: the one-hot $\mathbf{p}$ sums to one, and $\hat{\mathbf{p}}$ is the output of a softmax layer, which always sums to one. In the multi-label case, however, the assumption is violated, because a sigmoid activation layer is generally used to represent the probability of each class independently. We therefore slightly modify Eq. 3 to satisfy the assumption, defining the interclass EMD loss as follows:

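$$\mathrm{EMD}_{\mathrm{inter}}(\mathbf{p}, \hat{\mathbf{p}}) = \sum_{i=1}^{C} \left( \mathrm{CDF}_i(\bar{\mathbf{p}}) - \mathrm{CDF}_i(\bar{\hat{\mathbf{p}}}) \right)^2 \tag{4}$$

where $\bar{\mathbf{p}}$ and $\bar{\hat{\mathbf{p}}}$ are $\mathbf{p}$ and $\hat{\mathbf{p}}$ normalized by dividing each by the sum of its probabilities. We also introduce the intraclass EMD loss: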

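$$\mathrm{EMD}_{\mathrm{intra}}(\mathbf{p}_c, \hat{\mathbf{p}}_c) = \sum_{i=1}^{C} \left( \mathrm{CDF}_i(\mathbf{p}_c) - \mathrm{CDF}_i(\hat{\mathbf{p}}_c) \right)^2 \tag{5}$$

where $\mathbf{p}_c$ is the true $(p, 1-p)$ and $\hat{\mathbf{p}}_c$ is the predicted $(p, 1-p)$ for class $c$. Finally, we use the following EMD loss for the multi-label case:

$$\mathrm{EMD}(\mathbf{p}, \hat{\mathbf{p}}) = \mathrm{EMD}_{\mathrm{inter}} + \mathrm{EMD}_{\mathrm{intra}} \tag{6}$$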
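
The multi-label loss can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the text does not say how the per-class intraclass terms are aggregated, so summing them over classes is an assumption made here, as are the function names and the example vectors.

```python
from itertools import accumulate


def squared_emd(p, q):
    """Squared EMD between two vectors via their cumulative sums (Eq. 3)."""
    return sum((a - b) ** 2 for a, b in zip(accumulate(p), accumulate(q)))


def normalize(p):
    """Divide a vector by its sum so that it carries unit probability mass."""
    total = sum(p)
    return [x / total for x in p] if total > 0 else list(p)


def emd_inter(p, p_hat):
    """Interclass term: squared EMD between the normalized vectors (Eq. 4)."""
    return squared_emd(normalize(p), normalize(p_hat))


def emd_intra(p, p_hat):
    """Intraclass term: squared EMD between (p, 1 - p) pairs for each class (Eq. 5).

    Assumption: the per-class terms are summed over all classes.
    """
    return sum(squared_emd([t, 1.0 - t], [y, 1.0 - y]) for t, y in zip(p, p_hat))


def emd_multilabel(p, p_hat):
    """Total multi-label loss (Eq. 6)."""
    return emd_inter(p, p_hat) + emd_intra(p, p_hat)


target = [0.0, 1.0, 1.0, 0.0]     # two active labels, in sorted order
predicted = [0.1, 0.8, 0.6, 0.1]  # independent sigmoid outputs

print(round(emd_multilabel(target, predicted), 4))  # 0.2317
print(emd_multilabel(target, target))               # 0.0
```

The interclass term compares the shapes of the two distributions after normalization, while the intraclass term keeps each sigmoid output close to its own binary target, which normalization alone would not enforce.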

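Overall, we minimize the sum of the three squared EMD losses between the target and predicted distributions, one for each VAD dimension:

$$l = \mathrm{EMD}(\mathbf{v}, \hat{\mathbf{v}}) + \mathrm{EMD}(\mathbf{a}, \hat{\mathbf{a}}) + \mathrm{EMD}(\mathbf{d}, \hat{\mathbf{d}}) \tag{7}$$

where $\mathbf{v}$, $\mathbf{a}$, $\mathbf{d}$ denote the target and $\hat{\mathbf{v}}$, $\hat{\mathbf{a}}$, $\hat{\mathbf{d}}$ the predicted conditional distributions.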

Predicting Categorical Emotion Labels (Fig. 1c). Based on the model's predicted VAD distributions, we can pick an emotion label from a given set $E$, as conventional emotion classifiers do. By computing the product of the predicted $p(v \mid X)$, $p(a \mid X)$, and $p(d \mid X)$, we obtain the predicted $p(v, a, d \mid X)$, assuming conditional independence. We can then pick an emotion label $e \in E$ as follows:

$$\operatorname*{argmax}_{(v, a, d) = e \in E} P(v, a, d \mid X) \tag{8}$$

Since we only have $|E|$ given emotion labels, we compare the joint probabilities at the points $(v, a, d) = e \in E$ and pick the emotion label with the maximum probability (single-label case, Eq. 8), or all labels whose probability exceeds a certain threshold (multi-label case). The threshold is a hyperparameter of the model, set to 0.125 ($= 0.5^3$).
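
The label-selection step can be sketched as below. Because each of the three predicted vectors is indexed in its own sorted order, the probability for a label has to be looked up at that label's position in each ordering before multiplying. The three-label set, the arousal and dominance orderings, and the predicted probabilities are illustrative assumptions; only the 0.125 threshold comes from the text.

```python
THRESHOLD = 0.5 ** 3  # 0.125, the threshold reported in the text


def joint_scores(p_v, p_a, p_d, order_v, order_a, order_d):
    """Joint probability of each label under the independence assumption (Eq. 2)."""
    pos_a = {label: i for i, label in enumerate(order_a)}
    pos_d = {label: i for i, label in enumerate(order_d)}
    return {
        label: p_v[i] * p_a[pos_a[label]] * p_d[pos_d[label]]
        for i, label in enumerate(order_v)
    }


# Illustrative label orderings per dimension (ascending score); assumed for this sketch.
order_v = ["anger", "sad", "joy"]
order_a = ["sad", "joy", "anger"]
order_d = ["sad", "anger", "joy"]

# Hypothetical model outputs, each aligned with its own ordering above.
p_v = [0.1, 0.2, 0.7]
p_a = [0.2, 0.6, 0.2]
p_d = [0.1, 0.2, 0.7]

scores = joint_scores(p_v, p_a, p_d, order_v, order_a, order_d)
single = max(scores, key=scores.get)                             # single-label case (Eq. 8)
multi = [label for label, s in scores.items() if s > THRESHOLD]  # multi-label case

print(single)  # joy
print(multi)   # ['joy']
```

The threshold of 0.5 cubed corresponds to each of the three per-dimension probabilities for a label sitting at 0.5.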

Predicting Continuous VAD Scores (Fig. 1d). We can further compute the expectations of the predicted conditionals $p(v \mid X)$, $p(a \mid X)$, and $p(d \mid X)$ to predict continuous VAD scores:

$$v_X = \mathbb{E}(P(v \mid X)), \quad a_X = \mathbb{E}(P(a \mid X)), \quad d_X = \mathbb{E}(P(d \mid X)) \tag{9}$$

Once again, we use the VAD scores in (mohammad-2018-obtaining) as the values of each dimension when computing the expectations. This allows us to predict continuous VAD scores from a model trained only on categorical emotion annotations.
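
As a minimal sketch of this expectation, the predicted probabilities can be used as weights on the lexicon scores of the sorted labels. The valence scores are those quoted earlier for the four-emotion example; the predicted probabilities are hypothetical, and the sketch assumes a softmax output that sums to one (sigmoid outputs in the multi-label case need not).

```python
def expected_score(probs, sorted_scores):
    """Expectation of a discrete distribution over lexicon scores (Eq. 9)."""
    return sum(p * s for p, s in zip(probs, sorted_scores))


valence_sorted = [0.167, 0.225, 0.980, 1.000]  # anger, sad, joy, happy
p_v = [0.1, 0.1, 0.5, 0.3]                     # hypothetical softmax output

v_x = expected_score(p_v, valence_sorted)
print(round(v_x, 4))  # 0.8292
```

Because the expectation interpolates between the label scores, the prediction can fall at any point between the lowest and highest lexicon value, not only at the |E| label positions.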

3 Experiments

In this section, we describe our experimental setup. Throughout these experiments, we focus on demonstrating that our approach can effectively predict continuous emotional dimensions (VAD scores) using only categorical emotion labels.

3.1 Dataset

We use three datasets consisting of text and corresponding emotion annotations. Two of them have categorical emotion labels, and the third is a VAD-annotated corpus.

SemEval 2018 E-c (SemEval). A multi-label corpus annotated with categorical emotions, containing 10,983 tweets with labels for the presence or absence of 11 emotions (SemEval2018Task1). We hereafter abbreviate this dataset as SemEval.

ISEAR. A single-label corpus annotated with categorical emotions, containing 7,666 sentences. Each sentence is labeled with exactly one of 7 categorical emotions (scherer1994isear).

EmoBank. Sentences paired with continuous VAD scores as labels. This corpus contains 10,062 sentences collected across 6 domains and 2 perspectives. Each sentence has three scores representing V, A, and D, each in the range of 1 to 5. Unless otherwise noted, we use the weighted average of the VAD scores as the ground-truth scores, as recommended by the EmoBank authors (buechel2017emobank).

3.2 Predicting Categorical Emotion Labels.

We examine the classification performance of our approach and compare it to state-of-the-art emotion classification models. We use accuracy, macro F1 score, and micro F1 score as evaluation metrics.

MT-CNN. A convolutional neural network for text classification trained by multi-task learning (zhang2018text). The model jointly learns classification labels and the emotion distribution of a given text. The emotion distribution represents multiple emotions in a given sentence and is computed from normalized counts of affective terms extracted with emotion lexicons. The model reaches state-of-the-art classification accuracy and F1 score on ISEAR.

NTUA-SLP. A classification model using deep self-attention layers over Bi-LSTM hidden states. The model is pre-trained on general tweets and 'SemEval 2017 task 4A', then fine-tuned on all 'SemEval 2018 subtasks' in order to transfer the learned knowledge to each subtask (baziotis2018ntua). The model took first place in the multi-label emotion classification task on the SemEval dataset.

BERT-Large (Classification). A pre-trained bidirectional language model based on multiple stacked Transformers (46201). The model shows state-of-the-art performance on various natural language understanding tasks after being fine-tuned on task-specific datasets (BERT). We add a linear transformation layer on top of BERT, with sigmoid activation for training on a multi-label dataset (SemEval) or softmax activation for a single-label dataset (ISEAR). Like conventional text classifiers, these models are optimized by minimizing the cross-entropy loss between the predicted distributions and one-hot labels.

BERT-Large (Ours, SemEval). We use BERT again and fine-tune the model with our objective functions. For a multi-label dataset (SemEval), we minimize Eq. 7 with Eq. 6 for each VAD dimension. This model can choose an emotion label in E by Eq. 8.

BERT-Large (Ours, ISEAR). We fine-tune another BERT with our approach on ISEAR. This model is optimized by minimizing Eq. 7 with Eq. 3 for each VAD dimension. Like the model above, it can predict an emotion label by Eq. 8.

3.3 Predicting Continuous VAD scores.

Next, we investigate the VAD score prediction performance of our approach and compare it to state-of-the-art VAD regression models. Since the training objectives of the models vary, we use Pearson's correlation coefficient between a model's VAD predictions and the ground-truth scores as the evaluation metric.
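
A short sketch of the metric may clarify why it suits this comparison: Pearson's r is unchanged by a positive linear rescaling of either variable, so predictions on the lexicon's 0 to 1 scale can be compared with EmoBank scores on a 1 to 5 scale. The function name and the example values below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


predicted = [0.2, 0.5, 0.9, 0.4]  # hypothetical predictions on a 0-1 scale
gold = [2.1, 2.9, 4.2, 3.1]       # hypothetical ground truth on a 1-5 scale

r = pearson_r(predicted, gold)
r_rescaled = pearson_r([4 * p + 1 for p in predicted], gold)  # map 0-1 onto 1-5

print(abs(r - r_rescaled) < 1e-9)  # True
```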

3.3.1 Zero-shot Predictions

We refer to the following two results as zero-shot prediction performance, because these models are not trained on EmoBank, i.e., they are trained without supervision from any VAD score labels. These models use the entire EmoBank corpus as an evaluation set. We focus on these results since our aim is to predict VAD scores with a model trained on a corpus annotated with categorical emotion labels.

BERT-Large (Ours, SemEval). We compute VAD score predictions using Eq. 9 from our model trained on SemEval, which is the same model used for predicting categorical emotion labels.

BERT-Large (Ours, ISEAR). Like the model above, we also compute VAD scores from our model trained on ISEAR.

3.3.2 Predictions after Supervised Learning

Unlike the previous models, the following models are trained by supervised learning on the VAD score labels in EmoBank. These results allow us to put the zero-shot prediction performance in context, and to see how much the zero-shot prediction model can be improved when VAD annotations are available.

AAN. An Adversarial Attention Network for dimensional emotion regression, which learns to discriminate VAD dimension scores (zhu-etal-2019-adversarial). Pearson correlations between the predicted and ground-truth VAD scores in EmoBank are reported. Note that the scores are reported separately for the 2 perspectives and 6 domains, so we use the highest VAD correlations among the perspectives and domains for comparison.

Ensemble. Multi-task ensemble neural networks that learn to predict VAD scores, sentiment, and their intensity simultaneously (8756111). The model was recently shown to be effective for VAD regression.

SRV-SLSTM. A model that predicts VAD scores with variational autoencoders trained by semi-supervised learning, which shows state-of-the-art performance on the VAD score prediction task (wu2019semi). The model shows its highest performance when using 40% of the labeled EmoBank data, so we compare our model's performance to those scores.

BERT-Large (Ours, EBSemEval). We further fine-tune our BERT-Large (SemEval) model on the EmoBank dataset. We split EmoBank into training, validation, and test sets with a ratio of 6:2:2, then train the model and report the correlation between the predicted and ground-truth VAD scores on the test set. Specifically, we remove the final linear layer with softmax or sigmoid activations used for training with categorical labels, and add a new linear layer with ReLU activations for VAD score prediction. All parameters are then fine-tuned again by minimizing the mean squared error (MSE) loss between the predicted and ground-truth VAD scores. With this model, we investigate the effectiveness of our approach as a parameter initialization strategy for VAD regression when VAD annotations are available.

3.4 Experimental Details.

In all experiments, we use the BERT-Large uncased model. We set the learning rate to 2e-5 with a warm-up period of 3 epochs. The batch size is 64, and we fine-tune all of the layers, stopping when the validation loss is minimized. We use a single TPU for optimization, and all fine-tuning runs converged within 10 epochs, taking an hour.

4 Results

Dataset                                            EmoBank                 SemEval 2018 E-c            ISEAR
Task                                               Regression              Classification              Classification
Model                             Scheme           V (r)   A (r)   D (r)   Macro F1  Micro F1  Acc.
MT-CNN (zhang2018text)            -                -       -       -       -         -         -       -       0.668
NTUA-SLP (baziotis2018ntua)       -                -       -       -       0.528     0.701     0.588   -       -
BERT-Large (Classification, ep3)  -                -       -       -       0.534     0.697     0.572   0.704   0.700
BERT-Large (Ours, SemEval)        Zero-shot        0.659   0.327   0.287   0.500     0.695     0.572   -       -
BERT-Large (Ours, ISEAR)          Zero-shot        0.502   0.069   0.236   -         -         -       0.695   0.688
AAN (zhu-etal-2019-adversarial)   Supervised       0.424   0.352   0.265   -         -         -       -       -
Ensemble (8756111)                Supervised       0.635   0.375   0.277   -         -         -       -       -
SRV-SLSTM (wu2019semi)            Semi-supervised  0.620   0.508   0.333   -         -         -       -       -
BERT-Large (Ours, EBSemEval)      Supervised       0.765   0.583   0.416   -         -         -       -       -
Table 1: Performance of VAD score prediction and categorical emotion class prediction. With fine-tuning pre-trained BERT-Large, we show comparable performance to state-of-the-art models in classification and significant positive correlations with VAD scores using only the categorical emotion annotations. If our model trained on SemEval is fine-tuned on EmoBank, it outperforms all the state-of-the-art VAD regression models.

We now present our experimental results. First, we elaborate on the zero-shot VAD score prediction results of our models, and then compare them to those of supervised models. We also report the classification performance of our model and the comparison models.

Zero-Shot VAD Score Predictions. The results are shown in Table 1. When our model is trained on SemEval and tested on EmoBank, the predicted VAD scores show significant positive Pearson correlations with the target VAD scores in EmoBank. The correlation is highest for valence (V) (r=.659, p<.001), followed by arousal (A) (r=.327, p<.001) and dominance (D) (r=.287, p<.001). For our model trained on ISEAR, the scores also show significant positive Pearson's r: the correlation is highest for V (r=.502, p<.001), followed by D (r=.236, p<.001) and A (r=.069, p<.001).

The correlations for the SemEval-trained model are higher than those for the ISEAR-trained model in all dimensions. We attribute this to the emotion labels in SemEval carrying more information than those in ISEAR. First, SemEval has 11 categorical emotion labels, whereas ISEAR has 7. A larger number of labels leads to less sparse VAD target distributions, so the model can distinguish degrees of VAD more easily. Second, a sentence in SemEval can have multiple emotion labels, whereas a sentence in ISEAR has only one. Multiple emotion labels make the possible range of the expected VAD scores much wider than single labels do. If a sentence always has a single label, the predicted VAD distribution must sum to one. With multiple labels, by contrast, the distributions can sum to a much larger value, which leads to a wider range of expected values and helps the model distinguish the degree of each VAD dimension for a given sentence.

Note that the correlation in the A dimension for ISEAR is low. The standard deviation of the arousal scores of the ISEAR labels ('anger', 'disgust', 'fear', 'sadness', 'shame', 'joy', 'guilt') is lower (.191) than that of the other dimensions (V: .328, D: .237), and it becomes much lower, dropping to .105, when the single label 'sadness' is removed. This makes it difficult for the model to differentiate labels by their degree of arousal, leading to a lower correlation with the target scores for the A dimension.

Comparison to VAD Predictions of Supervised Models. The three comparison models (AAN, Ensemble, SRV-SLSTM) in Table 1 are trained with supervision from VAD scores. Since our model trained on SemEval performs better than the one trained on ISEAR, we hereafter compare the SemEval-trained model to the comparison models.

Among the comparison models, Ensemble shows the highest correlation in the V dimension (.635), and SRV-SLSTM reaches the highest correlations in the A (.508) and D (.333) dimensions. We highlight that our model trained on SemEval shows an even better correlation in the V dimension (.659) without any supervision from VAD score labels. Its correlation for A (.327) is lower than those of the supervised models (.352 to .508), and its correlation for D (.287) is comparable to that of Ensemble (.277). Overall, the zero-shot prediction performance is fairly comparable to that of the state-of-the-art models, most clearly in the V and D dimensions.

Furthermore, we present the results of another model of ours, which is trained on SemEval and then fine-tuned on the training set of the EmoBank corpus with its VAD score labels. If we continue training our model with supervision from VAD labels, it outperforms all of the state-of-the-art models by a large margin. The VAD fine-tuned model shows significant correlations in all of the V (r=.765, p<.001), A (r=.583, p<.001), and D (r=.416, p<.001) dimensions. These are improvements of +.130, +.075, and +.083 in correlation over the best comparison model in the V, A, and D dimensions, respectively.
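
The reported margins can be checked against Table 1 by subtracting, for each dimension, the best comparison-model correlation from that of the fine-tuned model. The symbols $\Delta_V$, $\Delta_A$, and $\Delta_D$ are introduced here only for illustration.

```latex
\begin{align*}
\Delta_V &= 0.765 - 0.635 = 0.130 && \text{(vs. Ensemble)} \\
\Delta_A &= 0.583 - 0.508 = 0.075 && \text{(vs. SRV-SLSTM)} \\
\Delta_D &= 0.416 - 0.333 = 0.083 && \text{(vs. SRV-SLSTM)}
\end{align*}
```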

Categorical Label Classification. Next, we report the classification performance of our model and the comparison models. On SemEval, fine-tuning BERT as a conventional classifier (BERT-Large, Classification) yields a higher macro F1 score (.534) than NTUA-SLP, and a comparable micro F1 score (.697) and multi-label accuracy (.572). Fine-tuning BERT on ISEAR shows similar results: the BERT classifier outperforms MT-CNN with a higher micro F1 score (.700).

Our model also shows classification performance comparable to that of the comparison models. On ISEAR, our model reaches a macro F1 score of .688, which is higher than that of MT-CNN. On SemEval, however, our model performs slightly below NTUA-SLP.

5 Ablation Study

Model                         V (r)   A (r)   D (r)
1. BERT (Ours, SemEval)       0.659   0.327   0.287
2. BERT (Random Init., EB)    0.600   0.536   0.344
3. BERT (Ours, EBSemEval)     0.765   0.583   0.416
4. BERT (Regression, EB)      0.787   0.632   0.498
Table 2: Ablation Study results of our models. Given that the model architecture is the same (BERT-Large), the architecture is effective for the VAD regression task, and initialization with our model trained on categorical emotion annotation helps to improve the performance as well. Using pre-trained BERT-Large shows slightly better results.

We further conduct an ablation study to investigate our model's VAD prediction performance. Since we use pre-trained BERT and fine-tune it on different datasets, the effects of pre-training and fine-tuning should be separated to understand the source of the improvements.

In Table 2, we present four models for the ablation study, all of which have the same neural network architecture (BERT-Large) to control for the size and structure of the model. Model 1 is our model trained on SemEval, and Model 3 is fine-tuned on EmoBank, initialized with the trained weights of Model 1. This is equivalent to continuing to train Model 1 with supervision from EmoBank labels. Model 2 uses the BERT architecture with all weights randomly initialized, i.e., without pre-trained language model weights, and is trained on EmoBank. Lastly, Model 4 fine-tunes BERT directly on EmoBank VAD labels, starting from the pre-trained language model weights.

As shown in Table 2, Model 2 is already comparable to the state-of-the-art VAD prediction models in Table 1. Specifically, Model 2 outperforms SRV-SLSTM in the A and D dimensions. In the V dimension, Model 2 underperforms Model 1 and SRV-SLSTM. Overall, this indicates that the multi-layer Transformer architecture is effective for VAD score regression even without any pre-trained knowledge. We also see a further improvement with Model 3, which means that initializing the model with our approach is better than starting training from random weights.

Note that Model 4 shows the best performance in all of the V (r=.787, p<.001), A (r=.632, p<.001), and D (r=.498, p<.001) dimensions. This indicates that the pre-trained bidirectional language model weights are a better initialization than our model. A likely reason is that Model 1 has already been fine-tuned once to predict VAD distributions from categorical emotion labels, which may cause it to forget some of the general linguistic representations of pre-trained BERT. Starting from a general representation of text thus seems to allow better VAD score prediction than starting from representations trained on categorical emotion labels. This might be partially due to a suboptimal strategy for fine-tuning an already fine-tuned model. That question is beyond the scope of this work, and we plan to investigate how to fine-tune a fine-tuned model effectively in future work.

6 Qualitative Examples

Tweet: Gooood morning it is such a #blessing to see another day all that Read this I hope have a great morning
Categorical labels: joy, optimism
Nearest neighbors from VAD scores: reaffirm, shimmer, brighten, affections, mythological

Tweet: Happy Winning Wednesday!! Each day is a day of new possibilities. Keep pushing and keep your head up. #live #love #laugh #reachforthestars
Categorical labels: joy, love,
Nearest neighbors from VAD scores: incentive, alive, reborn, radiance, lavish

Tweet: Not only was and responsible for the unnecessary outrage of this movie, but made the director look bad
Categorical labels: anger, disgust
Nearest neighbors from VAD scores: refusal, liar, falsified, disrespect, unsavory

Tweet: you begin to irritate me, primitive
Categorical labels: anger, disgust
Nearest neighbors from VAD scores: negativity, abandon, dontlikeyou, depression, morgue

Tweet: Mentally suffered #iwanttodie #worthless #lifewithoutcolor #pain #suicidal
Categorical labels: disgust, pessimism,
Nearest neighbors from VAD scores: orphaned, wasting, decomposed, hopelessness, dead
Table 3: Qualitative examples of predictions from our model trained on SemEval. The example tweets are from the SemEval test set. We present the predicted categorical emotion labels and the top 5 nearest-neighbor words in the NRC-VAD Lexicon with respect to the model's predicted VAD scores.

In Table 3, we show example predictions from our model trained on SemEval. The table presents annotated tweets from the SemEval test set, the corresponding predicted categorical labels, and the top 5 nearest-neighbor emotional words with respect to the predicted VAD scores. For these 5 tweets, our model correctly predicted the categorical emotion labels. Below, we explain how we find the nearest-neighbor words from the VAD scores.

Given the VAD scores predicted by our model, we find the nearest-neighbor words for those scores using the NRC-VAD Lexicon (mohammad-2018-obtaining). We first rescale our model's predicted VAD scores to the range 0 to 1 in each VAD dimension, since the lexicon values range from 0 to 1. To do this, we predict VAD scores for every sentence in the SemEval test set and then rescale the scores in each dimension by $(x - \min(x)) / (\max(x) - \min(x))$, so that every dimension has scores from 0 to 1.

Next, we find the nearest-neighbor words using the rescaled VAD values. We compute the Euclidean distances between the values and all words in the NRC-VAD Lexicon, and pick the 5 words with the smallest distances. We present these words in Table 3. They help us understand the VAD scores more intuitively, and they can also be regarded as automatically generated emotional annotations for a given sentence. In other words, our model can suggest emotion labels that were not seen at training time by finding nearest-neighbor words in the VAD space.
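
The two steps described above, min-max rescaling and nearest-neighbor lookup, can be sketched as follows. The lexicon here is a tiny stand-in: the entries for "joy" and "sad" use the coordinates quoted earlier in the text, while `calm_example`, the query vector, and the function names are hypothetical.

```python
import math


def min_max_rescale(values):
    """Rescale one dimension's scores to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # guard against a constant column
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


def nearest_words(vad, lexicon, k=5):
    """Return the k lexicon words closest to `vad` in Euclidean distance."""
    return sorted(lexicon, key=lambda word: math.dist(vad, lexicon[word]))[:k]


# Stand-in lexicon: word -> (valence, arousal, dominance).
LEXICON = {
    "joy": (0.980, 0.824, 0.794),    # coordinates quoted in the text
    "sad": (0.225, 0.333, 0.149),    # coordinates quoted in the text
    "calm_example": (0.8, 0.2, 0.6), # hypothetical entry for illustration
}

print(min_max_rescale([1.0, 2.0, 3.0]))              # [0.0, 0.5, 1.0]
print(nearest_words((0.9, 0.8, 0.8), LEXICON, k=2))  # ['joy', 'calm_example']
```

In practice the rescaling would be applied separately to the valence, arousal, and dominance predictions collected over the whole test set before the lookup is run for each sentence.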

The five examples in Table 3 show that our model can predict categorical emotion labels and also find suitable emotional words for a given sentence. In particular, for the fifth tweet, our model annotated the sentence with depressive words (hopelessness, dead), so it might be extended to detect warning signs from people in need on social media.

7 Related Work

VAD Dimensions of Emotions. Research on models of emotion representation has a long history in psychology. The categorical model of emotion assumes that discrete categories represented by emotion words are the building blocks of human emotion. Supporting evidence includes the six basic emotions (ekman1992argument) and findings on universally adaptive emotions (plutchik1980general). Alternatively, the dimensional model of emotion seeks to explain how people conceptualize emotional feelings. osgood1957measurement suggested the initial ideas of emotion coordinates. russell1977evidence further constructed the Pleasure (or Valence)-Arousal-Dominance (PAD, VAD) model, a semantic scale model for rating emotional states, which represents an emotional state as a set of orthogonal coordinates on the V, A, and D dimensions. The absolute values of the intercorrelations among the three scales show considerable independence among them (russell1977evidence). Categorical emotion states can be represented in the three-dimensional (VAD) emotion space. Based on these emotional dimensions, word-level VAD annotations of English words have been created (bradley1999affective; Warriner2013). Recently, a large-scale VAD annotation of English words was developed (mohammad-2018-obtaining); we leverage these annotation scores to learn sentence-level VAD score prediction from datasets with categorical emotion annotations.

Emotional Distribution Learning. Instead of predicting multiple emotion labels from text, learning the emotion distribution itself from text has been proposed (deyu2016emotion). This approach maps text to an emotion distribution and the respective intensities, incorporating Plutchik's wheel of emotions. Furthermore, distribution learning can be extended to emotion ranking (zhou2018relevant). Unlike these previous approaches, our model learns decomposed emotional distributions, namely the valence, arousal, and dominance distributions of emotions.

8 Discussion and Conclusions

We propose learning to predict VAD scores from text with categorical emotion annotations. Our framework predicts VAD score distributions for a given text rather than classification probabilities for each class, by minimizing the EMD between the predicted VAD distributions and the sorted label distributions, which serve as a proxy for the target VAD distributions.

Learning conditional VAD distributions enables predicting categorical emotion classes and continuous VAD scores simultaneously. By fine-tuning pre-trained BERT-Large on SemEval, our approach shows comparable performance on the categorical emotion classification task and significant positive correlations with the target VAD scores, even without supervision from VAD scores. If our model continues supervised training on VAD labels, it outperforms state-of-the-art VAD regression models. The ablation study suggests that this is due both to the strength of the multi-layer Transformer architecture and to the effectiveness of initializing VAD score prediction from our model. We further find nearest-neighbor words for the VAD scores predicted by our model, which can be regarded as automatically generating, for an input sentence, emotion labels that were not seen at training time.

We hope our framework will help researchers build human-annotated sentence-level VAD emotion datasets by providing machine-annotated VAD scores as a starting point, or serve directly as a VAD score prediction model. Most languages other than English do not have corpora with VAD annotations, so our model could help build multilingual resources from multilingual corpora with categorical emotion labels (ohman-etal-2018-creating). Further work will focus on developing a model that gives more sensible VAD scores without VAD annotations.