Toward Dimensional Emotion Detection
from Categorical Emotion Annotations
Abstract
We propose a framework that trains a model to predict fine-grained dimensional emotions (valence-arousal-dominance, VAD) from a corpus annotated with coarse-grained categorical emotions. We train the model by minimizing the Earth Mover's Distance (EMD) between the predicted VAD score distributions and the categorical emotion distributions sorted by VAD scores, which serve as a proxy for the target VAD score distributions. The resulting model can simultaneously classify a given sentence into categorical emotions and predict its VAD scores. We fine-tune pre-trained BERT-Large on the SemEval dataset (11 categorical emotions) and evaluate on EmoBank (VAD dimensional emotions). Our approach reaches performance comparable to that of state-of-the-art classifiers on the categorical emotion classification task and yields significant positive correlations with ground-truth VAD scores. Moreover, if training continues with supervision from VAD labels, the model outperforms state-of-the-art VAD regression models. We further present examples showing that our model can annotate a given text with suitable emotional words even when those words were not seen as categorical labels during training.
Sungjoon Park^{1}, Jiseon Kim^{1}, Jaeyeol Jeon, Heeyoung Park^{2}, Alice Oh^{1}
^{1} School of Computing, KAIST, Republic of Korea
^{2} Department of Psychology, Seoul National University, Republic of Korea
1 Introduction
Humans feel and express complex emotions beyond the basic emotions (ekman1992argument; plutchik2001nature) on a daily basis. To represent these varied emotions systematically, a dimensional emotion model such as the Valence-Arousal-Dominance (VAD) model is commonly used (russell1977evidence). This model maps emotional states to an orthogonal three-dimensional VAD space, so that various emotions can be projected into the space with measurable distances from one another. Since a dimensional model represents an emotion as a real-valued vector in this space, it is more likely to account for subtle emotional expressions than categorical models, which employ a finite number of basic emotions. With dimensional VAD models, capturing fine-grained emotions could benefit clinical natural language processing (NLP) research (Desmet:2013:EDS:2506578.2506869; sahana2015automatic), research on emotion regulation as psychotherapy (doi:10.1177/1754073917742706), and other work in computational social science that deals with subtle emotion recognition (buechel2016emotion).
Therefore, building a dimensional emotion detection model from an annotated corpus would be highly useful. However, such annotated resources are surprisingly scarce. Few corpora have full VAD annotations (buechel2017emobank), and others have only valence and arousal (VA) annotations (preotiucpietroetal2016modelling; yuetal2016building). One could build such a resource by labeling a corpus with best-worst scaling (kiritchenkomohammad2017best). Instead, we examine a novel way to predict dimensional emotion (VAD) scores from relatively common resources: corpora annotated with coarse-grained basic categorical emotions (scherer1994isear; alm2005tales; aman2007blogs; mohammad2012emotional; sintsova2013olymplex; li2017dailydialog; schuff2017ssec; shahraki2017cbet; SemEval2018Task1).
In this paper, we propose a framework to learn dimensional VAD scores from a corpus with categorical emotion labels. We demonstrate our idea by fine-tuning the pre-trained language model BERT (BERT) with our approach. Specifically, our model learns conditional VAD distributions through supervision from categorical emotion labels, and uses them to compute both VAD scores and categorical emotion labels for a given sentence.
In summary, our contributions are as follows:

• We propose a framework that enables learning to predict VAD scores from a corpus with categorical emotion annotations.
• Our model, trained only with categorical emotion labels, predicts VAD scores that show significant positive correlations with the corresponding ground-truth VAD scores.
• Our model can be fine-tuned further with supervision from VAD scores to outperform state-of-the-art dimensional emotion detection models.
2 Approach
Here we describe how we predict VAD scores for a given text from a model trained on a dataset with categorical emotion annotations.
Overview. The key idea is to train an emotion detection model to predict each of the VAD distributions conditioned on a given text, rather than to predict categorical emotion labels directly as conventional emotion classifiers do. This is possible even with only categorical emotion labels, because each categorical label also has VAD scores. One can therefore sort the labels along each VAD dimension to obtain (sparse) ground-truth conditional VAD distributions for a given text (Fig. 1a, 1b). A model can then be trained to predict VAD distributions by minimizing the distance between the predicted and ground-truth distributions. Such a model can not only predict VAD scores for regression (expectations of the predicted distributions, Fig. 1d) but also pick an emotion label from a given set of categorical labels for classification (argmax over emotion labels, Fig. 1c).
Model Architecture. (Fig. 1a) Formally, an emotion detection model is $P(e|X)$, where $e$ is an emotion drawn from a set of predefined categorical emotions, $e\in E=\{joy,happy,anger,sad,\dots\}$, and $X=\{x_{1},x_{2},\dots,x_{n}\}$ is a sequence of symbols $x_{i}$ representing an input text. In emotion classification, $e$ is usually represented as a one-hot vector.
Unlike classification models that learn $P(e|X)$ directly, we aim to learn the distribution of each of V, A, and D from pairs of input text $X$ and categorical labels. To this end, we map categorical emotion labels to the three-dimensional VAD space, $e=(v,a,d)$, using the NRC-VAD Lexicon (mohammad2018obtaining). For example, the emotion label "joy" is mapped to (0.980, 0.824, 0.794) and "sad" to (0.225, 0.333, 0.149) in the VAD space. Using these coordinates, our model predicts the following distribution:
$$P(e|X)=P(v,a,d|X)$$  (1)
Furthermore, since the dimensions of the VAD space are nearly independent (russell1977evidence), we assume that they are mutually independent. The joint distribution can then be decomposed into the product of three conditional distributions:
$$P(v,a,d|X)=P(v|X)\,P(a|X)\,P(d|X)$$  (2)
For each decomposed conditional distribution, one could use any trainable function with sufficient capacity to capture linguistic patterns in the input. As a demonstration, we use the pre-trained bidirectional language model BERT (BERT), which shows state-of-the-art performance on natural language understanding tasks when fine-tuned on task-specific datasets. For each conditional distribution, we stack a softmax or sigmoid activation layer over the hidden state corresponding to the [CLS] token in BERT.
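To make the difference between the two activation choices concrete, here is a minimal Python sketch; it is an illustration, not the authors' implementation. The logits are made-up numbers standing in for the output of a linear layer over the [CLS] hidden state, and `softmax` and `sigmoid` are plain-Python helpers written for this example.

```python
import math

def softmax(logits):
    """Normalize logits into a distribution that sums to one (single-label case)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each logit independently into (0, 1) (multi-label case)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Hypothetical logits for four valence-sorted labels (not taken from the paper).
valence_logits = [-1.0, 0.0, 2.0, 0.5]

p_single = softmax(valence_logits)   # entries sum to one
p_multi = sigmoid(valence_logits)    # entries are independent; the sum is unconstrained

print(round(sum(p_single), 6))  # 1.0
print(sum(p_multi) > 1.0)       # True
```

The softmax output always carries a total mass of one, whereas the sigmoid output generally does not; this matters for the EMD loss, which compares cumulative distributions.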
Model Training. (Fig. 1b) To train our model, we need target conditionals for each of $P(v|X)$, $P(a|X)$, and $P(d|X)$, derived from the categorical emotion labels. We simply sort the categorical emotions in $E$ by their V, A, and D scores respectively, based on the mapped VAD coordinates. For example, if the categorical labels are $E=\{joy,sad,happy,anger\}$ with corresponding valence scores (0.980, 0.225, 1.000, 0.167) in NRC-VAD (mohammad2018obtaining), then we reorder the labels to (anger, sad, joy, happy) and rearrange the corresponding one-hot labels accordingly to obtain the target conditional $P(v|X)$. In other words, by rearranging label positions in ascending order of valence score, the sorted one-hot labels can be treated as a proxy for the target conditional. We sort the labels by A and D to obtain the other conditionals in the same way. Note that these conditionals are sparse because we have only $|E|$ points along each VAD dimension.
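The sorting step can be sketched in a few lines of Python. This is a minimal illustration using the four valence scores quoted in the example above; the helper names `sorted_labels` and `sorted_target` are hypothetical and not from the paper.

```python
# Valence scores for the four-label example in the text (NRC-VAD values).
VALENCE = {"joy": 0.980, "sad": 0.225, "happy": 1.000, "anger": 0.167}

def sorted_labels(scores):
    """Return the labels ordered by ascending score."""
    return sorted(scores, key=scores.get)

def sorted_target(active, scores):
    """Build the 0/1 target vector over the sorted labels.

    `active` is the set of labels annotated for a text: one label in the
    single-label case, possibly several in the multi-label case.
    """
    return [1.0 if label in active else 0.0 for label in sorted_labels(scores)]

order = sorted_labels(VALENCE)
target_v = sorted_target({"joy"}, VALENCE)

print(order)     # ['anger', 'sad', 'joy', 'happy']
print(target_v)  # [0.0, 0.0, 1.0, 0.0]
```

The same two functions, applied to arousal and dominance scores, would give the targets for the other two dimensions.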
Next, we minimize the distance between the true and predicted $P(\cdot|X)$. Because the labels are sorted, the classes are ordered, and this order should be taken into account during optimization. We therefore minimize the squared Earth Mover's Distance (EMD) loss (hou2017squared) between the true and predicted $P(\cdot|X)$. The EMD loss is as follows:
$$EMD(\mathbf{p},\widehat{\mathbf{p}})=\sum_{i=1}^{C}{\left(CDF_{i}(\mathbf{p})-CDF_{i}(\widehat{\mathbf{p}})\right)}^{2}$$  (3)
where $\mathbf{p}$ is a true conditional and $\widehat{\mathbf{p}}$ is a predicted conditional. This loss is designed to account for the distance between classes in an ordered classification problem: it penalizes the model more heavily when it places probability on a class far from the correct one. It sums, over the $C$ ordered classes, the squared difference between the cumulative distribution functions of $\mathbf{p}$ and $\widehat{\mathbf{p}}$.
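As a worked illustration of Eq. 3, the following Python sketch computes the squared EMD between two distributions over ordered classes. It is a plain-Python sketch under the assumption that both inputs are lists of equal length and equal total mass; it is not the training code.

```python
from itertools import accumulate

def squared_emd(p, p_hat):
    """Sum of squared differences between the two cumulative distributions."""
    cdf_p = accumulate(p)          # running sums play the role of CDF_i(p)
    cdf_p_hat = accumulate(p_hat)  # running sums play the role of CDF_i(p_hat)
    return sum((a - b) ** 2 for a, b in zip(cdf_p, cdf_p_hat))

target = [0.0, 0.0, 1.0, 0.0]  # correct class is the third of four sorted labels
near = [0.0, 0.0, 0.0, 1.0]    # all mass one position away from the target
far = [1.0, 0.0, 0.0, 0.0]     # all mass two positions away from the target

print(squared_emd(target, near))  # 1.0
print(squared_emd(target, far))   # 2.0
```

A cross-entropy loss would treat `near` and `far` as equally wrong, while the squared EMD grows with the distance between the predicted and the correct position in the sorted order.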
Note that Eq. 3 assumes that $\mathbf{p}$ and $\widehat{\mathbf{p}}$ have the same total probability mass. In the single-label case, i.e., when each text is annotated with exactly one categorical emotion label, this is satisfied: the one-hot target $\mathbf{p}$ sums to one, and $\widehat{\mathbf{p}}$ is the output of a softmax layer, which always sums to one. In the multi-label case, however, the assumption is violated, because a sigmoid activation layer is generally used to represent the probability of each class independently. We therefore modify Eq. 3 slightly to satisfy the assumption, defining the inter-class EMD loss as follows:
$$EMD_{inter}(\mathbf{p},\widehat{\mathbf{p}})=\sum_{i=1}^{C}{\left(CDF_{i}(\langle\mathbf{p}\rangle)-CDF_{i}(\langle\widehat{\mathbf{p}}\rangle)\right)}^{2}$$  (4)
where $\langle\mathbf{p}\rangle$ and $\langle\widehat{\mathbf{p}}\rangle$ are $\mathbf{p}$ and $\widehat{\mathbf{p}}$ normalized by dividing each by the sum of its entries. We also introduce an intra-class EMD loss:
$$EMD_{intra}(\mathbf{p}_{\mathbf{c}},\widehat{\mathbf{p}_{\mathbf{c}}})=\sum_{i=1}^{C}{\left(CDF_{i}(\mathbf{p}_{\mathbf{c}})-CDF_{i}(\widehat{\mathbf{p}_{\mathbf{c}}})\right)}^{2}$$  (5)
where $\mathbf{p}_{\mathbf{c}}$ is the true $(p,1-p)$ and $\widehat{\mathbf{p}_{\mathbf{c}}}$ is the predicted $(p,1-p)$ for class $c$. Finally, we use the following EMD loss in the multi-label case:
$$EMD(\mathbf{p},\widehat{\mathbf{p}})=EM{D}_{inter}+EM{D}_{intra}$$  (6) 
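The multi-label loss can be sketched as follows. This is a hedged illustration: the text does not state how the intra-class term is aggregated over classes, so the sketch assumes it is summed, and it also assumes that at least one entry of each vector is non-zero so that normalization is defined. All function names are hypothetical.

```python
from itertools import accumulate

def squared_emd(p, q):
    """Squared EMD between two vectors of equal total mass (Eq. 3)."""
    return sum((a - b) ** 2 for a, b in zip(accumulate(p), accumulate(q)))

def normalize(p):
    """Divide each entry by the total so the vector sums to one."""
    total = sum(p)  # assumed non-zero in this sketch
    return [x / total for x in p]

def emd_inter(p, p_hat):
    """Inter-class term (Eq. 4): squared EMD between the normalized vectors."""
    return squared_emd(normalize(p), normalize(p_hat))

def emd_intra(p, p_hat):
    """Intra-class term (Eq. 5), summed over classes (an assumption)."""
    return sum(
        squared_emd([t, 1.0 - t], [y, 1.0 - y]) for t, y in zip(p, p_hat)
    )

def emd_multilabel(p, p_hat):
    """Multi-label loss (Eq. 6): inter-class plus intra-class term."""
    return emd_inter(p, p_hat) + emd_intra(p, p_hat)

target = [0.0, 1.0, 1.0, 0.0]  # two active labels among four sorted labels
pred = [0.0, 0.5, 0.5, 0.0]    # right positions, but only half-confident

print(emd_inter(target, pred))       # 0.0
print(emd_multilabel(target, pred))  # 0.5
```

In this example the inter-class term is zero because normalization hides the low confidence; the intra-class term is what penalizes the gap between 0.5 and 1.0 for each active label.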
Finally, we minimize the sum of the three squared EMD losses between the target and predicted distributions, one for each VAD dimension:
$$l=EMD(\mathbf{v},\widehat{\mathbf{v}})+EMD(\mathbf{a},\widehat{\mathbf{a}})+EMD(\mathbf{d},\widehat{\mathbf{d}})$$  (7) 
where $\mathbf{v}$, $\mathbf{a}$, $\mathbf{d}$ denote target and $\widehat{\mathbf{v}}$, $\widehat{\mathbf{a}}$, $\widehat{\mathbf{d}}$ predicted conditional distributions.
Predicting Categorical Emotion Labels. (Fig. 1c) Based on the model's predicted VAD distributions, we can pick an emotion label from a given set $E$, as conventional emotion classifiers do. By taking the product of the predicted $p(v|X)$, $p(a|X)$, and $p(d|X)$, we obtain the predicted $p(v,a,d|X)$ under the conditional independence assumption. We can then pick an emotion label $e\in E$ as follows:
$$\underset{(v,a,d)=e\in E}{\arg\max}\,P(v,a,d|X)$$  (8)
Since we have only $|E|$ given emotion labels, we compare the joint probabilities at $(v,a,d)=e\in E$ and pick the single label with the maximum probability (single-label case, Eq. 8) or all labels whose probability exceeds a threshold (multi-label case). The threshold is a hyperparameter of the model, set to 0.125 ($=0.5^{3}$).
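The label-picking rule can be sketched as below. The VAD coordinates for "joy" and "sad" and the valence of "anger" are quoted from the text; the arousal and dominance values for "anger" are illustrative placeholders, and the predicted distributions are made-up numbers. The helper names are hypothetical.

```python
VAD = {
    "joy": (0.980, 0.824, 0.794),    # quoted in the text
    "sad": (0.225, 0.333, 0.149),    # quoted in the text
    "anger": (0.167, 0.900, 0.600),  # valence quoted; A and D are placeholders
}

def rank_positions(dim):
    """Map each label to its index after sorting labels by one VAD dimension."""
    order = sorted(VAD, key=lambda label: VAD[label][dim])
    return {label: i for i, label in enumerate(order)}

# One position table per dimension (0 = V, 1 = A, 2 = D).
POS = [rank_positions(dim) for dim in range(3)]

def joint_scores(p_v, p_a, p_d):
    """Multiply the three per-dimension probabilities at each label's positions."""
    preds = (p_v, p_a, p_d)
    scores = {}
    for label in VAD:
        prob = 1.0
        for dim in range(3):
            prob *= preds[dim][POS[dim][label]]
        scores[label] = prob
    return scores

def predict_single(p_v, p_a, p_d):
    """Single-label case: the label with the largest joint probability."""
    scores = joint_scores(p_v, p_a, p_d)
    return max(scores, key=scores.get)

def predict_multi(p_v, p_a, p_d, threshold=0.125):
    """Multi-label case: every label whose joint probability exceeds the threshold."""
    scores = joint_scores(p_v, p_a, p_d)
    return sorted(label for label, s in scores.items() if s > threshold)

# Hypothetical per-dimension outputs, each aligned with its own sorted order.
p_v, p_a, p_d = [0.1, 0.2, 0.9], [0.1, 0.8, 0.3], [0.1, 0.2, 0.9]

print(predict_single(p_v, p_a, p_d))  # joy
print(predict_multi(p_v, p_a, p_d))   # ['joy']
```

Note that each dimension has its own label order, so the same label generally sits at a different index in each of the three predicted vectors.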
Predicting Continuous VAD Scores. (Fig. 1d) We can further compute the expectations of the predicted conditionals $p(v|X)$, $p(a|X)$, and $p(d|X)$ to predict continuous VAD scores:
$$v_{X}=\mathbb{E}(P(v|X)),\quad a_{X}=\mathbb{E}(P(a|X)),\quad d_{X}=\mathbb{E}(P(d|X))$$  (9)
Once again, we use the VAD scores from (mohammad2018obtaining) for each dimension when computing the expectations. This allows us to predict continuous VAD scores from a model trained only on categorical emotion annotations.
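A small worked example of the expectation in Eq. 9, using the four valence scores from the sorting example. The predicted distribution is a made-up softmax output, and the sketch simply computes the probability-weighted sum of lexicon scores; whether multi-label outputs are normalized first is not stated in the text, so no normalization is applied here.

```python
# Valence scores for the four-label example (NRC-VAD values quoted in the text).
VALENCE = {"joy": 0.980, "sad": 0.225, "happy": 1.000, "anger": 0.167}

def expected_score(probs, scores):
    """Probability-weighted sum of lexicon scores.

    `probs` must be aligned with the labels sorted by ascending score,
    which is the order used for the training targets.
    """
    ordered_scores = sorted(scores.values())
    return sum(p * s for p, s in zip(probs, ordered_scores))

# Hypothetical predicted P(v|X) over (anger, sad, joy, happy).
p_v = [0.1, 0.2, 0.3, 0.4]

v_hat = expected_score(p_v, VALENCE)
print(round(v_hat, 4))  # 0.7557
```

Because the expectation is a weighted average of the label scores, it can take any value between the smallest and largest lexicon score rather than only the |E| discrete values.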
3 Experiments
In this section, we describe our experimental setup. Throughout these experiments, we focus on demonstrating that our approach can effectively predict continuous emotional dimensions (VAD scores) using only categorical emotion labels.
3.1 Dataset
We use three datasets consisting of text and corresponding emotion annotations. Two of them have categorical emotion labels, and the third is a VAD-annotated corpus.
SemEval 2018 E-c (SemEval). A multi-label corpus annotated with categorical emotions, containing 10,983 tweets with labels for the presence or absence of 11 emotions (SemEval2018Task1). We refer to this dataset as SemEval hereafter.
ISEAR. A single-label corpus annotated with categorical emotions, containing 7,666 sentences. Each sentence is labeled with exactly one of 7 categorical emotions (scherer1994isear).
EmoBank. Sentences paired with continuous VAD scores as labels. This corpus contains 10,062 sentences collected across 6 domains and 2 perspectives. Each sentence has three scores representing V, A, and D, each in the range 1 to 5. Unless otherwise noted, we use the weighted average of VAD scores as the ground-truth scores, as recommended by the EmoBank authors (buechel2017emobank).
3.2 Predicting Categorical Emotion Labels.
We examine the classification performance of our approach and compare it with state-of-the-art emotion classification models. We use accuracy, macro F1 score, and micro F1 score as evaluation metrics.
MT-CNN. A convolutional neural network for text classification trained with multi-task learning (zhang2018text). The model jointly learns classification labels and the emotion distribution of a given text. The emotion distribution represents the multiple emotions in a given sentence as normalized counts of affective terms extracted with emotion lexicons. The model reaches state-of-the-art classification accuracy and F1 score on ISEAR.
NTUA-SLP. A classification model using deep self-attention layers over Bi-LSTM hidden states. The model is pre-trained on general tweets and ‘SemEval 2017 task 4A’, then fine-tuned on all ‘SemEval 2018 subtasks’ in order to transfer the learned knowledge to each subtask (baziotis2018ntua). The model took first place in the multi-label emotion classification task on the SemEval dataset.
BERT-Large (Classification). A pre-trained bidirectional language model based on a stack of Transformers (46201). The model shows state-of-the-art performance on various natural language understanding tasks after fine-tuning on task-specific datasets (BERT). We add a linear transformation layer on top of BERT, with sigmoid activation for training on the multi-label dataset (SemEval) or softmax activation for the single-label dataset (ISEAR). Like conventional text classifiers, these models are optimized by minimizing the cross-entropy loss between the predicted distributions and the one-hot labels.
3.3 Predicting Continuous VAD scores.
Next, we investigate the VAD score prediction performance of our approach and compare it with state-of-the-art VAD regression models. Since the training objectives of the models vary, we use Pearson's correlation coefficient between a model's VAD predictions and the ground-truth scores as the evaluation metric.
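One reason a correlation metric suits this comparison is that Pearson's r is unchanged by a positive affine rescaling of either variable, so predictions on a 0 to 1 lexicon scale can be compared directly with EmoBank scores on a 1 to 5 scale. The sketch below is a plain-Python implementation for illustration, with made-up numbers; it assumes neither input is constant.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)  # undefined if either input is constant

# Hypothetical predictions on a 0-1 scale and gold scores on a 1-5 scale.
pred = [0.2, 0.4, 0.6, 0.8]
gold_rescaled = [1.0 + 4.0 * p for p in pred]  # same ranking, different scale
gold_noisy = [2.0, 2.1, 3.9, 3.5]

print(round(pearson_r(pred, gold_rescaled), 6))  # 1.0
print(pearson_r(pred, gold_noisy) > 0)           # True
```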
3.3.1 Zero-shot Predictions
We refer to the following two results as zero-shot prediction performance because these models are not trained on EmoBank; that is, they are trained without supervision from any VAD score labels. These models use the entire EmoBank as an evaluation set. We focus on these results since our aim is to predict VAD scores with a model trained on a corpus annotated with categorical emotion labels.
BERT-Large (Ours, SemEval). We compute VAD score predictions using Eq. 9 from our model trained on SemEval, which is the same model used for predicting categorical emotion labels.
BERT-Large (Ours, ISEAR). As with the model above, we compute VAD scores from our model trained on ISEAR.
3.3.2 Predictions after Supervised Learning
Unlike the models above, the following are trained by supervised learning on the VAD score labels in EmoBank. These results put the zero-shot prediction performance in context and show how much the zero-shot model can be improved when VAD annotations are available.
AAN. An Adversarial Attention Network for dimensional emotion regression, which learns to discriminate VAD dimension scores (zhuetal2019adversarial). Pearson correlations between predicted and ground-truth VAD scores on EmoBank are reported. Note that the scores are reported separately for the 2 perspectives and 6 domains, so we use the highest VAD correlations among the perspectives and domains for comparison.
Ensemble. A multi-task ensemble of neural networks that learns to predict VAD scores, sentiment, and their intensity simultaneously (8756111). The model was recently shown to be effective for VAD regression.
SRV-SLSTM. A model that predicts VAD scores with variational autoencoders trained by semi-supervised learning, and that shows state-of-the-art performance on the VAD score prediction task (wu2019semi). The model performs best when using 40% of the labeled EmoBank data, so we compare our model's performance with those scores.
BERT-Large (Ours, EB$\leftarrow$SemEval). We further fine-tune our BERT-Large (Ours, SemEval) model on the EmoBank dataset. We split EmoBank into training, validation, and test sets with a ratio of 6:2:2, then train the model and report the correlation between predicted and ground-truth VAD scores on the test set. Specifically, we remove the final linear layer with softmax or sigmoid activation used for training with categorical labels and add a new linear layer with ReLU activation for VAD score prediction. All parameters are then fine-tuned by minimizing the mean squared error (MSE) loss between the predicted and ground-truth VAD scores. With this model, we investigate the effectiveness of our approach as a parameter initialization strategy for VAD regression when VAD annotations are available.
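For readers who want to reproduce the data handling, the 6:2:2 split and the MSE objective can be sketched as follows. The text does not say how the split was drawn, so the shuffling, the seed, and the function names are assumptions made for this illustration.

```python
import random

def split_indices(n, seed=0):
    """Shuffle range(n) and split it 6:2:2 into train/valid/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded shuffle (an assumption)
    n_train = n * 6 // 10
    n_valid = n * 2 // 10
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

def mse(pred, gold):
    """Mean squared error between predicted and ground-truth scores."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred)

# EmoBank contains 10,062 sentences according to the dataset description.
train, valid, test = split_indices(10062)
print(len(train), len(valid), len(test))  # 6037 2012 2013

# Hypothetical valence predictions against gold scores on the 1-5 scale.
print(round(mse([3.0, 2.5, 3.5], [3.0, 3.0, 3.0]), 4))  # 0.1667
```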
3.4 Experimental Details.
In all experiments, we use the BERT-Large uncased model (https://tfhub.dev/google/bert_uncased_L24_H1024_A16/1). We set the learning rate to 2e-5 with a warmup period of 3 epochs and the batch size to 64, and we stop fine-tuning all of the layers when the validation loss is minimized. We use a single TPU for optimization, and all fine-tuning runs converged within 10 epochs, taking an hour.
4 Results
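Table 1: VAD regression on EmoBank (Pearson's r) and classification on SemEval 2018 E-c and ISEAR.

| Model | Scheme | EmoBank V (r) | EmoBank A (r) | EmoBank D (r) | SemEval Macro F1 | SemEval Micro F1 | SemEval Acc. | ISEAR Macro F1 | ISEAR Micro F1 |
|---|---|---|---|---|---|---|---|---|---|
| MT-CNN (zhang2018text) | - | - | - | - | - | - | - | - | 0.668 |
| NTUA-SLP (baziotis2018ntua) | - | - | - | - | 0.528 | 0.701 | 0.588 | - | - |
| BERT-Large (Classification, ep3) | - | - | - | - | 0.534 | 0.697 | 0.572 | 0.704 | 0.700 |
| BERT-Large (Ours, SemEval) | Zero-shot | 0.659 | 0.327 | 0.287 | 0.500 | 0.695 | 0.572 | - | - |
| BERT-Large (Ours, ISEAR) | Zero-shot | 0.502 | 0.069 | 0.236 | - | - | - | 0.695 | 0.688 |
| AAN (zhuetal2019adversarial) | Supervised | 0.424 | 0.352 | 0.265 | - | - | - | - | - |
| Ensemble (8756111) | Supervised | 0.635 | 0.375 | 0.277 | - | - | - | - | - |
| SRV-SLSTM (wu2019semi) | Semi-supervised | 0.620 | 0.508 | 0.333 | - | - | - | - | - |
| BERT-Large (Ours, EB$\leftarrow$SemEval) | Supervised | 0.765 | 0.583 | 0.416 | - | - | - | - | - |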
We now present our experimental results. First, we discuss the zero-shot VAD score prediction results of our models, and then compare them with those of supervised models. We also report the classification performance of our model and the comparison models.
Zero-Shot VAD Score Predictions. The results are shown in Table 1. When our model is trained on SemEval and tested on EmoBank, the predicted VAD scores show significant positive Pearson correlations with the target VAD scores in EmoBank. The correlation is highest for valence (V) ($r=.659$, $p<.001$), followed by arousal (A) ($r=.327$, $p<.001$) and dominance (D) ($r=.287$, $p<.001$). For our model trained on the ISEAR dataset, the scores also show significant positive Pearson's $r$: the correlation is highest for V ($r=.502$, $p<.001$), followed by D ($r=.236$, $p<.001$) and A ($r=.069$, $p<.001$).
The correlations for the SemEval-trained model are higher than those for the ISEAR-trained model in all dimensions. We attribute this to the emotion labels in SemEval carrying more information than those in ISEAR. First, SemEval has 11 categorical emotion labels whereas ISEAR has 7. A larger number of labels yields less sparse VAD target distributions, so the model can distinguish degrees of VAD more easily. Second, a sentence in SemEval can have multiple emotion labels, whereas a sentence in ISEAR has only one. Multiple labels make the possible range of expected VAD scores much wider than a single label does. If a sentence always has a single label, the predicted VAD distribution must sum to one. With multiple labels, the distribution can have a much larger sum, which widens the range of expected values and helps the model distinguish the degree of each VAD dimension for a given sentence.
Note that the correlation in the A dimension for ISEAR is low. The standard deviation of the arousal scores of the ISEAR labels (‘anger’, ‘disgust’, ‘fear’, ‘sadness’, ‘shame’, ‘joy’, ‘guilt’) is lower (.191) than that of the other dimensions (V: .328, D: .237), and it drops much further, to .105, when the single label ‘sadness’ is removed. This makes it difficult for the model to differentiate labels by degree of arousal, leading to a lower correlation with the target scores in the A dimension.
Comparison to VAD Predictions of Supervised Models. The three comparison models (AAN, Ensemble, SRV-SLSTM) in Table 1 are trained with supervision from VAD scores. Since our model trained on SemEval performs better than the one trained on ISEAR, we hereafter compare the SemEval-trained scores with those of the comparison models.
Among those models, Ensemble shows the highest correlation in the V dimension (.635), and SRV-SLSTM reaches the highest correlations in the A (.508) and D (.333) dimensions. We highlight that our model trained on SemEval shows an even better correlation in the V dimension (.659) without any supervision from VAD score labels. Its correlation in A (.327) is slightly lower than those of AAN (.352) and Ensemble (.375), and its correlation in D (.287) is comparable to that of Ensemble (.277). Overall, the zero-shot prediction performance is fairly comparable with that of state-of-the-art models.
Furthermore, we present the result of another model of ours, which is trained on SemEval and then fine-tuned on the training set of the EmoBank corpus with VAD score labels. If we continue training our model with supervision from VAD labels, it outperforms all of the state-of-the-art models by a large margin. The VAD fine-tuned model shows significant correlations in all of the V ($r=.765$, $p<.001$), A ($r=.583$, $p<.001$), and D ($r=.416$, $p<.001$) dimensions. These are improvements of +.130, +.075, and +.083 in correlation over the best prior results for the V, A, and D dimensions, respectively.
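The reported margins can be checked directly against the Table 1 values: for each dimension, take the best correlation among the three comparison models and subtract it from that of the fine-tuned model. The short Python check below uses only numbers from Table 1; the variable names are illustrative.

```python
# Pearson correlations on EmoBank, copied from Table 1.
baselines = {
    "AAN": {"V": 0.424, "A": 0.352, "D": 0.265},
    "Ensemble": {"V": 0.635, "A": 0.375, "D": 0.277},
    "SRV-SLSTM": {"V": 0.620, "A": 0.508, "D": 0.333},
}
ours_finetuned = {"V": 0.765, "A": 0.583, "D": 0.416}

# Best comparison score per dimension, and the model that achieves it.
best = {
    dim: max(baselines, key=lambda name: baselines[name][dim])
    for dim in ours_finetuned
}
gain = {
    dim: round(ours_finetuned[dim] - baselines[best[dim]][dim], 3)
    for dim in ours_finetuned
}

print(best)  # {'V': 'Ensemble', 'A': 'SRV-SLSTM', 'D': 'SRV-SLSTM'}
print(gain)  # {'V': 0.13, 'A': 0.075, 'D': 0.083}
```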
Categorical Label Classification. Next, we report the classification performance of our model and the comparison models. On SemEval, fine-tuning BERT as a conventional classifier (BERT-Large, Classification) yields a higher macro F1 score (.534) than NTUA-SLP, with a comparable micro F1 score (.697) and multi-label accuracy (.572). Fine-tuning BERT on ISEAR shows similar results: the BERT classifier outperforms MT-CNN with a higher micro F1 score (.700).
Our model also shows classification performance comparable to that of the comparison models. On ISEAR, our model achieves a macro F1 score of .688, which is higher than that of MT-CNN. On SemEval, however, our model performs slightly worse than NTUA-SLP.
5 Ablation Study
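Table 2: Ablation results for VAD prediction on EmoBank (Pearson's r).

| Model | Scheme | V (r) | A (r) | D (r) |
|---|---|---|---|---|
| 1. BERT (Ours, SemEval) | Zero-shot | 0.659 | 0.327 | 0.287 |
| 2. BERT (Random Init., EB) | Supervised | 0.600 | 0.536 | 0.344 |
| 3. BERT (Ours, EB$\leftarrow$SemEval) | Supervised | 0.765 | 0.583 | 0.416 |
| 4. BERT (Regression, EB) | Supervised | 0.787 | 0.632 | 0.498 |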
We further conduct an ablation study to investigate our model's VAD prediction performance. Since we use pre-trained BERT and fine-tune it on different datasets, the effects of pre-training and fine-tuning should be separated to understand the source of the improvements.
In Table 2, we present four models for the ablation study, all of which share the same neural network architecture (BERT-Large) to control for the size and structure of the model. Model 1 is our model trained on SemEval, and Model 3 is fine-tuned on EmoBank starting from the trained weights of Model 1. This is equivalent to continuing to train Model 1 with supervision from EmoBank labels. Model 2 uses the BERT architecture with all weights randomly initialized, i.e., without the pre-trained language model weights, and is then trained on EmoBank. Lastly, Model 4 fine-tunes BERT directly on EmoBank VAD labels, starting from the pre-trained language model weights.
As shown in Table 2, Model 2 is already comparable to the state-of-the-art VAD prediction models in Table 1. Specifically, Model 2 outperforms SRV-SLSTM in the A and D dimensions. In the V dimension, Model 2 underperforms Model 1 and SRV-SLSTM. Overall, this indicates that the multi-layer Transformer architecture is effective for VAD score regression even without any pre-trained knowledge. We also see a further improvement with Model 3, which means that initializing the model with our approach is better than starting training from random weights.
Note that Model 4 performs best in all of the V ($r=.787$, $p<.001$), A ($r=.632$, $p<.001$), and D ($r=.498$, $p<.001$) dimensions. This indicates that the pre-trained bidirectional language model weights are a better initialization than our model. A likely reason is that Model 1 has already been fine-tuned to predict VAD distributions from categorical emotion labels, which results in forgetting some of the general linguistic representations of pre-trained BERT. Starting training from a general representation of text thus appears to allow better VAD score prediction than starting from representations trained on categorical emotion labels. This might be partly due to a suboptimal strategy for fine-tuning an already fine-tuned model. However, that is beyond the scope of this work, and we plan to investigate how to fine-tune a fine-tuned model effectively in future work.
6 Qualitative Examples
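Table 3: Example predictions from our model trained on SemEval. Cells that could not be recovered are marked with "…".

| Tweet | Categorical label | Nearest neighbors from VAD scores |
|---|---|---|
| … | joy, optimism | … |
| … | anger, disgust | … |
| you begin to irritate me, primitive | anger, disgust | … |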
In Table 3, we show example predictions from our model trained on SemEval. The table presents annotated tweets from the SemEval test set, the corresponding predicted categorical labels, and the top 5 nearest-neighbor emotional words with respect to the predicted VAD scores. For these 5 tweets, our model correctly predicted the categorical emotion labels. We now describe how we find the nearest-neighbor words from the VAD scores.
Given the VAD scores predicted by our model, we find the nearest-neighbor words for those scores using the NRC-VAD Lexicon (mohammad2018obtaining). We first rescale our model's predicted VAD scores to the range 0 to 1 in each VAD dimension, since the lexicon values range from 0 to 1. To do this, we predict VAD scores for every sentence in the SemEval test set and then rescale the scores by $(x-\min(x))/(\max(x)-\min(x))$, which makes the scores in all dimensions range from 0 to 1.
Next, we find the nearest-neighbor words using the rescaled VAD values. We compute the Euclidean distances between these values and all words in the NRC-VAD Lexicon and pick the 5 words with the smallest distances. We present these words in the right column of Table 3. They help us understand the VAD scores more intuitively, and they can also be regarded as automatically generated emotional annotations for a given sentence. In other words, by finding nearest-neighbor words in the VAD space, our model can predict emotion labels that were not seen during training.
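The rescaling and nearest-neighbor lookup can be sketched as follows. The lexicon here is a three-entry stand-in: the "joy" and "sad" coordinates are the ones quoted earlier in the text, while `placeholder_word` and all predicted scores are hypothetical. The sketch assumes the scores in each dimension are not all identical, since min-max rescaling is otherwise undefined.

```python
import math

def min_max_rescale(values):
    """Rescale a list of scores to the range 0 to 1: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(x - lo) / (hi - lo) for x in values]

def nearest_words(vad, lexicon, k=5):
    """Return the k lexicon words closest to `vad` in Euclidean distance."""
    return sorted(lexicon, key=lambda word: math.dist(vad, lexicon[word]))[:k]

# Tiny stand-in for the NRC-VAD Lexicon.
LEXICON = {
    "joy": (0.980, 0.824, 0.794),          # quoted in the text
    "sad": (0.225, 0.333, 0.149),          # quoted in the text
    "placeholder_word": (0.5, 0.5, 0.5),   # hypothetical entry
}

# Hypothetical raw valence predictions over a three-sentence test set.
raw_valence = [1.0, 2.0, 3.0]
print(min_max_rescale(raw_valence))  # [0.0, 0.5, 1.0]

# A hypothetical rescaled (V, A, D) prediction for one sentence.
query = (0.9, 0.8, 0.7)
print(nearest_words(query, LEXICON, k=2))  # ['joy', 'placeholder_word']
```

With the full lexicon and `k=5`, the same lookup would return the five words listed in the right column of Table 3 for each tweet.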
The five examples in Table 3 show that our model can predict categorical emotion labels and also find suitable emotional words for a given sentence. In particular, for the fifth tweet, our model annotated the sentence with depressive words (hopelessness, dead), so the approach might be extended to detecting warning signs from people in need on social media.
7 Related Work
VAD Dimensions of Emotions. Models of emotion representation have a long history in psychology. The categorical model of emotion assumes that discrete categories, represented by emotion words, are the building blocks of human emotion. Supporting evidence includes the six basic emotions (ekman1992argument) and findings on universally adaptive emotions (plutchik1980general). Alternatively, the dimensional model of emotion seeks to explain how people conceptualize emotional feelings. osgood1957measurement suggested the initial ideas of emotion coordinates. russell1977evidence further constructed the Pleasure (or Valence)-Arousal-Dominance (PAD, VAD) model, a semantic scale model for rating emotional states, which represents an emotional state as a set of orthogonal coordinates on the VAD dimensions. The absolute values of the intercorrelations among the three scales show considerable independence among the scales (russell1977evidence). Categorical emotion states can thus be represented in the three-dimensional (VAD) emotion space. Based on these emotional dimensions, word-level VAD annotations of English words have been created (bradley1999affective; Warriner2013). Recently, a large-scale VAD annotation of English words was developed (mohammad2018obtaining), and we leverage these annotation scores to predict sentence-level VAD scores when training on datasets with categorical emotion annotations.
Emotional Distribution Learning. Instead of predicting multiple emotion labels from text, learning the emotion distribution itself from text has been proposed (deyu2016emotion). This approach maps text to an emotion distribution with respective intensities, incorporating Plutchik's wheel of emotions. Distribution learning can also be extended to emotion ranking (zhou2018relevant). Unlike these previous approaches, our model learns decomposed emotion distributions, namely the valence, arousal, and dominance distributions of emotions.
8 Discussion and Conclusions
We propose learning to predict VAD scores from text with categorical emotion annotations. Rather than classification probabilities for each class, our framework predicts VAD score distributions for a given text by minimizing the EMD between the predicted VAD distributions and the sorted label distributions, which serve as a proxy for the target VAD distributions.
Learning conditional VAD distributions enables predicting categorical emotion classes and continuous VAD scores simultaneously. By fine-tuning pre-trained BERT-Large on SemEval, our approach shows comparable performance on the categorical emotion classification task and significant positive correlations with target VAD scores, even without supervision from VAD scores. If our model continues supervised training on VAD labels, it outperforms state-of-the-art VAD regression models. The ablation study shows that this is due to the strength of the multi-layer Transformer architecture as well as the effectiveness of initializing VAD score prediction from our model. We further find nearest-neighbor words from the VAD scores predicted by our model, which can be regarded as automatically generated emotion labels, unseen during training, for the corresponding input sentence.
We hope our framework will help researchers build human-annotated sentence-level VAD emotion datasets by providing machine-annotated VAD scores as a starting point, or serve directly as a VAD score prediction model. Most languages other than English lack corpora with VAD annotations, so our model could help build multilingual resources from multilingual corpora with categorical emotion labels (ohmanetal2018creating). Future work will focus on developing a model that gives more sensible VAD scores without VAD annotations.