Emotional Neural Language Generation Grounded in Situational Contexts

  • 2019-11-25 19:01:36
  • Sashank Santhanam, Samira Shaikh
  • 2

Abstract

Emotional language generation is one of the keys to human-like artificialintelligence. Humans use different type of emotions depending on the situationof the conversation. Emotions also play an important role in mediating theengagement level with conversational partners. However, current conversationalagents do not effectively account for emotional content in the languagegeneration process. To address this problem, we develop a language modelingapproach that generates affective content when the dialogue is situated in agiven context. We use the recently released Empathetic-Dialogues corpus tobuild our models. Through detailed experiments, we find that our approachoutperforms the state-of-the-art method on the perplexity metric by about 5points and achieves a higher BLEU metric score.

 

Quick Read (beta)

Emotional Neural Language Generation Grounded in Situational Contexts

Sashank Santhanam and Samira Shaikh
Department of Computer Science
University of North Carolina at Charlotte
Charlotte, NC, USA
{ssantha1,samirashaikh}@uncc.edu
Abstract

Emotional language generation is one of the keys to human-like artificial intelligence. Humans use different type of emotions depending on the situation of the conversation. Emotions also play an important role in mediating the engagement level with conversational partners. However, current conversational agents do not effectively account for emotional content in the language generation process. To address this problem, we develop a language modeling approach that generates affective content when the dialogue is situated in a given context. We use the recently released Empathetic-Dialogues corpus to build our models. Through detailed experiments, we find that our approach outperforms the state-of-the-art method on the perplexity metric by about 5 points and achieves a higher BLEU metric score.

Emotional Neural Language Generation Grounded in Situational Contexts


Sashank Santhanam and Samira Shaikh Department of Computer Science University of North Carolina at Charlotte Charlotte, NC, USA {ssantha1,samirashaikh}@uncc.edu

1 Introduction

Rapid advancement in the field of generative modeling through the use of neural networks has helped advance the creation of more intelligent conversational agents. Traditionally these conversational agents are built using seq2seq framework that is widely used in the field of machine translation (vinyals2015neural). However, prior research has shown that engaging with these agents produces dull and generic responses whilst also being inconsistent with the emotional tone of conversation (vinyals2015neural; li-etal-2016-persona). These issues also affect engagement with the conversational agent, that leads to short conversations (venkatesh2018evaluating). Apart from producing engaging responses, understanding the situation and producing the right emotional response to a that situation is another desirable trait (rashkin2019towards).

Emotions are intrinsic to humans and help in creation of a more engaging conversation (poria2019emotion). Recent work has focused on approaches towards incorporating emotion in conversational agents (asghar2018affective; zhou2018emotional; huang2018automatic; ghosh-etal-2017-affect), however these approaches are focused towards seq2seq task. We approach this problem of emotional generation as a form of transfer learning, using large pretrained language models. These language models, including BERT, GPT-2 and XL-Net, have helped achieve state of the art across several natural language understanding tasks (devlin-etal-2019-bert; radford2019language; yang2019xlnet). However, their success in language modeling tasks have been inconsistent (ziegler2019encoder). In our approach, we use these pretrained language models as the base model and perform transfer learning to fine-tune and condition these models on a given emotion. This helps towards producing more emotionally relevant responses for a given situation. In contrast, the work done by Rashkin et al. (rashkin2019towards) also uses large pretrained models but their approach is from the perspective of seq2seq task.

Our work advances the field of conversational agents by applying the transfer learning approach towards generating emotionally relevant responses that is grounded on emotion and situational context. We find that our fine-tuning based approach outperforms the current state of the art approach on the automated metrics of the BLEU and perplexity. We also show that transfer learning approach helps produce well crafted responses on smaller dialogue corpus.

2 Approach

Consider the example show in Table 1 that shows a snippet of the conversation between a speaker and a listener that is grounded in a situation representing a type of emotion. Our goal is to produce responses to conversation that are emotionally appropriate to the situation and emotion portrayed.

Emotion: Confident
Situation: I just knew I was going to do well at
work this morning.
Speaker: I just knew I was going to do well at
work this morning. I was prepared
Listener: That is the way to go! Keep it up!
Table 1: Example of conversations between a speaker and a listener

We approach this problem through a language modeling approach. We use large pre-trained language model as the base model for our response generation. This model is based on the transformer architecture and makes uses of the multi-headed self-attention mechanism to condition itself of the previously seen tokens to its left and produces a distribution over the target tokens. Our goal is to make the language model p(y)=p(y1,y2,.,yt;θ) learn on new data and estimate the conditional probability p(y|x). Radford et al. (radford2019language) demonstrated the effectiveness of language models to learn from a zero-shot approach in a multi-task setting. We take inspiration from this approach to condition our model on the task-specific variable p(yt|x,y<t), where x is the task-specific variable, in this case the emotion label. We prepend the conditional variable (emotion, situational context) to the dialogue similar to the approach from Wolf et al (wolf2019transfertransfo). We ensure that that the sequences are separated by special tokens.

3 Experiments

3.1 Data

In our experiments we use the Empathetic Dialogues dataset made available by Rashkin et al. (rashkin2019towards). Empathetic dialogues is crowdsourced dataset that contains dialogue grounded in a emotional situation. The dataset comprises of 32 emotion labels including surprised, excited, angry, proud, grateful. The speaker initiates the conversation using the grounded emotional situation and the listener responds in an appropriate manner11 1 More information about the dataset made available on the (rashkin2019towards).Table 2 provides the basic statistics of the corpus.

Train Valid. Test
Num. Conversations 19433 2770 2547
Utterances 84324 12078 10973
Avg Length
Conversations
4.31 4.36 4.31
Table 2: Statistics of Empathetic Dialogue dataset used in our experiments

3.2 Implementation

In all our experiments, we use the GPT-2 pretrained language model. We use the publicly available model containing 117M parameters with 12 layers; each layer has 12 heads. We implemented our models using PyTorch Transformers.22 2 https://github.com/huggingface/pytorch-transformers The input sentences are tokenized using byte-pair encoding(BPE) (sennrich-etal-2016-neural) (vocabulary size of 50263). While decoding, we use the nucleus sampling (p=0.9) approach instead of beam-search to overcome the drawbacks of beam search (holtzman2019curious; ippolito-etal-2019-comparison). All our models are trained on a single TitanV GPU and takes around 2 hours to fine-tune the model. The fine-tuned models along with the configuration files and the code will be made available at: https://github.com/sashank06/CCNLG-emotion.

3.3 Metrics

Evaluating the quality of responses in open domain situations where the goal is not defined is an important area of research. Researchers have used methods such as BLEU , METEOR (banerjee2005meteor), ROUGE (lin2004rouge) from machine translation and text summarization (dialogue-eval) tasks. BLEU and METEOR are based on word overlap between the proposed and ground truth responses; they do not adequately account for the diversity of responses that are possible for a given input utterance and show little to no correlation with human judgments (dialogue-eval). We report on the BLEU (papineni2002bleu) and Perplexity (PPL) metric to provide a comparison with the current state-of-the-art methods. We also report our performance using other metrics such as length of responses produced by the model. Following, Mei et al (mei17), we also report the diversity metric that helps us measure the ability of the model to promote diversity in responses (diversity2016). Diversity is calculated as the as the number of distinct unigrams in the generation scaled by the total number of generated tokens (mei17; li-etal-2016-persona). We report on two additional automated metrics of readability and coherence. Readability quantifies the linguistic quality of text and the difficulty of the reader in understanding the text (novikova-etal-2017-need). We measure readability through the Flesch Reading Ease (FRE) (kincaid1975derivation) which computes the number of words, syllables and sentences in the text. Higher readability scores indicate that utterance is easier to read and comprehend. Similarly, coherence measures the ability of the dialogue system to produce responses consistent with the topic of conversation. To calculate coherence, we use the method proposed by Dziri et al. (dziri2018augmenting).

4 Results

4.1 Automated Metrics

Experiment
Valid
PPL
BLEU Readability Coherence Length Diversity
Baseline
Fine-Tuned
(rashkin2019towards)
21.24 6.27 x x x x
Baseline
Emo-prepend
(rashkin2019towards)
24.30 4.36 x x x x
Our Model
Fine-Tuned
18.32 7.71 0.78 0.93 9.77 0.0031
Our Model
Emo-prepend
19.49 7.78 0.79 0.93 9.71 0.0033
Table 3: Comparison of the performance of our model to the baseline model proposed by Rashkin et al (rashkin2019towards) across a variety of automated metrics to provide a thorough comparison. x indicates that these metrics were not provided in the Rashkin et al (rashkin2019towards) work.

We first compare the performance of our approach with the baseline results obtained from Rashkin et al. (rashkin2019towards) that uses a full transformer architecture (vaswani2017attention), consisting of an encoder and decoder. Table 3 provides a comparison of our approach with to the baseline approach. In Table 3, we refer our “Our Model Fine-Tuned” as the baseline fine-tuned GPT-2 model trained on the dialogue and “Our-model Emo-prepend” as the GPT-2 model that is fine-tuned on the dialogues but also conditioned on the emotion displayed in the conversation. We find that fine-tuning the GPT-2 language model using a transfer learning approach helps us achieve a lower perplexity and a higher BLEU scores. The results from our approach are consistent with the empirical study conducted by Edunov et al (edunov-etal-2019-pre) that demonstrate the effectiveness of the using pre-trained model diminishes when added to the decoder network in an seq2seq approach. We also perform a comparison between our two models on the metrics of length, diversity, readability and coherence. We find that our baseline model produces less diverse responses compared to when the model is conditioned on emotion. We find that the our emo-prepend model also higher a slightly higher readability score that our baseline model.

4.2 Qualitative Evaluation

To assess the quality of generations, we conducted a MTurk human evaluation. We recruited a total of 15 participants and each participant was asked to evaluate 25 randomly sampled outputs from the test set on three metrics:

  1. 1.

    Readability - Is the response easy to understand, fluent and grammatical and does not have any consecutive repeating words.

  2. 2.

    Coherence - Is the response relevant to the context of the conversation.

  3. 3.

    Emotional Appropriateness- Does the response convey emotion suitable to the context of the conversation?

Input
Conversation
Model Outputs
Person A: I think I aced my supervisor
exam!
Ground
Truth
Woo hoo! That is awesome! Congrats! When will you find out
Fine
Tuned
What was it?
Emo
Prepend
Wow, that’s impressive!
Person A: I had pizza at a restaurant a
couple days ago and found a hair on my
pizza!
Ground
Truth
It is disgusting. What did you do?
Fine
Tuned
I was so angry
Emo
Prepend
That’s terrible, did you clean it up?
Table 4: Example generations from our two model along with the ground truth responses.
Readability Coherence
Emotional
Appropriate-
ness
Our Model
Fine-Tuned
4.14 3.50 3.70
Our Model
Emo-prepend
3.54 3.4 3.19
Ground
Truth
3.92 3.86 4
Table 5: Human ratings demonstrating a comparison between our models to the ground truth responses on the metrics of readability, coherence and emotional appropriateness

Table 5 shows the results obtained from the human evaluation comparing the performance of our fine-tuned, emotion pre-pend model to the ground-truth response. We find that our fine-tuned model outperforms the emo-prepend on all three metrics from the ratings provided by the human ratings.

5 Related Work

The area of dialogue systems has been studied extensively in both open-domain (niu2018polite) and goal-oriented (lipton2018bbq) situations. Extant approaches towards building dialogue systems has been done predominantly through the seq2seq framework (vinyals2015neural). However, prior research has shown that these systems are prone to producing dull and generic responses that causes engagement with the human to be affected (vinyals2015neural; venkatesh2018evaluating). Researchers have tackled this problem of dull and generic responses through different optimization function such as MMI (li-etal-2016-diversity) and through reinforcement learning approaches(Li-reinforce). Alternative approaches towards generating more engaging responses is by grounding them in personality of the speakers that enables in creating more personalized and consistent responses (li-etal-2016-persona; zhang-etal-2018-personalizing; wolf2019transfertransfo).

Several other works have focused on creating more engaging responses by producing affective responses. One of the earlier works to incorporate affect through language modeling is the work done by Ghosh et al. (ghosh-etal-2017-affect). This work leverages the LIWC (pennebaker2001linguistic) text analysis platform for affective features. Alternative approaches of inducing emotion in generated responses from a seq2seq framework include the work done by Zhou et al(zhou2018emotional) that uses internal and external memory, Asghar et al. (asghar2018affective) that models emotion through affective embeddings and Huang et al (huang2018automatic) that induce emotion through concatenation with input sequence. More recently, introduction of transformer based approaches have helped advance the state of art across several natural language understanding tasks (vaswani2017attention). These transformers models have also helped created large pre-trained language models such as BERT (devlin-etal-2019-bert), XL-NET (yang2019xlnet), GPT-2 (radford2019language). However, these pre-trained models show inconsistent behavior towards language generation (ziegler2019encoder).

6 Conclusion and Discussion

In this work, we study how pre-trained language models can be adopted for conditional language generation on smaller datasets. Specifically, we look at conditioning the pre-trained model on the emotion of the situation produce more affective responses that are appropriate for a particular situation. We notice that our fine-tuned and emo-prepend models outperform the current state of the art approach relative to the automated metrics such as BLEU and perplexity on the validation set. We also notice that the emo-prepend approach does not out perform a simple fine tuning approach on the dataset. We plan to investigate the cause of this in future work from the perspective of better experiment design for evaluation (santhanam2019towards) and analyzing the models focus when emotion is prepended to the sequence (clark2019does). Along with this, we also notice other drawbacks in our work such as not having an emotional classifier to predict the outcome of the generated sentence, which we plan to address in future work.

Acknowledgments

This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No FA8650-18-C-7881. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of AFRL, DARPA, or the U.S. Government. We thank the anonymous reviewers for the helpful feedback.

References