Abstract
Despite the effectiveness of the sequence-to-sequence framework on the task of Short-Text Conversation (STC), the issue of under-exploitation of training data (i.e., the supervision signals from the query text are ignored) still remains unresolved. Also, the adopted maximization-based decoding strategies, inclined to generate generic responses or responses with repetition, are unsuited to the STC task. In this paper, we propose to formulate the STC task as a language modeling problem and tailor-make a training strategy to adapt a language model for response generation. To enhance generation performance, we design a relevance-promoting transformer language model, which performs additional supervised source attention after the self-attention to increase the importance of informative query tokens in calculating the token-level representation. The model further refines the query representation with relevance clues inferred from its multiple references during training. In testing, we adopt a randomization-over-maximization strategy to reduce the generation of generic responses. Experimental results on a large Chinese STC dataset demonstrate the superiority of the proposed model on relevance metrics and diversity metrics. Code is available at https://ai.tencent.com/ailab/nlp/dialogue/.
Relevance-Promoting Language Model for Short-Text Conversation
The work described in this paper is substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14204418). It was mainly done when Xin Li was an intern at Tencent AI Lab.
Xin Li,^{1} Piji Li,^{2} Wei Bi,^{2} Xiaojiang Liu,^{2} Wai Lam^{1} ^{1}Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong ^{2}Tencent AI Lab, Shenzhen, China {lixin, wlam}@se.cuhk.edu.hk, {pijili, victoriabi, kieranliu}@tencent.com
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Short Text Conversation (STC) (?), also known as single-turn chit-chat conversation, is a popular research topic in the field of natural language processing. It is usually formulated as a sequence translation problem (?; ?), and the sequence-to-sequence encoder-decoder (Seq2Seq) framework (?; ?; ?) is applied to solve it. The decoder generates the response token by token, conditioned on the compressed query representations from the encoder. Following this paradigm, many attempts have been made to refine the quality of the generated responses (?; ?; ?; ?).
Despite the effectiveness of these efforts, some intrinsic issues of Seq2Seq-based models still hinder further improvement of generation performance. Under the Seq2Seq formulation, the autoregressive decoder is trained only on the gold-standard response text while the query text is ignored, leading to under-exploitation of the training data. Besides, the maximization-based decoding strategies adopted in existing models, such as beam search and greedy search, restrict the search space to the most frequent phrases and thus tend to generate generic or repetitive responses with unnaturally high likelihood, degrading the conversational experience.
GPT-2 (?), a recently proposed Transformer-based language model, provides an alternative solution for language generation. One advantage of GPT-2 is that the transformer language model can not only capture context of arbitrary length but also make full use of the textual supervision signals, because the generator is the language model itself. Moreover, GPT-2 adopts top-k sampling (?) to diversify the generated texts while preserving relevance. These characteristics are attractive for solving the STC task, whose aim is to generate informative and diverse human-like responses given the user queries.
However, due to the nature of language modeling, directly applying GPT-2 to the STC task, a conditional language generation task, may be insufficient because the language model is unable to discriminate between the source (query) sentence and the target (response) sentence. The original experimental results of GPT-2 on the abstractive summarization task (?) also support this claim. Another potential issue of adapting a language model for the STC task comes from recency bias (?) and explanation-away effects (?; ?), where the language model tends to rely overly on the immediate context and explain away the long-term context (which is roughly equivalent to the source information in the Seq2Seq framework), yielding fluent but topically irrelevant responses.
With the motivation of inheriting the merits of the transformer language model while alleviating the potential issues under the language model formulation, we carefully design a training strategy to adapt the autoregressive transformer-based language model for conditional response generation. (Unless otherwise specified, the language model in this paper refers to the “autoregressive” language model, which is different from “autoencoding” language models (?; ?).) First of all, we observe that a dialog is actually a process of text continuation, in other words, giving the response right after the query. Based on this observation, we can regard the STC task as a language modeling problem on the concatenated sequence of query and response. To discriminate the generation of query tokens from that of response tokens, we inject a special token between query and response, acting as the trigger of response generation. With this formulation, the language-model-based training objective can make use of the textual data from the query, alleviating the under-exploitation issue mentioned above.
Since the transformer-based language model tends to focus on the short-term context and ignore the long-term context, namely, the explanation-away issue, we propose to empower the self-attention with encoder-decoder attention, which forces the model to pay additional attention to the query, especially the query tokens of user interest, and guides the model to rely on informative query tokens to make good predictions. We also observe that some response tokens not mentioned in the query are still closely related to the topic discussed in the conversation. In order to exploit such relevance clues hidden behind the responses, we propose a topic inference component to learn a compact source (query) representation encoding the information relevant to the query and feed the query representation into each generation step, encouraging the language model to consider generating topic words potentially related to the query.
As for the decoding strategy, different from existing STC models, we propose to decode with a randomization-over-maximization method, namely, top-k sampling, from the transformer language model to generate relevant responses with high originality.
In summary, our contributions are as follows:
$\bullet $ We tailor-make a training strategy to adapt the transformer-based language model for the Short Text Conversation (STC) task.
$\bullet $ We propose two components, namely, the Supervised Source Attention (SSA) component and the Topic Inference (TI) component, to promote relevance modeling in the language-model-based response generator.
$\bullet $ To the best of our knowledge, we are the first to introduce top-k sampling, a randomization-over-maximization strategy, for diverse response generation. (We notice that some concurrent works (?; ?; ?) also adopt a strategy similar to ours after the submission.)
Model
Overview
In our language model formulation, each training query-response pair and the special tokens are concatenated as a single sequence $\mathbf{x}=\{{x}_{1},\mathrm{\cdots},{x}_{m},{x}_{m+1},\mathrm{\cdots},{x}_{n}\}$ of length $n$. ${\mathbf{x}}_{1:m}$ corresponds to the query token sequence of length $m$, and ${x}_{m}$ is the special token [EOQ], denoting the end of the query. ${\mathbf{x}}_{m+1:n}$ corresponds to the response, and ${x}_{n}$ is [EOS], the end symbol of the whole sequence. The training objective of our model is to maximize the unconditional likelihood $p({\mathbf{x}}_{1:n})$, similar to existing language models (?; ?).
The architecture of our model is depicted in Fig. 2, where $L$ decoder-only transformer layers (?) are involved (for the technical details of the transformer, we refer the reader to (?)). Different from the original transformer layer, which solely contains the self-attention component, the transformer layer in our model is further empowered with the proposed Supervised Source Attention (SSA) component. The outputs of the $l$-th transformer layer are the contextualized token representations of size ${\mathrm{dim}}_{h}$, denoted as ${\mathbf{H}}^{l}\in {\mathbb{R}}^{n\times {\mathrm{dim}}_{h}}$. When predicting the tokens, a Topic Inference (TI) component is introduced to provide the refined query representations encoding the topic information inferred from the references.
Language Model as Response Generator
To adapt a language model for the STC task, we need a training strategy different from that in the Seq2Seq framework. Based on the observation that human conversations can be regarded as a process of text continuation (i.e., giving the response/answer right after the query/question), we concatenate the query token sequence and the response token sequence into a single sequence and formulate the STC task as a contextual text continuation problem. One input example of our model is illustrated in Fig. 1. The training goal of the model is to minimize the joint negative log-likelihood over the whole sequence:
$${\mathcal{L}}^{\mathrm{mle}}=-\sum _{t=1}^{n}\mathrm{log}\,p({x}_{t}|{\mathbf{x}}_{<t})$$  (1) 
It is easy to bridge the gap between the task-specific training and the autoregressive pre-training (?; ?; ?) because the formulations of their objectives are almost the same. Another advantage of this language model formulation is that it takes the likelihood of query tokens into consideration, which is ignored in existing works (?; ?). Intuitively, the text generated by the language model should be more fluent than that generated by the Seq2Seq framework because the generator (the language model itself) is trained not only on the response sentence but also on the query sentence.
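As a minimal sketch of this formulation, the following Python snippet builds the concatenated training sequence and evaluates the joint negative log-likelihood for given per-token probabilities. The helper names `build_sequence` and `sequence_nll`, the example tokens, and the toy probabilities are illustrative assumptions and are not taken from the released code.

```python
import math

EOQ, EOS = "[EOQ]", "[EOS]"  # special tokens named in the paper

def build_sequence(query_tokens, response_tokens):
    """Concatenate query, [EOQ], response and [EOS] into one sequence x_{1:n}."""
    return list(query_tokens) + [EOQ] + list(response_tokens) + [EOS]

def sequence_nll(token_probs):
    """Joint negative log-likelihood: -sum_t log p(x_t | x_{<t})."""
    return -sum(math.log(p) for p in token_probs)

x = build_sequence(["how", "are", "you"], ["fine", "thanks"])
m = x.index(EOQ) + 1   # query length, counting [EOQ] as x_m
n = len(x)             # total length, with [EOS] as x_n

# Hypothetical model outputs p(x_t | x_{<t}), one per position (query AND response).
probs = [0.5] * n
loss = sequence_nll(probs)
```

Because the sum runs over all `n` positions, the query tokens contribute to the loss as well, which is the point of the language model formulation.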
Relevance Modeling Component
The vanilla transformer decoder is equipped with self-attention (?; ?) and can theoretically capture context of arbitrary length. Given the input ${\mathbf{H}}^{l-1}\in {\mathbb{R}}^{n\times {\mathrm{dim}}_{h}}$, the contextualized representation ${\mathbf{h}}_{t}^{l}$ ($l\in [1,L]$, $t\in [1,n]$) at the $t$-th time step is built as follows:
$$\begin{array}{cc}\hfill {\mathbf{h}}_{t}^{l},{\bm{\alpha}}_{t}^{l}& =\mathrm{\text{SlfAtt}}({\mathbf{q}}_{t}^{l-1},{\mathbf{K}}_{\le t}^{l-1},{\mathbf{V}}_{\le t}^{l-1})\hfill \\ \hfill {\mathbf{Q}}^{l-1}& ={\mathbf{H}}^{l-1}{\mathbf{W}}^{Q}\hfill \\ \hfill {\mathbf{K}}^{l-1},{\mathbf{V}}^{l-1}& ={\mathbf{H}}^{l-1}{\mathbf{W}}^{K},{\mathbf{H}}^{l-1}{\mathbf{W}}^{V}\hfill \end{array}$$  (2) 
where SlfAtt is the self-attention layer (the symbols for the feed-forward layer and residual connections are not shown) and ${\bm{\alpha}}_{t}^{l}\in {\mathbb{R}}^{t}$ is the calculated attention vector. $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}\in {\mathbb{R}}^{n\times {\mathrm{dim}}_{h}}$ respectively denote the query, key and value in the self-attention layer (here, the “query” refers to a real-valued vector, whereas the “query” in the STC task is a sentence). ${\mathbf{K}}_{\le t}^{l-1}=\{{\mathbf{k}}_{1}^{l-1},\mathrm{\cdots},{\mathbf{k}}_{t}^{l-1}\}$ denotes the leftward elements, and likewise for ${\mathbf{V}}_{\le t}^{l-1}$. Despite its capability of learning global dependencies, the transformer-based language model still tends to rely overly on the short-term context and ignore the long-term context when predicting the next word, dubbed the explanation-away problem (?). This problem is catastrophic for the STC task because the query acts as the long-term context in our language model formulation, and a model that does not involve the query information is prone to generating content irrelevant to the query. Therefore, explicitly modeling the relevance and emphasizing the importance of the query are essential. In this paper, we propose two components, namely, Supervised Source Attention (SSA) and Topic Inference (TI), to handle the explanation-away problem.
Supervised Source Attention
In existing Seq2Seq-based frameworks, the query/source information is incorporated by applying encoder-decoder attention solely on the encoder hidden representations. Similarly, attending only to the long-term context of the language model is presumably beneficial for improving relevance. Therefore, we propose to introduce another source attention layer on top of the self-attention layer. The ${t}^{\prime}$-th (${t}^{\prime}\ge m$) query-enhanced hidden representation ${\widehat{\mathbf{h}}}_{{t}^{\prime}}^{l}$ is computed as follows:
$$\begin{array}{cc}\hfill {\widehat{\mathbf{h}}}_{{t}^{\prime}}^{l},{\bm{\beta}}_{{t}^{\prime}}^{l}& =\mathrm{\text{SrcAtt}}({\widehat{\mathbf{q}}}_{{t}^{\prime}}^{l},{\widehat{\mathbf{K}}}^{l},{\widehat{\mathbf{V}}}^{l})\hfill \\ \hfill {\widehat{\mathbf{Q}}}^{l}& ={\mathbf{H}}^{l}{\mathbf{W}}^{Q}\hfill \\ \hfill {\widehat{\mathbf{K}}}^{l},{\widehat{\mathbf{V}}}^{l}& ={\mathbf{H}}_{1:m}^{l}{\mathbf{W}}^{K},{\mathbf{H}}_{1:m}^{l}{\mathbf{W}}^{V}\hfill \end{array}$$  (3) 
SrcAtt refers to our source attention layer on top of the self-attention layer. ${\bm{\beta}}_{{t}^{\prime}}^{l}\in {\mathbb{R}}^{m}$ contains the attention scores for the corresponding hidden representations of the query tokens. ${\mathbf{H}}^{l}$ is the output of the SlfAtt layer, and ${\widehat{\mathbf{Q}}}^{l}\in {\mathbb{R}}^{n\times {\mathrm{dim}}_{h}}$, ${\widehat{\mathbf{K}}}^{l}$, ${\widehat{\mathbf{V}}}^{l}\in {\mathbb{R}}^{m\times {\mathrm{dim}}_{h}}$ are the corresponding query, key and value in the source attention. Note that we apply the source attention only when the current token is not a query token, i.e., ${t}^{\prime}\ge m$, and do nothing in the preceding steps. Learning word alignment from data is possible but may be inaccurate without any supervision or external knowledge (?; ?). Therefore, we employ the keywords as the knowledge and force the source attention component to concentrate on the important query tokens. First of all, we perform max-over-time pooling over the attention vectors ${\bm{\beta}}_{{t}^{\prime}}^{l}\in {\mathbb{R}}^{m}$ (${t}^{\prime}\in [m+1,n]$) and induce the vector ${\widehat{\mathbf{y}}}^{\mathrm{src}}\in {\mathbb{R}}^{m}$ reflecting the salience scores of the query/source tokens:
$${\widehat{\mathbf{y}}}_{i}^{\mathrm{src}}=\mathrm{max}\{{\bm{\beta}}_{m+1,i}^{L},\mathrm{\cdots},{\bm{\beta}}_{n,i}^{L}\},i\in [1,m]$$  (4) 
Then, given the query keyword indicator vector ${\mathbf{y}}^{\mathrm{src}}\in {\{0,1\}}^{m}$, we introduce an additional source attention loss ${\mathcal{L}}^{\mathrm{src}}$ into Eq (1):
$${\mathcal{L}}^{\mathrm{src}}=\frac{1}{m}{\|{\widehat{\mathbf{y}}}^{\mathrm{src}}-{\mathbf{y}}^{\mathrm{src}}\|}_{2}^{2}$$  (5) 
Ideally, the generation process will rely more on the important query tokens if the salience score ${\widehat{\mathbf{y}}}^{\mathrm{src}}$ is closer to the keyword vector ${\mathbf{y}}^{\mathrm{src}}$.
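The pooling and loss steps above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names and the toy attention matrix are assumptions, and the loss is taken to be the squared L2 distance between the pooled salience vector and the keyword indicator, averaged over the $m$ query tokens.

```python
def salience_from_attention(beta):
    """Max-over-time pooling.

    beta[t][i] is the source-attention weight that response step t
    puts on query token i; the result has one score per query token.
    """
    m = len(beta[0])
    return [max(row[i] for row in beta) for i in range(m)]

def source_attention_loss(y_hat, y_src):
    """Squared L2 distance between salience and 0/1 keyword labels, divided by m."""
    m = len(y_src)
    return sum((a - b) ** 2 for a, b in zip(y_hat, y_src)) / m

# Toy example: two response steps attending over three query tokens.
beta = [
    [0.1, 0.7, 0.2],
    [0.2, 0.5, 0.3],
]
y_src = [0, 1, 0]                       # only the second query token is a keyword
y_hat = salience_from_attention(beta)   # per-token maximum over response steps
loss = source_attention_loss(y_hat, y_src)
```

Minimizing this loss pushes the pooled attention toward 1 on keyword positions and toward 0 elsewhere.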
Topic Inference
The SSA component attempts to improve the relevance by highlighting the important query tokens/words in the attention process. However, the range of words topically related to the query is far wider than the keywords explicitly mentioned in the query. Consider the query “what is your favorite fruit?” and two valid responses “I like the watermelon very much” and “My favorite fruit is pineapple”: “fruit” should be emphasized during the generation, but the words used to discuss fruit, such as “watermelon” and “pineapple”, are also very meaningful for building a response. Inspired by this, we collect the multiple references of each query in the training set and gather all of the keywords extracted from such responses. (Whereas (?) extend the keyword set using an external corpus, we focus on improving the relevance rather than enriching the topical words in the response and thus only utilize the training data to explore more keywords.) To exploit the latent topic information, we introduce the Topic Inference (TI) component to estimate the global topical word distribution based on the query representation ${\mathbf{h}}^{q}$ as follows:
$$\begin{array}{c}\hfill {\mathbf{h}}^{q}=f({\mathbf{x}}_{1:m}),\quad P(z|{\mathbf{x}}_{1:m})=\text{Softmax}({\mathbf{W}}^{o}{\mathbf{h}}^{q})\end{array}$$  (6) 
where $f:{\mathbb{R}}^{m}\to {\mathbb{R}}^{{\mathrm{dim}}_{h}}$ denotes the function mapping the input query tokens to a low-dimensional query representation. Specifically, to simplify the modeling, we feed the last query hidden representation in the transformer, namely, ${\mathbf{h}}_{m}^{L}$, into a linear layer with tanh activation and regard the output as the query representation ${\mathbf{h}}^{q}$. To encode the topic information into the query representation, we employ the global keyword indicator vector ${\mathbf{y}}^{\mathrm{kwd}}\in {\{0,1\}}^{|\mathcal{V}|}$ as supervision signals and force the components corresponding to keywords/important tokens in the query-based global topic distribution to be up-weighted. The computational formula is as follows:
$${\mathcal{L}}^{\mathrm{kwd}}=-\frac{1}{|\mathcal{V}|}\sum _{i=1}^{|\mathcal{V}|}{\mathbf{y}}_{i}^{\mathrm{kwd}}\cdot \mathrm{log}{P}_{i}(z|{\mathbf{x}}_{1:m})$$  (7) 
where the subscript $i$ denotes the $i$-th component of a vector and $|\mathcal{V}|$ is the vocabulary size. Note that we attempted to replace the Softmax in Eq 6 with the component-wise Sigmoid, typically used in multi-label classification problems, but the empirical results became worse. Thus, we keep the Softmax probability function unchanged in the experiment. Similar to Eq 5, ${\mathcal{L}}^{\mathrm{kwd}}$ is added to the training loss.
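To make the keyword supervision concrete, here is a small pure-Python sketch of the loss. It assumes the loss carries the usual negative sign, so that minimizing it raises the probability of keyword entries, and that the logits stand in for ${\mathbf{W}}^{o}{\mathbf{h}}^{q}$; the function names and the four-word toy vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over the vocabulary."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def keyword_loss(logits, y_kwd):
    """-(1/|V|) * sum_i y_i * log P_i(z | x_{1:m})."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(y_kwd, probs)) / len(y_kwd)

# Toy vocabulary of size 4; entries 0 and 2 are keywords collected
# from the multiple references of the query.
logits = [0.0, 0.0, 0.0, 0.0]   # stands in for W^o h^q (uniform distribution)
y_kwd = [1, 0, 1, 0]
loss = keyword_loss(logits, y_kwd)
```

With uniform logits each word has probability 0.25, so the loss is $-(2\log 0.25)/4=\log 2$; shifting probability mass toward the two keyword entries lowers it.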
Different from (?) and (?), which regard the concrete topic/keyword as the trigger of generation, we introduce the query representation encoding the global topic information as a supplement to each token-level representation to encourage the generation of relevant topical words. The representation vector ${\mathbf{s}}_{t}$ for predicting the output is calculated below:
$$\begin{aligned}{\mathbf{s}}_{t}&=\begin{cases}(1-{g}_{t})*{\mathbf{h}}_{t}^{L}+{g}_{t}*{\mathbf{h}}^{q}, & \text{if } t>m\\ {\mathbf{h}}_{t}^{L}, & \text{otherwise}\end{cases}\\ {g}_{t}&=\sigma ({\mathbf{W}}^{g}{\mathbf{h}}^{q}+{\mathbf{W}}^{l}{\mathbf{h}}_{t}^{L}+\mathbf{b}),\end{aligned}$$  (8) 
where ${g}_{t}\in {\mathbb{R}}^{{\mathrm{dim}}_{h}}$ is the gate value and ${\mathbf{W}}^{g},{\mathbf{W}}^{l}\in {\mathbb{R}}^{{\mathrm{dim}}_{h}\times {\mathrm{dim}}_{h}}$ are parameter matrices in the TI component.
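The gating step can be sketched as follows in plain Python. The sketch assumes the gate interpolates element-wise between the token representation and the query representation, i.e. $(1-{g}_{t})*{\mathbf{h}}_{t}^{L}+{g}_{t}*{\mathbf{h}}^{q}$, and that it is applied only at response positions; whether a position belongs to the response is passed in as a flag rather than derived from $t$ and $m$. All names and toy values are illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fuse(h_t, h_q, W_g, W_l, b, is_response_step):
    """Gated mix of token representation h_t and query representation h_q."""
    if not is_response_step:
        return list(h_t)                      # query positions are left unchanged
    pre = [a + c + d for a, c, d in zip(matvec(W_g, h_q), matvec(W_l, h_t), b)]
    g = [sigmoid(p) for p in pre]             # one gate value per hidden dimension
    return [(1.0 - gi) * hi + gi * qi for gi, hi, qi in zip(g, h_t, h_q)]

# Toy check with zero parameters: every gate equals 0.5, so the output
# is the element-wise average of h_t and h_q.
zeros = [[0.0, 0.0], [0.0, 0.0]]
h_t, h_q = [1.0, 0.0], [0.0, 1.0]
s_resp = fuse(h_t, h_q, zeros, zeros, [0.0, 0.0], True)
s_query = fuse(h_t, h_q, zeros, zeros, [0.0, 0.0], False)
```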
Model Training
The proposed SSA component and the TI component are jointly trained with the transformer-based language model. Based on Eq 1, Eq 5 and Eq 7, the overall training objective $\mathcal{L}(\theta )$ of the proposed model is as follows:
$$\begin{array}{cc}& \mathcal{L}(\theta )=\frac{1}{|\mathbb{D}|}\sum _{(\mathbf{x},{\mathbf{y}}^{\mathrm{src}},{\mathbf{y}}^{\mathrm{kwd}})\in \mathbb{D}}\mathcal{L}(\mathbf{x},{\mathbf{y}}^{\mathrm{src}},{\mathbf{y}}^{\mathrm{kwd}})\hfill \\ & \mathcal{L}(\mathbf{x},{\mathbf{y}}^{\mathrm{src}},{\mathbf{y}}^{\mathrm{kwd}})={\mathcal{L}}^{\mathrm{mle}}+{\gamma}_{1}{\mathcal{L}}^{\mathrm{src}}+{\gamma}_{2}{\mathcal{L}}^{\mathrm{kwd}}\hfill \end{array}$$  (9) 
Here, ${\gamma}_{1}$ and ${\gamma}_{2}$ are the coefficients controlling the proportion of ${\mathcal{L}}^{\mathrm{src}}$ and ${\mathcal{L}}^{\mathrm{kwd}}$ involved in the training respectively.
Decoding
Due to the limited search space, it is difficult for beam search or greedy search to find interesting and diverse responses. Therefore, instead of adopting them, we use a “randomization-over-maximization” strategy (also known as “top-k sampling”) to perform the decoding, as done in (?; ?). (?) and (?) explore the usage of other advanced decoding strategies in the language generation task. Since our aim in this paper is not to compare performance across different decoding strategies, we consistently use top-k sampling.
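For readers unfamiliar with the technique, the snippet below is a generic sketch of top-k sampling for a single decoding step, not the authors' implementation. The value of k, the absence of a temperature, and the toy logits are assumptions made here for illustration; the paper's actual settings are deferred to its appendix.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring candidates."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k candidates (shifted for numerical stability).
    mx = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - mx) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, 1.5, -1.0]   # hypothetical next-token scores
rng = random.Random(0)           # seeded only to make the sketch reproducible
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(20)]
greedy = top_k_sample(logits, k=1, rng=rng)   # k=1 reduces to greedy search
```

With k=2 only the two best-scoring tokens (indices 0 and 2) can ever be drawn, while k=1 recovers the maximization-based choice; larger k trades relevance for diversity.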
Experiment
Experiment Setup
We utilize the benchmark STC dataset (?) to evaluate the effectiveness of the proposed relevance-promoting transformer language model. This dataset is built from real conversations on Weibo (https://www.weibo.com/) and contains about 7M high-quality query-response pairs. We split the dataset such that #train:#dev:#test is 7,024,156:2,000:800. Training details are provided in the appendix.
To avoid word segmentation errors and out-of-vocabulary issues, the tokens in our model and the baseline models are Chinese characters, and the vocabulary size is about 12,000.
Evaluation Metrics
We introduce the following metrics to evaluate the model’s capability of generating relevant and diverse responses:
Relevance Metrics: We employ Bleu-2, Bleu-3 & Bleu-4 (?) to estimate the relevance of the generated responses. Moreover, we design two more metrics, namely, Hit-q and Hit-r, to calculate the hit rates of the topical words in the query and the response respectively. Firstly, we build a high-precision, low-recall keyword set for each query/response sentence with a keyword extraction toolkit (https://github.com/fxsjy/jieba) and filter some noisy words with additional hand-crafted rules. Then, we calculate Hit-q${}_{i}$ and Hit-r${}_{i}$ for the $i$-th prediction as follows:
$${\mathrm{\text{Hit-q}}}_{i}=\frac{|{\mathbb{K}}^{{r}_{i}}\cap {\mathbb{K}}^{{q}_{i}}|}{|{\mathbb{K}}^{{r}_{i}}|},\quad {\mathrm{\text{Hit-r}}}_{i}=\frac{|{\mathbb{K}}^{{r}_{i}}\cap {\mathbb{K}}^{{r}_{i}^{g}}|}{|{\mathbb{K}}^{{r}_{i}}|}$$  (10) 
where ${\mathbb{K}}^{{q}_{i}}$, ${\mathbb{K}}^{{r}_{i}}$ and ${\mathbb{K}}^{{r}_{i}^{g}}$ respectively denote the topical word sets for the $i$-th query, predicted response and gold-standard response. Then we obtain Hit-q and Hit-r by performing the corpus-level average:
$$\mathrm{\text{Hit-q}}=\frac{1}{N}\sum _{i}^{N}{\mathrm{\text{Hit-q}}}_{i},\quad \mathrm{\text{Hit-r}}=\frac{1}{N}\sum _{i}^{N}{\mathrm{\text{Hit-r}}}_{i}$$  (11) 
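The hit-rate computation can be sketched as below, assuming the keyword sets have already been extracted. The function names are hypothetical, and returning 0 when a predicted response has no keywords is an assumption of this sketch, since the text does not say how an empty keyword set is handled.

```python
def hit_rate(pred_keywords, ref_keywords):
    """Fraction of predicted-response keywords that also occur in the reference set."""
    pred = set(pred_keywords)
    if not pred:
        return 0.0   # assumption: empty prediction keyword set counts as zero
    return len(pred & set(ref_keywords)) / len(pred)

def corpus_hit(pred_sets, ref_sets):
    """Corpus-level average of the per-example hit rates."""
    return sum(hit_rate(p, r) for p, r in zip(pred_sets, ref_sets)) / len(pred_sets)

# Toy example with two test cases; the reference sets play the role of the
# query keywords (first hit metric) or the gold-response keywords (second one).
pred_sets = [{"tea", "cream"}, {"rain"}]
query_sets = [{"tea"}, {"sun"}]
score = corpus_hit(pred_sets, query_sets)   # (1/2 + 0/1) / 2
```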
Diversity Metrics: Following (?), we employ Dist-1 and Dist-2 to calculate the ratios of distinct unigrams and bigrams in the generated responses.
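A common way to compute such distinct n-gram ratios is sketched below: the number of unique n-grams divided by the total number of n-grams over all generated responses. Pooling the counts over the whole corpus (rather than averaging per response) is an assumption of this sketch, and the function name is hypothetical.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to all n-grams across tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Toy corpus of two tokenized responses.
responses = [["a", "b", "a"], ["a", "b"]]
dist1 = distinct_n(responses, 1)   # 2 unique unigrams out of 5
dist2 = distinct_n(responses, 2)   # 2 unique bigrams out of 3
```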
Human Evaluations: We also conduct human evaluations. Specifically, we randomly sample 100 queries and recruit five helpers to judge the Relevance (4-scale rating, 0-3), Fluency (3-scale rating, 0-2) and Acceptance (0 or 1) of the responses generated by our model and the baselines. Details of the rating criteria are stated in the appendix.
Comparison Models

• LSTM-LM (?): LSTM-based autoregressive language model armed with incremental self-attention. We train LSTM-LM using the same strategy mentioned in this paper.
• LSTM-S2S: Attention-based LSTM Sequence-to-Sequence model.
• TFM-S2S: Transformer Sequence-to-Sequence model where the network components are identical to those in (?).
• TFM-LM: Transformer-based autoregressive language model. We train TFM-LM using the same strategy mentioned in this paper.
• MMI (?): LSTM-S2S with the Maximum Mutual Information objective in decoding. In this paper, we set the number of responses for re-ranking to 50.
• CVAE (?) (https://github.com/snakeztc/NeuralDialogCVAE): Conditional Variational Auto-Encoder for response generation. We replace the dialogue acts used in the original model with the keywords extracted from the references.
• MMPMS (?): The model with the state-of-the-art performance on the STC task. We rerun the officially released code (https://github.com/PaddlePaddle/models) to obtain the results on our dataset.
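Table 1: Automatic evaluation results. Bleu and Hit are relevance metrics; Dist are diversity metrics.
| Model | Bleu-2 | Bleu-3 | Bleu-4 | Hit-q | Hit-r | Dist-1 | Dist-2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LSTM-LM | 3.8 | 0.9 | 0.3 | 0.084 | 0.066 | 0.028 | 0.094 |
| LSTM-S2S | 5.6 | 2.8 | 1.8 | 0.293 | 0.145 | 0.039 | 0.137 |
| TFM-LM | 6.9 | 3.2 | 2.1 | 0.295 | 0.144 | 0.058 | 0.259 |
| TFM-S2S | 7.3 | 3.5 | 2.3 | 0.369 | 0.172 | 0.078 | 0.290 |
| MMI | 7.9 | 2.5 | 1.0 | 0.197 | 0.145 | 0.093 | 0.349 |
| CVAE | 5.8 | 1.5 | 0.4 | 0.211 | 0.135 | 0.060 | 0.211 |
| MMPMS | 6.7 | 3.0 | 1.8 | 0.151 | 0.102 | 0.057 | 0.220 |
| OURS-tk w/o SSA & TI | 4.9 | 1.0 | 0.3 | 0.119 | 0.076 | 0.086 | 0.441 |
| OURS-tk w/o SSA | 5.5 | 2.1 | 1.5 | 0.150 | 0.146 | 0.102 | 0.521 |
| OURS-tk w/o TI | 5.1 | 2.1 | 1.4 | 0.171 | 0.132 | 0.090 | 0.445 |
| OURS-bm | 10.3 | 5.3 | 3.4 | 0.510 | 0.193 | 0.102 | 0.398 |
| OURS-tk | 6.0 | 3.6 | 2.5 | 0.191 | 0.152 | 0.107 | 0.544 |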
Table 2: Human evaluation results.
| Model | Relevance | Fluency | Acceptance |
| --- | --- | --- | --- |
| LSTM-LM | 1.206 | 1.297 | 0.26 |
| LSTM-S2S | 1.386 | 1.285 | 0.37 |
| TFM-LM | 1.412 | 1.328 | 0.39 |
| TFM-S2S | 1.475 | 1.306 | 0.43 |
| MMI | 1.432 | 1.301 | 0.34 |
| CVAE | 1.316 | 1.274 | 0.33 |
| MMPMS | 1.528 | 1.396 | 0.42 |
| OURS-tk w/o SSA & TI | 1.273 | 1.368 | 0.28 |
| OURS-tk w/o SSA | 1.485 | 1.407 | 0.39 |
| OURS-tk w/o TI | 1.503 | 1.303 | 0.36 |
| OURS-bm | 1.515 | 1.359 | 0.38 |
| OURS-tk | 1.606 | 1.346 | 0.44 |
Main Results
Tables 1 and 2 list the automatic evaluation results and the human evaluation results respectively. In terms of Bleu, the proposed model with beam search decoding, namely, OURS-bm, consistently achieves the best scores. Besides, OURS-bm outperforms all compared models on the keyword-overlap-based Hit metrics, suggesting that our model, armed with the Supervised Source Attention (SSA) component and the Topic Inference (TI) component, is beneficial for the generation of informative topical words related to the query. Surprisingly, OURS-bm also obtains better Dist metrics than the baseline models. After replacing beam search with top-k sampling, our model (OURS-tk) is further enhanced in diversity modeling, reaching 0.107 and 0.544 on Dist-1 and Dist-2 respectively.
Regarding the more reliable human evaluations, both OURS-bm and OURS-tk are among the top-ranked models. Specifically, despite its unsatisfactory results on the automatic Bleu and Hit metrics, OURS-tk performs the best on the manually annotated Relevance metric, with a 5% improvement over the current state-of-the-art MMPMS model (1.606 vs. 1.528). Meanwhile, OURS-bm, the best model on the automatic relevance metrics, still yields competitive results on Relevance. This is reasonable because some words not appearing in the query/references, especially those not frequently used, are still related to the topic discussed in the conversations. At the same time, such inconsistency between automatic and human evaluations demonstrates the effectiveness of top-k sampling, a randomization-over-maximization decoding strategy, in discovering infrequent but meaningful patterns for the STC task.
We now turn to the performance of the other compared methods. Inheriting the powerful modeling capability of the Transformer, TFM-S2S obtains the best automatic relevance scores as well as the second-best Relevance among the baselines. TFM-LM, another Transformer-based baseline following the language model formulation in our paper, performs worse than TFM-S2S on all of the metrics except Fluency, supporting the postulation that the explanation-away issue causes the language model to produce fluent but topically irrelevant responses. Despite this, TFM-LM outperforms LSTM-LM and LSTM-S2S, demonstrating the superiority of the Transformer over LSTM in response generation. Owing to the re-ranking mechanism, the MMI model is the strongest baseline on diversity modeling, but OURS-bm/OURS-tk still achieves approximately 14%/55% improvement on Dist-2.
Ablation Study
In order to track the source of the performance gains, we also conduct an ablation study on OURS-tk. The corresponding automatic and human evaluation results are shown in the second group of Table 1 and Table 2. As expected, the model without the relevance-promoting design, i.e., OURS-tk w/o SSA & TI, is the worst one on the relevance metrics. OURS-tk w/o SSA and OURS-tk w/o TI, the variants incorporating either TI or SSA for relevance modeling, boost the Relevance score by $\sim $17% and $\sim $18% respectively. Although they are comparable on the relevance metrics, the former achieves higher diversity scores (Dist-2: 0.521 vs. 0.441). We attribute this phenomenon to the TI component, which exploits more related topical words mentioned in the multiple references. With the help of both the SSA component and the TI component, OURS-tk becomes the best model on the Relevance and Dist metrics, demonstrating the necessity of relevance modeling for the transformer language model. Another interesting finding is that the SSA component decreases the Fluency score (see the results of OURS-tk w/o TI), which indicates that fighting against the explanation-away issue by incorporating additional query context may come at the cost of corrupting the language model.
Case Study
Figure 3 shows example responses generated by our model and the most competitive baseline models. OURS-tk, which explicitly incorporates the query context and exploits the tokens potentially related to the query, always produces meaningful and informative responses. Taking Queries #1 and #2 as examples, the generated responses accurately respond to the query because they mention “flower ladder”/“matcha” and “cream”, which are exactly the topics discussed in the conversations. The response for Query #3 can easily engage the user in the conversation and thus is also a meaningful prediction. The outputs of TFM-LM are generally fluent. However, due to the explanation-away issue, TFM-LM tends to generate an irrelevant response (Case #1) or a response with phrase repetition (Case #2). Under the sequence-to-sequence formulation, TFM-S2S obtains responses moderately related to the corresponding queries, although the third output, which directly copies part of the source text (i.e., the query), is still unsatisfactory. MMPMS and MMI, the models aiming to promote diversity, may yield irrelevant responses.
Further Discussions on Top-k Sampling
We further investigate the impact of top-k sampling on the STC models. First, we conduct additional automatic and human evaluations on the baseline models, with results shown in Table 3. As can be seen, top-k sampling consistently improves the Dist2 score by a large margin on all models, but the Relevance scores of LSTMLM, TFMLM and TFMS2S decrease after top-k sampling is applied. The variation trends of Fluency across the evaluated models are also inconsistent. These observations suggest that top-k sampling is a simple yet effective way to achieve diverse response generation, but it should be applied with care because its effect on relevance and fluency is uncertain.
As discussed in the Case Study, transformer-based models that adopt beam search tend to generate responses with repetition and responses that directly copy the query. We investigate here whether top-k sampling can help solve these issues. Figure 4 depicts the ratios of test-set responses that exhibit phrase repetition and query copying. Top-k sampling greatly reduces the query copy rate (about 72% on average) and almost eliminates phrase repetition in the transformer-based models. However, note that Table 3 shows that both TFMLM and TFMS2S perform worse on Relevance after using top-k sampling. We consider these results consistent with human perception, because enriching the morphology of responses via a sampling-based decoding strategy inevitably introduces irrelevant information, which degrades the relevance score. Notably, the proposed model (i.e., OURS) is not affected on relevance, owing to its capability of filtering out some topically irrelevant candidates for the sampling process.
| Models | Relevance ($\Delta$) | Fluency ($\Delta$) | Dist2 ($\Delta$) |
| --- | --- | --- | --- |
| LSTMLMtk | 1.111 (-0.09) | 1.270 (-0.03) | 0.383 (+0.29) |
| LSTMS2Stk | 1.439 (+0.05) | 1.265 (-0.20) | 0.490 (+0.35) |
| TFMLMtk | 1.273 (-0.14) | 1.368 (+0.04) | 0.441 (+0.18) |
| TFMS2Stk | 1.270 (-0.15) | 1.321 (+0.15) | 0.507 (+0.22) |
| OURStk | 1.606 (+0.10) | 1.346 (-0.13) | 0.544 (+0.20) |
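For readers unfamiliar with the decoding strategy discussed in this section, the following is a minimal sketch of a single top-k sampling step: keep the k highest-scoring candidates, renormalize their scores with a softmax, and draw one token at random. The function name `top_k_sample`, the plain-list representation of the logits and the `rng` parameter are illustrative assumptions and are not taken from our implementation.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token id from the k highest-scoring entries of `logits`.

    `logits` is a list of unnormalized scores, one per vocabulary entry.
    This is an illustrative sketch, not the released implementation.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the kept candidates (shifted for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one candidate in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]
```

With k = 1 the procedure reduces to greedy (maximization-based) decoding, while larger k trades some relevance for diversity, which matches the trend reported in Table 3.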
Related Work
Short Text Conversation. Short Text Conversation (STC) is usually formulated as a conditional text generation task (?; ?). The sequence-to-sequence (Seq2Seq) encoder-decoder framework (?; ?; ?) and its variants have been studied extensively for solving this task. ? ? introduce diversity-promoting decoding strategies into the Seq2Seq model. Some (?; ?; ?; ?; ?) attempt to guide the Seq2Seq model to generate keyword/topic-aware responses, while others (?; ?; ?) try to control the response generation with additional retrieved data. Advanced techniques such as RL, GAN and VAE are also considered for improving the conversational experience (?; ?; ?; ?).
Transformer-based Language Model. Deep transformer-based architecture (?) has led to significant performance gains on the language modeling task (?; ?; ?), compared to the existing CNN/RNN-based architectures (?; ?; ?). Meanwhile, GPT-2 (?) and UniLM (?) are the pioneering works that adapt the transformer language model to conditional text generation tasks.
Conclusion
In this paper, we present a language-model-based solution, instead of the traditional Seq2Seq paradigm, for handling Short-Text Conversation (STC). We first tailor-make a training strategy to adapt the language model to the STC task. Then, we propose a relevance-promoting transformer language model to distill the relevance clues from the query as well as the topics inferred from the references, and incorporate them into the generation. Moreover, we explore the use of top-k sampling for the STC task to further improve response diversity. Experimental results on a large-scale STC dataset validate that our model is superior to the compared models on both relevance and diversity under automatic and human evaluations.
References
Appendices
Training Details
Our model consists of 6 decoder-only transformer layers with masked self-attention (i.e., $L=6$), where the hidden size ${\mathrm{dim}}_{h}$, the number of heads and the feed-forward size are 512, 8 and 1024 respectively. The weights ${\gamma}_{1}$ and ${\gamma}_{2}$ for ${\mathcal{L}}^{\mathrm{src}}$ and ${\mathcal{L}}^{\mathrm{kwd}}$ are set to 1.0 and 0.2. We do not use pre-trained word/character embeddings; instead, we randomly initialize the parameters of the token embedding layer. We employ Adam (?) as the optimizer with an initial learning rate of 1e-4, and we apply linear warmup over the first 10,000 training steps. The batch size is 32 and we train the model for up to 20 epochs. We evaluate the model every 30,000 steps and select the checkpoint that performs best on the validation set for producing the final results.
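As an illustration of the warmup setting above, the sketch below computes the learning rate at a given training step. It assumes a peak learning rate of 1e-4 reached after 10,000 steps; holding the rate constant after warmup is an assumption, since the schedule after the warmup phase is not specified here. The name `learning_rate` is hypothetical.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup from 0 to `peak_lr` over `warmup_steps` steps.

    Assumption: the rate stays at `peak_lr` after warmup; the text does
    not state what happens once the warmup phase is over.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    # Fraction of the warmup completed, capped at 1.0 once warmup ends.
    return peak_lr * min(1.0, step / warmup_steps)
```

For example, halfway through warmup (step 5,000) this schedule yields half of the peak rate.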
Human Evaluations
Apart from automatic evaluations, we also conduct human evaluations. Specifically, we randomly sample 100 queries and recruit five helpers to judge the Relevance, Fluency and Acceptance of the responses generated by our model and the baselines. The rating criteria, identical to those in (?), are as follows:
$\bullet $ Relevance: +3: relevant as well as interesting; +2: relevant, including the generic responses; +1: relevant at a distant level; 0: not relevant at all.
$\bullet $ Fluency: +2: fluent; +1: readable but with some grammar mistakes; 0: unreadable.
$\bullet $ Acceptance: the ratio of acceptable responses. Specifically, an acceptable response is a response with Relevance $\ge 2$ and Fluency $\ge 1$.
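The Acceptance criterion can be sketched as follows. The sketch assumes that each response already carries a single (Relevance, Fluency) rating pair; how the ratings of the five helpers are aggregated into that pair is not covered here, and the function name `acceptance_ratio` is hypothetical.

```python
def acceptance_ratio(ratings):
    """Fraction of responses with Relevance >= 2 and Fluency >= 1.

    `ratings` is an iterable of (relevance, fluency) pairs, one pair per
    response (assumed to be aggregated over annotators beforehand).
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rated response is required")
    accepted = sum(1 for rel, flu in ratings if rel >= 2 and flu >= 1)
    return accepted / len(ratings)
```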
Obtaining Informative Query Words
The supervision signals ${\mathbf{y}}^{\mathrm{src}}$ in Eq. 5 are built from the informative words of each query. The basic idea is that a query word with a strong semantic relation to the corresponding response should be regarded as an informative word. The procedure is as follows:

1. Use a keyword extractor (here, the jieba keyword extraction toolkit, available at https://github.com/fxsjy/jieba) to obtain the keywords for each response in the training set.

2. Define the semantic relation score between a query word and the response as the maximal pointwise mutual information (PMI) between the query word and the response keywords.

3. Select the top-ranking query words in terms of the calculated semantic relation scores as the informative words.
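Steps 2 and 3 of the procedure can be sketched as below. Several details are assumptions made only for illustration: PMI is estimated from pair-level co-occurrence counts over the training (query, response keywords) pairs, word pairs that never co-occur receive a score of negative infinity, and the number of selected words `top_n` is a free parameter. The names `build_pmi` and `informative_words` are hypothetical.

```python
import math
from collections import Counter


def build_pmi(pairs):
    """Return pmi(w, k) estimated from (query_words, response_keywords) pairs.

    Assumption: probabilities are relative frequencies of pairs in which
    a query word / response keyword occurs at least once.
    """
    n = len(pairs)
    q_count, k_count, joint = Counter(), Counter(), Counter()
    for query_words, keywords in pairs:
        q_set, k_set = set(query_words), set(keywords)
        q_count.update(q_set)
        k_count.update(k_set)
        joint.update((w, k) for w in q_set for k in k_set)

    def pmi(w, k):
        c = joint[(w, k)]
        if c == 0:
            return float("-inf")  # never co-occur
        # log p(w, k) / (p(w) p(k)) with counts normalized by n.
        return math.log(c * n / (q_count[w] * k_count[k]))

    return pmi


def informative_words(query_words, keywords, pmi, top_n=2):
    """Rank query words by their maximal PMI with the response keywords."""
    scores = {
        w: max((pmi(w, k) for k in keywords), default=float("-inf"))
        for w in set(query_words)
    }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```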
Obtaining Response Keywords
The proposed Topic Inference (TI) component aims to refine the query representation with the knowledge inferred from response keywords. First, we employ the jieba keyword extraction toolkit to collect the response keywords. Since one query may correspond to multiple references (i.e., the one-to-many phenomenon), we aggregate the keyword sets of the multiple responses corresponding to the same query. Then, we randomly sample 80% of the keywords in the aggregated set and regard them as the relevant response keywords ${\mathbf{y}}^{\mathrm{kwd}}$ (in Eq. 7) associated with each training instance.
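The aggregation and sampling described above can be sketched as follows. The rounding rule used to turn 80% into a keyword count and the name `relevant_response_keywords` are assumptions for illustration only.

```python
import random


def relevant_response_keywords(keyword_sets, ratio=0.8, rng=random):
    """Aggregate the keyword sets of all references of one query and
    sample a fraction `ratio` of the aggregated keywords.

    Assumption: the number of kept keywords is round(ratio * size).
    """
    # Union over the references of the same query, sorted so that the
    # sampling is reproducible for a seeded `rng`.
    aggregated = sorted(set().union(*keyword_sets))
    n_keep = round(ratio * len(aggregated))
    return rng.sample(aggregated, n_keep)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled keyword set reproducible across runs.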
Obtaining Keywords for Evaluation
As mentioned in the Experiment part, calculating the HitQ and HitR metrics requires a high-precision, low-recall keyword set for each query/response sentence. We first employ the jieba keyword extraction toolkit to obtain an initial keyword set for each query/response. Then, we apply the following rules to guarantee the precision of the obtained query/response keywords:
$\bullet $ Remove the stop words in the initial keyword set.
$\bullet $ Filter out a keyword if its Part-of-Speech tag does not belong to {N, NS, VN, V, F}.
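The two filtering rules can be sketched as below. The sketch assumes that the initial keywords have already been extracted and tagged, and that they are given as (word, tag) pairs; tags are compared case-insensitively because some taggers emit lowercase tags. The names `ALLOWED_POS` and `filter_keywords` are hypothetical.

```python
# Part-of-Speech tags kept by the second rule, stored in lowercase.
ALLOWED_POS = {"n", "ns", "vn", "v", "f"}


def filter_keywords(tagged_keywords, stop_words):
    """Keep keywords that are not stop words and carry an allowed tag.

    `tagged_keywords` is an iterable of (word, pos_tag) pairs and
    `stop_words` is a set of words to discard.
    """
    kept = []
    for word, pos in tagged_keywords:
        if word in stop_words:
            continue  # rule 1: remove stop words
        if pos.lower() not in ALLOWED_POS:
            continue  # rule 2: remove disallowed Part-of-Speech tags
        if word not in kept:
            kept.append(word)  # keep the first occurrence only
    return kept
```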
Additional Details of Experiment
For the automatic evaluation results in Table 1, Bleu and Dist are character-level metrics, while the Hit scores are calculated using word-based overlap statistics.
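To make the character-level convention concrete, the sketch below computes a distinct-n score over characters, using one common formulation: the number of distinct n-grams divided by the total number of n-grams across all responses. Pooling the n-grams over the whole set of responses (rather than averaging per response) is an assumption, and the name `distinct_n` is hypothetical.

```python
def distinct_n(responses, n=2):
    """Character-level distinct-n over a collection of response strings.

    Assumption: n-grams are pooled over all responses before the ratio
    of distinct to total n-grams is taken.
    """
    total = 0
    unique = set()
    for text in responses:
        chars = list(text)  # characters, not words, are the units
        grams = [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```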