Find or Classify? Dual Strategy for Slot-Value Predictions on Multi-Domain Dialog State Tracking

  • 2019-10-08 17:08:39
  • Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S. Yu, Richard Socher, Caiming Xiong


Jian-Guo Zhang¹, Kazuma Hashimoto², Chien-Sheng Wu², Yao Wan³, Philip S. Yu¹, Richard Socher², Caiming Xiong²
¹University of Illinois at Chicago
²Salesforce Research
³Zhejiang University

{jzhan51,psyu}@uic.edu
[email protected] {k.hashimoto,wu.jason,rsocher,cxiong}@salesforce.com
Abstract

Dialog State Tracking (DST) is a core component in task-oriented dialog systems. Existing approaches for DST usually fall into two categories, i.e., picklist-based and span-based. On the one hand, the picklist-based methods perform classification for each slot over a candidate-value list, under the condition that a pre-defined ontology is accessible. However, this is impractical in industry since it is hard to get full access to the ontology. On the other hand, the span-based methods track values for each slot by finding text spans in the dialog context. However, due to the diversity of value descriptions, it is hard to find a particular string in the dialog context. To mitigate these issues, this paper proposes a Dual Strategy for DST (DS-DST) to borrow advantages from both the picklist-based and span-based methods, by classifying over a picklist or finding values from a slot span. Empirical results show that DS-DST achieves state-of-the-art joint accuracy, i.e., 51.2% on the MultiWOZ 2.1 dataset, and 53.3% when the full ontology is accessible.


1 Introduction

With the prevalence of virtual assistants such as Google Assistant, Cortana and Alexa, task-oriented dialog systems are playing important roles in facilitating our daily life, such as booking hotels, reserving restaurants and making traveling plans. Dialog State Tracking (DST) is a core component of task-oriented dialog systems [gao2019neural, young2013pomdp]. It estimates the state of a conversation based on the current utterance and the conversational history. In DST, a state consists of a set of <domain,slot,value> triplets, which represent the values of requested slots given the active domain or intent at the current turn. DST aims to track all the states accumulated across the conversational turns. Fig. 1 shows a dialogue with corresponding annotated turn states.
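As an illustration of how a state accumulates over turns, the sketch below stores triplets in a dictionary keyed by (domain, slot). The `update_state` helper and the example values are hypothetical and are not taken from the paper or the dataset.

```python
from typing import Dict, List, Tuple

# A dialog state maps (domain, slot) pairs to values.
State = Dict[Tuple[str, str], str]


def update_state(previous: State, turn_triplets: List[Tuple[str, str, str]]) -> State:
    """Return the accumulated state after adding the triplets of the current turn."""
    state = dict(previous)  # copy, so the state of earlier turns stays unchanged
    for domain, slot, value in turn_triplets:
        state[(domain, slot)] = value  # a later mention overwrites an earlier one
    return state


# Hypothetical two-turn dialog: the second turn adds one slot and revises another.
turn_1 = [("hotel", "price range", "cheap"), ("hotel", "parking", "yes")]
turn_2 = [("hotel", "book stay", "3"), ("hotel", "price range", "moderate")]

state_1 = update_state({}, turn_1)
state_2 = update_state(state_1, turn_2)
print(state_2)
```

The tracker has to output the full accumulated state at every turn, which is why a single wrong value makes the whole turn count as incorrect under the joint accuracy metric used later.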

Figure 1: An example of the dialog state tracking process for booking a hotel and reserving a restaurant. Each turn contains a user utterance (grey) and a system utterance (orange). The dialog state tracker (green) tracks all the <domain,slot,value> triplets until the current turn. Blue denotes the new state that appears at that turn. Best viewed in color.

Traditional approaches for DST usually rely on hand-crafted features and domain-specific lexicons (besides the ontology) [henderson2014word, zilka2015incremental, wen2016network, mrkvsic2015multi]. Recent data-driven deep learning models have shown promising performance in DST [xu2018end, wu2019transferable, lee2019sumbt, goel2019hyst, chao2019bert, ren2018towards, ramadan2018large, liu2017end, zhong2018global], and they can be categorized into two classes [xu2018end, gao2019dialog, ramadan2018large, zhong2018global], i.e., picklist-based and span-based. The picklist-based approaches [ramadan2018large, zhong2018global] treat domain-slot pairs as picklist-based slots, whose values are predicted by classification over a candidate-value list. They usually require full access to the pre-defined ontology. In practice, however, often only a partial ontology is available, since a full ontology is hard and expensive to obtain in industry. Even if a full ontology exists, it is computationally expensive to enumerate all the values when the ontology is very large and diverse [wu2019transferable, xu2018end], e.g., the values of a time slot could have unlimited choices. The span-based approaches [gao2019dialog, xu2018end] treat domain-slot pairs as span-based slots, whose values are found through span matching with start and end positions in the dialog context. However, it is nontrivial to handle situations where values do not appear in the dialog context or are described in various ways by users.

To tackle the aforementioned challenges and borrow advantages from both worlds, this paper proposes to treat each domain-slot pair of a dialog state as either a span-based slot or a picklist-based slot. The value of each span-based slot is found through span matching with start and end positions in the dialog context, and the value of each picklist-based slot is selected from the corresponding candidate-value list. Whether a slot is treated as span-based or picklist-based is decided by human heuristics. For example, when users book a hotel, requests for parking usually take a limited set of values such as yes or no, so we treat these kinds of slots as picklist-based slots. By contrast, the number of days the user will stay can take an unlimited number of values that can be found in the context, so we treat these kinds of slots as span-based slots. Since our approach can also treat all slots as span-based slots or all slots as picklist-based slots according to the real scenario or dataset, it remains flexible when the dialog system has access to the full ontology or when all the values can be found in the dialog context. Inspired by recent successes in visual question answering [xiong2016dynamic] and text reading comprehension [chen2018neural], where the former works with candidate-value lists and the latter with text spans defined by start and end positions, we design a Dual-Strategy Dialog State Tracking model (DS-DST), which builds on BERT question answering models [devlin2018bert] and enables slot-value predictions for both span-based slots and picklist-based slots.

Our contributions are summarized as follows.

  • We design an approach which treats domain-slot pairs as span-based slots and picklist-based slots based on human heuristics. Our approach mitigates the limitations of relying on a fixed vocabulary and of handling values that cannot be found as spans, and it is flexible across different datasets and real scenarios.

  • We build a dual strategy model for multi-domain DST, which achieves state-of-the-art results on the MultiWOZ 2.1 dataset [eric2019multiwoz].

2 Related Work

DST tracks dialog states in complicated conversations across multiple domains with many slots. It has been a hot research topic during the past few years, along with the development of the Dialogue State Tracking Challenges [williams2013dialog, henderson2014second, henderson2014third, kim2017fourth, kim2016fifth, rastogi2019towards]. Traditional approaches usually rely on hand-crafted features or domain-specific lexicons [henderson2014word, wen2016network], which are difficult to extend to new domains. In addition, they require a pre-defined ontology, in which the values of a slot are constrained by a set of candidate values [ramadan2018large, liu2017end, zhong2018global, lee2019sumbt]. Furthermore, these approaches are hard to adapt to unseen values and large vocabularies. To tackle these issues, several methods have been proposed to extract slot values through span matching with start and end positions in the dialog context. For example, [xu2018end] utilizes an attention-based pointer network to copy values from the dialog context. [gao2019dialog] treats DST as a reading comprehension problem and incorporates a slot carryover model to copy states from previous conversational turns. However, tracking states only from the dialog context is not enough, as many values in DST cannot be found in the context due to annotation errors or diverse descriptions of slot values from users. On the other hand, pre-trained models such as BERT [devlin2018bert] and GPT [radfordimproving] have shown promising performance on many downstream tasks. Among them, DSTreader [gao2019dialog] utilizes BERT as word embeddings for dialog contexts, SUMBT [lee2019sumbt] employs BERT to extract representations of candidate values, and BERT-DST [chao2019bert] adopts BERT to encode the inputs of the user turn as well as the previous system turn. Different from these approaches, our method is based on BERT from the perspective of visual question answering [xiong2016dynamic] and text reading comprehension [chen2018neural].

Recent generative approaches [lei2018sequicity, wu2019transferable] generate slot values for DST without relying on fixed vocabularies and spans. However, such generative methods may produce ill-formatted strings (e.g., repeated words) when generating long strings, which are common in DST. For example, a hotel address may be long, and a small difference makes the whole dialog state tracking incorrect. By contrast, both the picklist-based and span-based methods can rely on existing strings rather than generating them.

Figure 2: The architecture of the proposed DS-DST model. The left part is a fixed BERT model which outputs the representations of candidate values in the list for each picklist-based slot (purple). The right part is the other, fine-tuned BERT model, which outputs representations for the concatenation of each domain-slot pair and the recent dialog context. $\theta_{gate}$ is used for the slot-gate classification for all domain-slot pairs. $\theta_{span}$ is used to predict spans for each span-based domain-slot pair (orange). $\theta_{picklist}$ is used for cosine similarity matching over the candidate-value list for each picklist-based domain-slot pair (purple). Best viewed in color.

3 DS-DST: a Dual Strategy for DST

Let $X=\{(U_1^{sys},U_1^{usr}),\dots,(U_T^{sys},U_T^{usr})\}$ denote a set of pairs of a system utterance $U_t^{sys}$ and a user utterance $U_t^{usr}$ ($1 \le t \le T$), given a dialogue context with $T$ turns. Each turn $(U_t^{sys},U_t^{usr})$ talks about a particular domain (e.g., hotel), and a certain number of slots (e.g., price range) are associated with the domain. We denote all the $N$ possible domain-slot pairs as $S=\{S_1,\dots,S_N\}$, where each domain-slot pair has $n$ tokens. The task of DST is tracking the states over the whole dialogue; in other words, at each turn $t$ we need to predict the values for each pair in $S$ (e.g., <hotel,pricerange,cheap>) considering the context $X_t=\{(U_1^{sys},U_1^{usr}),\dots,(U_t^{sys},U_t^{usr})\}$, where $X_t$ has $m$ tokens. We follow recent strategies [wu2019transferable, xu2018end] to predict the values for all the domain-slot pairs in $S$ at each turn. Our intuition is that we can find values from pre-defined picklists if we have access to the partial or full candidate-value lists. Otherwise, we need to directly find the values as text spans in the dialogue context. We call the former type a picklist-based slot, and the latter one a span-based slot. Here we assume that $M$ domain-slot pairs in $S$ are treated as span-based slots, and the remaining $N-M$ pairs as picklist-based slots. Each picklist-based slot has $C$ possible candidate values, i.e., $V_1,\dots,V_C$, where $C$ is the size of the picklist, and each value has $c$ tokens.

We then propose a novel dual strategy for DST. Fig. 2 shows the overview of our model. We first utilize a pre-trained BERT [devlin2018bert] to encode information about the dialogue context Xt along with each domain-slot pair in S, to obtain contextualized representations conditioned on the domain-slot information. We then use a slot gate to handle special types of values. For the span-based slots, we utilize a two-way linear mapping to find text spans. For the picklist-based slots, we select the most plausible values from the picklists based on the contextual representation.

3.1 Slot-Context Encoder

We employ a pre-trained BERT [devlin2018bert] to encode the domain-slot types and dialog contexts. For the $j$-th domain-slot pair and the dialog context $X_t$ at turn $t$, we concatenate them and obtain the corresponding representations:

$$R_{tj} = \mathrm{BERT}\left([\mathtt{CLS}] \oplus S_j \oplus [\mathtt{SEP}] \oplus X_t\right), \tag{1}$$

where [CLS] is a special token added in front of every sample, [SEP] is a special separator token, and $\oplus$ denotes concatenation. The outputs of BERT in Eq. (1) can be decomposed as $R_{tj}=[r_{tj}^{\mathtt{CLS}}, r_{tj}^{1}, \dots, r_{tj}^{K}]$, where $r_{tj}^{\mathtt{CLS}}$ is the aggregate representation of the $K$ sequential input tokens and $[r_{tj}^{1}, \dots, r_{tj}^{K}]$ are the token-level representations. They will be used for slot-value predictions in the following sections, and BERT will be fine-tuned during the training process.
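The input layout of Eq. (1) can be sketched as plain string construction. This is a minimal illustration only: the `build_encoder_input` helper, the whitespace joining and the example utterances are assumptions, and a real BERT tokenizer would additionally apply WordPiece tokenization and typically append a final [SEP].

```python
from typing import List, Tuple


def build_encoder_input(domain_slot: str, turns: List[Tuple[str, str]]) -> str:
    """Concatenate one domain-slot pair with the dialog context up to turn t."""
    # Flatten (system utterance, user utterance) pairs in chronological order.
    context = " ".join(f"{sys_utt} {usr_utt}".strip() for sys_utt, usr_utt in turns)
    return f"[CLS] {domain_slot} [SEP] {context}"


# Hypothetical context with two turns; the first system utterance is empty.
turns = [
    ("", "i need a cheap hotel"),
    ("what area ?", "the north please"),
]
encoder_input = build_encoder_input("hotel price range", turns)
print(encoder_input)
```

One such sequence is built per domain-slot pair at every turn, so a turn with 30 domain-slot pairs yields 30 encoder inputs that share the same context.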

3.2 Slot-Gate Classification

As there are many domain-slot pairs in multi-domain dialogues, it is nontrivial to correctly predict whether a domain-slot pair appears at each turn of the dialogue. Here we add a slot-gate classification module [wu2019transferable, xu2018end] to our neural network. Specifically, at turn $t$, the classifier makes a decision among {none, dontcare, prediction}, where none denotes that a domain-slot pair is not mentioned at this turn, dontcare implies that the user can accept any value for this slot, and prediction represents that the slot should be processed by the model with a real value. We utilize $r_{tj}^{\mathtt{CLS}}$ for the slot-gate classification, and the probability for the $j$-th domain-slot pair at turn $t$ is calculated as:

$$P_{tj}^{gate} = \mathrm{softmax}\left(W_{gate} \cdot (r_{tj}^{\mathtt{CLS}})^{\top} + b_{gate}\right), \tag{2}$$

where $W_{gate}$ and $b_{gate}$ are the learnable weight matrix and bias, respectively.

The loss for slot-gate classification is computed as:

$$\mathcal{L}_{gate} = \sum_{t=1}^{T}\sum_{j=1}^{N} -\log\left(P_{tj}^{gate} \cdot (y_{tj}^{gate})^{\top}\right), \tag{3}$$

where $y_{tj}^{gate}$ is the one-hot gate label for the $j$-th domain-slot pair at turn $t$.
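The gate of Eqs. (2) and (3) amounts to a three-way softmax classifier with a cross-entropy loss. The sketch below shows this for a single domain-slot pair at a single turn; the two-dimensional vector, the weights and the helper names are toy assumptions, whereas the real model uses the 768-dimensional [CLS] output of BERT.

```python
import math
from typing import List

GATE_LABELS = ["none", "dontcare", "prediction"]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_probabilities(r_cls: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Eq. (2): linear map of the aggregate representation followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, r_cls)) + b for row, b in zip(weight, bias)]
    return softmax(logits)


def gate_loss(probs: List[float], gold_label: str) -> float:
    """One summand of Eq. (3): negative log-probability of the gold gate label."""
    return -math.log(probs[GATE_LABELS.index(gold_label)])


# Toy 2-dimensional representation and parameters (assumed values).
r_cls = [0.5, -1.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per gate label
bias = [0.0, 0.0, 0.0]

probs = gate_probabilities(r_cls, weight, bias)
predicted_gate = GATE_LABELS[probs.index(max(probs))]
print(predicted_gate, gate_loss(probs, "none"))
```

Only pairs whose gate is prediction need a value from the span or picklist module; none and dontcare are final outputs by themselves.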

3.3 Span-Based Slot-Value Prediction

For each span-based slot, its value can be mapped to a span with start and end positions in the dialog context, e.g., the slot leaveat of the taxi domain has the span 4:30pm in the context. We take the token-level representations $[r_{tj}^{1}, \dots, r_{tj}^{K}]$ of the dialog context as input, and apply a two-way linear mapping to get a start vector $\alpha_{tj}^{start}$ and an end vector $\alpha_{tj}^{end}$:

$$\left[\alpha_{tj}^{start}, \alpha_{tj}^{end}\right] = W_{span} \cdot \left([r_{tj}^{1}, \dots, r_{tj}^{K}]\right)^{\top} + b_{span}, \tag{4}$$

where $W_{span}$ and $b_{span}$ are the learnable weight matrix and bias, respectively.

The probability of the $i$-th word being the start position of the span is computed as $p_{tj}^{start_i} = \frac{e^{\alpha_{tj}^{start} \cdot r_{tj}^{i}}}{\sum_{k} e^{\alpha_{tj}^{start} \cdot r_{tj}^{k}}}$, and the loss for the start position prediction can be calculated as:

$$\mathcal{L}_{start} = \sum_{t=1}^{T}\sum_{j=1}^{M} -\log\left(P_{tj}^{start} \cdot (y_{tj}^{start})^{\top}\right), \tag{5}$$

where $y_{tj}^{start}$ is the one-hot start position label for the $j$-th domain-slot pair at turn $t$.

Similarly, we can get the loss $\mathcal{L}_{end}$ for the end position prediction. The total loss $\mathcal{L}_{span}$ for the span-based slot-value prediction is the sum of $\mathcal{L}_{start}$ and $\mathcal{L}_{end}$.
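A minimal sketch of the span module follows, taking per-token start and end scores as given toy numbers instead of computing them from BERT outputs. The decoding rule (pick the highest-scoring pair with start ≤ end and a bounded length) is an assumption made for illustration; the text above only specifies the training losses.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_loss(start_probs: List[float], end_probs: List[float], gold_start: int, gold_end: int) -> float:
    """Sum of the start and end negative log-likelihoods for one slot at one turn."""
    return -math.log(start_probs[gold_start]) - math.log(end_probs[gold_end])


def best_span(start_probs: List[float], end_probs: List[float], max_len: int = 10) -> Tuple[int, int]:
    """Assumed decoding rule: the most probable (start, end) pair with start <= end."""
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Whitespace tokens stand in for WordPiece tokens; the scores are toy values.
tokens = ["i", "need", "a", "taxi", "at", "4:30pm"]
start_probs = softmax([0.1, 0.0, 0.2, 0.3, 0.5, 3.0])
end_probs = softmax([0.0, 0.1, 0.0, 0.2, 0.4, 2.5])

start, end = best_span(start_probs, end_probs)
print(tokens[start:end + 1], span_loss(start_probs, end_probs, 5, 5))
```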

3.4 Picklist-Based Slot-Value Prediction

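Each picklist-based slot has several candidate values, e.g., the slot pricerange in the hotel domain has the possible values {cheap, expensive, moderate}. At turn $t$, for the $j$-th domain-slot type, we first use a separate pre-trained BERT to get the aggregate representations of the corresponding values:

$$Y_j = \mathrm{BERT}\left([\mathtt{CLS}] \oplus V_j \oplus [\mathtt{SEP}]\right), \tag{6}$$

where $Y_j=\{y_{1}^{\mathtt{CLS}}, \dots, y_{L}^{\mathtt{CLS}}\}$ and $L$ is the number of candidate values. Note that during the training process the model parameters of this separate BERT are fixed.

We formulate a relevance score of the aggregate representation given a reference candidate by the cosine similarity [lin2017adversarial]:

$$\cos\left(r_{tj}^{\mathtt{CLS}}, y_{l}^{\mathtt{CLS}}\right) = \frac{r_{tj}^{\mathtt{CLS}} \cdot (y_{l}^{\mathtt{CLS}})^{\top}}{\left\| r_{tj}^{\mathtt{CLS}} \right\| \left\| y_{l}^{\mathtt{CLS}} \right\|}, \tag{7}$$

where $r_{tj}^{\mathtt{CLS}}$ and $y_{l}^{\mathtt{CLS}}$ are the aggregate representations from the slot-context encoder and the reference candidate value, respectively.

During the training process, we employ a hinge loss to enlarge the difference between the similarity of $r_{tj}^{\mathtt{CLS}}$ to the target value and that to the most similar non-target value in the candidate-value list:

$$\mathcal{L}_{picklist} = \sum_{t=1}^{T}\sum_{j=1}^{N-M} \max\left(0,\; \lambda - \cos\left(r_{tj}^{\mathtt{CLS}}, y_{target}^{\mathtt{CLS}}\right) + \max_{y_{l}^{\mathtt{CLS}} \neq y_{target}^{\mathtt{CLS}}} \cos\left(r_{tj}^{\mathtt{CLS}}, y_{l}^{\mathtt{CLS}}\right)\right), \tag{8}$$

where $\lambda$ is a constant margin and $l \in \{1, \dots, L\}$.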
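Eqs. (7) and (8) can be sketched with plain lists standing in for the BERT representations. The two-dimensional vectors and the helper names are toy assumptions; the margin of 0.5 is the value reported in the training details.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Eq. (7): cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def picklist_hinge_loss(r_cls: List[float], candidates: Dict[str, List[float]],
                        target: str, margin: float = 0.5) -> float:
    """One summand of Eq. (8) for a single picklist-based slot at a single turn."""
    target_sim = cosine(r_cls, candidates[target])
    # Most similar candidate other than the target (the hardest negative).
    hardest_negative = max(cosine(r_cls, vec) for name, vec in candidates.items() if name != target)
    return max(0.0, margin - target_sim + hardest_negative)


def predict_value(r_cls: List[float], candidates: Dict[str, List[float]]) -> str:
    """At inference time, select the candidate with the highest cosine similarity."""
    return max(candidates, key=lambda name: cosine(r_cls, candidates[name]))


# Toy representations for the hotel price range picklist (assumed values).
candidates = {"cheap": [1.0, 0.0], "moderate": [0.0, 1.0], "expensive": [-1.0, 0.0]}
r_cls = [0.9, 0.1]

print(predict_value(r_cls, candidates))
print(picklist_hinge_loss(r_cls, candidates, target="cheap"))
print(picklist_hinge_loss(r_cls, candidates, target="moderate"))
```

The loss is zero once the target beats the hardest negative by at least the margin, so well-separated examples stop contributing gradients.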

3.5 Training Objective

During training, the above three modules are jointly trained and share the parameters of BERT. We optimize the sum of the three losses:

$$\mathcal{L}_{total} = \mathcal{L}_{gate} + \mathcal{L}_{span} + \mathcal{L}_{picklist}. \tag{9}$$

4 Experimental Setup

| Domain | Slots | Train split | Validation split | Test split |
|---|---|---|---|---|
| Hotel | price range, type, parking, book stay, book day, book people, area, stars, internet, name | 3381 | 416 | 394 |
| Train | destination, day, departure, arrive by, book people, leave at | 3103 | 484 | 494 |
| Restaurant | food, price range, area, name, book time, book day, book people | 3813 | 438 | 437 |
| Attraction | area, name, type | 2717 | 401 | 395 |
| Taxi | leave at, destination, departure, arrive by | 1654 | 207 | 195 |
| All Domains | - | 8421 | 1001 | 1000 |
| Total Turns | - | 56668 | 7374 | 7368 |

Table 1: The dataset information of MultiWOZ 2.1. There are in total 30 domain-slot pairs of 5 selected domains, as shown in the first two columns. The last three columns show the number of dialogues related to each domain in each split. The last two rows show the total number of dialogues and the total number of conversational turns for all the domains.

4.1 Dataset

To demonstrate the performance of our DS-DST, we use the recently released MultiWOZ 2.1 dataset [eric2019multiwoz]. It is one of the largest multi-domain dialogue corpora to date, with seven distinct domains and over 10000 dialogues. Compared with the original MultiWOZ 2.0 dataset [budzianowski2018multiwoz], it corrects dialog states, typos and mis-annotations to reduce substantial noise, which makes the dataset more challenging (more details can be found in [eric2019multiwoz]). As hospital and police contain very few dialogues (5% of total dialogues), and they only appear in the training dataset, we ignore them in our experiments, following [wu2019transferable]. We adopt only five domains (train, restaurant, hotel, taxi, attraction) and obtain 30 domain-slot pairs in total for the experiments. Table 1 summarises the domain-slot pairs and their corresponding statistics. We follow the standard training/validation/test split provided in the original dataset [budzianowski2018multiwoz, eric2019multiwoz], and the data pre-processing provided in [wu2019transferable]. Instead of directly using the incomplete ontology [lee2019sumbt] of MultiWOZ 2.1 as the candidate-value list for each picklist-based slot, we construct the candidate-value list for each picklist-based slot by traversing the dataset.
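Constructing candidate-value lists by traversing the dataset can be sketched as collecting every annotated value per domain-slot pair. The data layout (a dialog as a list of per-turn state dictionaries), the `build_picklists` helper and the example values are assumptions for illustration; the text does not specify which splits are traversed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TurnState = Dict[Tuple[str, str], str]


def build_picklists(dialogs: List[List[TurnState]]) -> Dict[Tuple[str, str], List[str]]:
    """Collect every value annotated for each (domain, slot) pair."""
    picklists = defaultdict(set)
    for dialog in dialogs:
        for turn_state in dialog:
            for pair, value in turn_state.items():
                picklists[pair].add(value)
    # Sort for a deterministic candidate order.
    return {pair: sorted(values) for pair, values in picklists.items()}


# Two hypothetical dialogs with annotated turn states.
dialogs = [
    [{("hotel", "type"): "hotel"}, {("hotel", "type"): "hotel", ("hotel", "parking"): "yes"}],
    [{("hotel", "type"): "guesthouse", ("hotel", "parking"): "no"}],
]
picklists = build_picklists(dialogs)
print(picklists)
```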

4.2 Models

Because the MultiWOZ dataset has no official pre-processing standard, different pre-processing choices affect the performance evaluation (more details can be found in Sec. 6). The pre-processing provided in [wu2019transferable] is currently suggested by the data providers (http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/). To make a fair comparison, we only adopt state-of-the-art baselines that either follow the same data pre-processing [wu2019transferable] or have results provided by the data providers [eric2019multiwoz]:

  • SpanPtr [xu2018end]: It applies a pointer-network based model to find text spans with start and end pointers for each domain-slot pair.

  • FJST [eric2019multiwoz]: It contains a bidirectional LSTM network to encode the dialog context and a separate feedforward network to predict each dialog state slot.

  • HyST [goel2019hyst]: It is a hybrid approach based on hierarchical RNNs and open-vocabulary generation.

  • DSTreader [gao2019dialog]: It models DST from the perspective of text reading comprehension and uses a pre-trained BERT to provide word embeddings.

  • TRADE [wu2019transferable]: It contains a slot gate module for slot classification and a pointer generator for state generation.

Among the above baselines, SpanPtr and TRADE were originally tested on the DSTC2 dataset [henderson2014second] and the MultiWOZ 2.0 dataset, respectively. To evaluate these two models on the MultiWOZ 2.1 dataset, we use the publicly available code for TRADE (https://github.com/jasonwu0731/trade-dst) and implement SpanPtr ourselves.

For our proposed methods, we design three variants:

  • DST-Span: Similar to [xu2018end], it treats all domain-slot pairs as span-based slots, where the value for each slot is extracted as a text span (string matching) with start and end positions in the dialog context.

  • DST-Picklist: It treats all domain-slot pairs as picklist-based slots, where the value for each slot is found in the candidate-value list.

  • DS-DST: It includes both span-based slots and picklist-based slots. In the default setting, slots related to time and number could have many possible values, e.g., the value of the arriveby slot of the taxi domain could fall anywhere in {00:00-23:59} depending on the user. We treat these kinds of slots as span-based slots, resulting in five slot types across four domains (nine domain-slot pairs in total). The other slots have a limited set of values, e.g., the type slot of the hotel domain has two candidate values {hotel, guesthouse}. We treat these kinds of slots as picklist-based slots, and there are in total 21 such domain-slot pairs.
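The default DS-DST split can be reproduced from the slot inventory in Table 1 and the five span-based slot types named in Appendix A.1 (arrive by, leave at, book stay, book people, book time). The sketch below only counts the resulting pairs; the variable names are illustrative.

```python
# Slot inventory per domain, as listed in Table 1.
DOMAIN_SLOTS = {
    "hotel": ["price range", "type", "parking", "book stay", "book day",
              "book people", "area", "stars", "internet", "name"],
    "train": ["destination", "day", "departure", "arrive by", "book people", "leave at"],
    "restaurant": ["food", "price range", "area", "name", "book time", "book day", "book people"],
    "attraction": ["area", "name", "type"],
    "taxi": ["leave at", "destination", "departure", "arrive by"],
}

# Time- and number-related slot types treated as span-based in the default setting.
SPAN_SLOT_TYPES = {"arrive by", "leave at", "book stay", "book people", "book time"}

all_pairs = [(domain, slot) for domain, slots in DOMAIN_SLOTS.items() for slot in slots]
span_pairs = [pair for pair in all_pairs if pair[1] in SPAN_SLOT_TYPES]
picklist_pairs = [pair for pair in all_pairs if pair[1] not in SPAN_SLOT_TYPES]
span_domains = {domain for domain, _ in span_pairs}

print(len(all_pairs), len(span_pairs), len(picklist_pairs), len(span_domains))
# prints: 30 9 21 4
```

Setting `SPAN_SLOT_TYPES` to the empty set or to all slot types gives the DST-Picklist and DST-Span variants, respectively.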

We evaluate all models using the joint accuracy metric. At each turn, the joint accuracy is 1.0 if and only if all <domain,slot,value> triplets are predicted correctly, and 0 otherwise.
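The metric can be sketched as an exact-match comparison of whole turn states, averaged over turns. Representing a turn state as a dictionary that omits unmentioned slots is an assumption of this sketch, as are the example states.

```python
from typing import Dict, List, Tuple

TurnState = Dict[Tuple[str, str], str]


def joint_accuracy(predictions: List[TurnState], references: List[TurnState]) -> float:
    """Fraction of turns whose predicted state matches the reference state exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must cover the same turns")
    correct = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return correct / len(references)


# Three hypothetical turns; the second prediction has one wrong value.
references = [
    {("hotel", "parking"): "yes"},
    {("hotel", "parking"): "yes", ("hotel", "book stay"): "3"},
    {("hotel", "parking"): "yes", ("hotel", "book stay"): "3", ("taxi", "leave at"): "4:30pm"},
]
predictions = [
    {("hotel", "parking"): "yes"},
    {("hotel", "parking"): "yes", ("hotel", "book stay"): "2"},
    {("hotel", "parking"): "yes", ("hotel", "book stay"): "3", ("taxi", "leave at"): "4:30pm"},
]
score = joint_accuracy(predictions, references)
print(score)
```

Because a turn counts only if every triplet matches, one mismatched slot value makes the entire turn incorrect.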

4.3 Training Details

We employ the pre-trained BERT model (bert-base-uncased) with 12 layers of 768 hidden units and 12 self-attention heads (see https://github.com/huggingface/transformers/tree/master/examples). During the fine-tuning process, we update all the model parameters using the BertAdam [devlin2018bert] optimizer with an initial learning rate of 1e-4. The proportion for learning rate warmup is set to 0.1. We follow the learning rate decay mechanism in [lee2019sumbt], and we apply early stopping based on the joint accuracy on the validation set. The constant margin λ is set to 0.5 for DS-DST and DST-Picklist. The maximum total input sequence length after WordPiece tokenization for BERT is set to 512, and the maximum number of training epochs is set to 5.

| Model | Joint Accuracy |
|---|---|
| SpanPtr [xu2018end] | 29.09% |
| FJST [eric2019multiwoz]† | 38.00% |
| HyST [goel2019hyst]† | 38.10% |
| DSTreader [gao2019dialog]† | 36.40% |
| TRADE [wu2019transferable] | 45.96% |
| DST-Span | 40.39% |
| DS-DST | 51.21% |
| DST-Picklist | 53.30% |

Table 2: Joint accuracy on the test set of MultiWOZ 2.1. †: results reported by [eric2019multiwoz].

5 Experimental Results

Table 2 shows the results on the test set. DS-DST and DST-Picklist achieve the two best results, surpassing the current state-of-the-art TRADE model by 5.25% and 7.34% absolute joint accuracy, respectively. Comparing DST-Span and DS-DST, we find that treating slots as a mix of span-based and picklist-based slots is indeed helpful in multi-domain DST. The performance of DST-Picklist shows that our method could further improve the DST performance when the full ontology or database is accessible.

Among the methods that predict all slot values through span matching with start and end positions in the dialog context, DST-Span, which exploits the strength of BERT, outperforms SpanPtr by 11.30%, and it also outperforms DSTreader, which uses the pre-trained BERT model as word embeddings. We attribute the performance gap between DST-Span and our other two variants to the limitations of span matching, since some values cannot be found in the dialog context due to annotation errors or diverse descriptions from users. For example, the ground truth label for hotel parking is ‘yes’ while it is expressed differently as ‘need parking’ in the user utterances.

5.1 Analysis

In this paper, we design a dual-strategy model, which classifies domain-slot pairs into span-based slots and picklist-based slots based on human heuristics. To further investigate the effectiveness of our methods, we redesign the SpanPtr model, which originally predicts the values of each domain-slot pair through attention-based span matching with start and end positions in the dialog context. The new model, SpanPtr-New, follows the SpanPtr architecture with the exception that some slots are treated as picklist-based slots, following our default setting, and their values are selected from candidate-value lists.

Table 3 presents the comparative results. DS-DST outperforms DST-Span by 10.82%, and SpanPtr-New outperforms SpanPtr by 13.08%, which implies that our dual strategy using both span-based slots and picklist-based slots could largely improve the DST performance when tracking states across different domains. Furthermore, compared with SpanPtr and SpanPtr-New, DST-Span and DS-DST improve the performance by 11.30% and 9.04%, respectively. In addition, DST-Span and DS-DST both achieve better performance than DSTreader. This verifies the effectiveness of our way of applying pre-trained models to the DST task.

| | SpanPtr | SpanPtr-New | DST-Span | DS-DST |
|---|---|---|---|---|
| Joint Accuracy | 29.09% | 42.17% | 40.39% | 51.21% |

Table 3: Joint accuracy on the test set of MultiWOZ 2.1.

| | Concatenation | Slot Description | Question Asking |
|---|---|---|---|
| DS-DST | 54.72% | 55.74% | 53.91% |
| DST-Picklist | 56.33% | 56.75% | 56.17% |

Table 4: Joint accuracy with different representations of the domain-slot pairs on the validation set. ‘Concatenation’ concatenates the domain and slot directly, ‘Slot Description’ applies a sentence description to describe each domain-slot pair, and ‘Question Asking’ represents the domain-slot pair as a question.

As our model is inspired by visual question answering and text reading comprehension, the way we handle domain-slot pairs is similar to question answering models. Here we investigate the impact of different ways of representing domain-slot pairs on the dialog state tracking performance. We design three types of representations: (1) Concatenation: the domain and slot are concatenated directly, e.g., "hotel price range" represents the slot price range of the hotel domain; this is the default setting used in the paper. (2) Slot Description: a short description of the domain-slot pair, e.g., "price range of the hotel". (3) Question Asking: the domain-slot pair interacts with the dialog context in the form of a question, e.g., "what is the price range of the hotel?". More details can be found in the appendix.
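The three representation types can be sketched as a lookup that returns the text placed before [SEP] in the encoder input. The template strings for the three example pairs are copied from the appendix table; the `represent` helper and the dictionary layout are hypothetical.

```python
from typing import Dict, Tuple

# Descriptions and questions copied from the appendix table for three pairs.
TEMPLATES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("hotel", "price range"): {
        "slot_description": "price range of hotel",
        "question_asking": "what is the price range for the hotel?",
    },
    ("train", "arrive by"): {
        "slot_description": "arrival time of the train",
        "question_asking": "what time will the train arrive by?",
    },
    ("taxi", "destination"): {
        "slot_description": "destination of the taxi",
        "question_asking": "where is the destination of the taxi?",
    },
}


def represent(domain: str, slot: str, style: str = "concatenation") -> str:
    """Return the text that represents a domain-slot pair in the encoder input."""
    if style == "concatenation":
        return f"{domain} {slot}"  # default setting: domain and slot joined directly
    return TEMPLATES[(domain, slot)][style]


for style in ("concatenation", "slot_description", "question_asking"):
    print(style, "->", represent("train", "arrive by", style))
```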

Table 4 shows the performance of the different representations on the validation set. The three types of representations achieve similar performance: using slot descriptions is slightly better than the other two types, and the ‘Question Asking’ type does not improve the performance. We hypothesize that this is because our questions use only a few question variations, e.g., ‘what’, ‘where’ and ‘which’, so that similar question types for different slots may affect the DST performance. In real applications, it could be better to use slot descriptions when carefully designed descriptions are available for the different domain-slot pairs.

| | Threshold-10 | Threshold-100 | DST-Span | DS-DST |
|---|---|---|---|---|
| Joint Accuracy | 49.08% | 54.11% | 43.25% | 54.72% |

Table 5: Joint accuracy on the validation set based on different variations of choosing span-based slots and picklist-based slots. Threshold-10 and Threshold-100 mean choosing picklist-based slots based on the size of candidate-value lists.

In the paper’s default setting, the domain-slot pairs related to time and number are treated as span-based slots, following human heuristics. To investigate the effect of how span-based slots and picklist-based slots are chosen, we design two more variations, where we select picklist-based slots by setting thresholds on the sizes of the candidate-value lists: (1) Threshold-10: It treats the domain-slot pairs with no more than 10 candidate values as picklist-based slots, and the other domain-slot pairs as span-based slots, resulting in 15 picklist-based domain-slot pairs; (2) Threshold-100: It treats the domain-slot pairs with no more than 100 candidate values as picklist-based slots, resulting in 23 picklist-based domain-slot pairs.
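The threshold variants can be sketched as a simple partition by picklist size. The candidate-list sizes below are hypothetical placeholders and not statistics of MultiWOZ 2.1; only the rule (at most `threshold` candidate values means picklist-based) comes from the text.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def split_by_threshold(picklist_sizes: Dict[Pair, int], threshold: int) -> Tuple[List[Pair], List[Pair]]:
    """Pairs with at most `threshold` candidate values become picklist-based slots."""
    picklist = sorted(pair for pair, size in picklist_sizes.items() if size <= threshold)
    span = sorted(pair for pair, size in picklist_sizes.items() if size > threshold)
    return picklist, span


# Hypothetical candidate-list sizes for six domain-slot pairs.
picklist_sizes = {
    ("hotel", "type"): 2,
    ("hotel", "parking"): 3,
    ("hotel", "area"): 6,
    ("restaurant", "food"): 90,
    ("taxi", "leave at"): 120,
    ("restaurant", "name"): 180,
}

picklist_10, span_10 = split_by_threshold(picklist_sizes, threshold=10)
picklist_100, span_100 = split_by_threshold(picklist_sizes, threshold=100)
print(len(picklist_10), len(picklist_100))
```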

Table 5 presents the results of the different variations on the validation set. Treating all slots as span-based slots (DST-Span) gives the lowest accuracy. Deciding the slot types by a threshold is helpful, but it performs worse than choosing span-based slots and picklist-based slots by human heuristics. For example, Threshold-100 has more picklist-based slots (23), but its performance is worse than that of DS-DST, which has fewer picklist-based slots (21) chosen by human heuristics.

6 Open Discussion

With the recent release of MultiWOZ 2.0 [budzianowski2018multiwoz] and its later update MultiWOZ 2.1 [eric2019multiwoz], multi-domain dialog state tracking is enjoying popularity in enhancing task-oriented dialog systems, handling tasks across different domains and supporting a large number of services. However, a potential problem is that no standard way of handling the MultiWOZ dataset is available; the test set is usually modified by different pre-processing procedures. For example, [wu2019transferable] fixes general label errors in the dataset (https://github.com/jasonwu0731/trade-dst/blob/master/utils/fix_label.py), and [lee2019sumbt] treats all the slot-value labels in the dataset that do not appear in the ontology as none (see convert_to_glue_format.py under https://github.com/SKTBrain/SUMBT/blob/master/data/multiwoz/original.zip). Moreover, around 40% of the turns in the MultiWOZ 2.0 dataset have annotation errors as reported in [eric2019multiwoz], and more than 30% of the turns in the MultiWOZ 2.1 dataset do according to our statistics; hence even slight modifications of the test set make a difference. To make fair comparisons and facilitate research on multi-domain dialog state tracking in task-oriented dialog systems, it is important to have a standard way of handling the test set and the evaluation process.

7 Conclusion

In this work, we present the DS-DST model for multi-domain dialog state tracking, which treats domain-slot pairs as span-based slots and picklist-based slots based on human heuristics. Our strategy is consistent with real scenarios and flexible across real applications and datasets. DS-DST mitigates the issues of previous work and achieves state-of-the-art results on the MultiWOZ 2.1 dataset.

References

Appendix A

A.1 Different Representations of the Domain-Slot Pairs

Table 6 presents the three ways of representing 30 domain-slot pairs over five domains (train, restaurant, hotel, taxi, attraction) on the MultiWOZ 2.1 dataset. The other two domains (police, hospital) only appear in the training set with very few dialogues (5% of total dialogues), and we do not show them here. The slots of (arrive by, leave at, book stay, book people, book time) belong to the span-based slots (nine in total), and the others belong to the picklist-based slots (21 in total). ‘Slot Description’ and ‘Question Asking’ are only used in Sec 5.1, and there could be better ways to design them.

Concatenation | Slot Description | Question Asking
hotel price range | price range of hotel | what is the price range for the hotel?
hotel type | type of hotel | what is the hotel type?
hotel parking | parking of hotel | does the person need parking for the hotel?
hotel book stay | number of days to stay in the hotel | how many days does the person want to stay for the hotel?
hotel book day | date to book the hotel | which day does the person want to book for the hotel?
hotel book people | number of people to book for the hotel | how many people does the person want to book for the hotel?
hotel area | area of hotel | which area does the person want to select for the hotel?
hotel stars | stars of hotel | what star does the person need for the hotel?
hotel internet | internet of hotel | does the person need internet for the hotel?
train destination | destination of the train | where is the destination of the train?
train day | date to book for the train | which day does the person want to book for the train?
train departure | departure place of the train | where is the departure of the train?
train arrive by | arrival time of the train | what time will the train arrive by?
train book people | number of people to book for the train | how many people does the person want to book for the train?
train leave at | time when the train leave | what time will the train leave at?
attraction area | area of attraction | which area does the person want to select for the attraction?
restaurant food | food of restaurant | what type of the food does the person need for the restaurant?
restaurant price range | price range of the restaurant | what is the price range for the restaurant?
restaurant area | area of restaurant | which area does the person want to select for the restaurant?
attraction name | name of attraction | what is the attraction name?
restaurant name | name of restaurant | what is the restaurant name?
attraction type | type of attraction | what is the attraction type?
hotel name | name of hotel | what is the hotel name?
taxi leave at | time when the taxi leave | what time will the taxi leave at?
taxi destination | destination of the taxi | where is the destination of the taxi?
taxi departure | departure place of taxi | where is the departure of the taxi?
restaurant book time | time to book for the restaurant | what time does the person want to book for the restaurant?
restaurant book day | date to book for the restaurant | which day does the person want to book for the restaurant?
restaurant book people | number of people to book for the restaurant | how many people does the person want to book for the restaurant?
taxi arrive by | arrival time of the taxi | what time will the taxi arrive by?
Table 6: Different representations of the domain-slot pairs of the MultiWOZ 2.1 dataset. ‘Concatenation’ concatenates the domain and slot names directly, and it is used as one of the default settings in the paper. ‘Slot Description’ describes each domain-slot pair with a short phrase. ‘Question Asking’ represents the domain-slot pair as a question.
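As a minimal sketch of the split described in A.1, the snippet below builds the ‘Concatenation’ representation for the 30 domain-slot pairs of Table 6 and partitions them by slot name. The identifiers (DOMAIN_SLOTS, SPAN_SLOT_NAMES, partition_slots) are hypothetical names chosen for illustration, not the authors' code; only the slot lists and the grouping rule come from the text above.

```python
from typing import Dict, List, Tuple

# Domain-slot pairs as listed in Table 6 (30 in total).
DOMAIN_SLOTS: Dict[str, List[str]] = {
    "hotel": ["price range", "type", "parking", "book stay", "book day",
              "book people", "area", "stars", "internet", "name"],
    "train": ["destination", "day", "departure", "arrive by", "book people", "leave at"],
    "attraction": ["area", "name", "type"],
    "restaurant": ["food", "price range", "area", "name", "book time",
                   "book day", "book people"],
    "taxi": ["leave at", "destination", "departure", "arrive by"],
}

# Slot names that A.1 assigns to the span-based group.
SPAN_SLOT_NAMES = {"arrive by", "leave at", "book stay", "book people", "book time"}


def concatenation(domain: str, slot: str) -> str:
    """'Concatenation' representation: domain and slot joined directly."""
    return f"{domain} {slot}"


def partition_slots(domain_slots: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Split domain-slot pairs into (span-based, picklist-based) by slot name."""
    span: List[str] = []
    picklist: List[str] = []
    for domain, slots in domain_slots.items():
        for slot in slots:
            target = span if slot in SPAN_SLOT_NAMES else picklist
            target.append(concatenation(domain, slot))
    return span, picklist


span_slots, picklist_slots = partition_slots(DOMAIN_SLOTS)
print(len(span_slots), len(picklist_slots))
```

Keying the rule on the slot name alone reproduces the counts stated in A.1 (nine span-based and 21 picklist-based pairs); a different heuristic would only require changing SPAN_SLOT_NAMES.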

A.2 Sample Output

We present the outputs of DST-Span and DS-DST in all the turns of two sample dialogues (MUL0729, PMUL2428) on the validation set of the MultiWOZ 2.1 dataset. Table 7 and Table 8 show the results for MUL0729 and PMUL2428, respectively. In Table 7, hotel type and hotel internet are predicted incorrectly by DST-Span; the value yes of hotel internet is expressed differently, as free wifi, in the dialog context. For values of this type, DST-Span cannot find the spans directly in the dialog context. In Table 8, DST-Span does not predict the state <taxi, departure, funky fun house> at turn 6.

Turn 1
User: i am looking for a train from ely to cambridge . do you have such trains ?
Ground Truths: <train, destination, cambridge>, <train, departure, ely>
Predicted Dialog State (DST-Span): <train, destination, cambridge>, <train, departure, ely>
Predicted Dialog State (DS-DST): <train, destination, cambridge>, <train, departure, ely>
Turn 2
System: there are 70 trains do you have a date and time you would like to go ?
User: i would like to leave on a sunday and arrive by 11:30 .
Ground Truths: <train, arrive by, 11:30>, <train, destination, cambridge>, <train, departure, ely>, <train, day, sunday>
Predicted Dialog State (DST-Span): <train, arrive by, 11:30>, <train, destination, cambridge>, <train, departure, ely>, <train, day, sunday>
Predicted Dialog State (DS-DST): <train, arrive by, 11:30>, <train, destination, cambridge>, <train, departure, ely>, <train, day, sunday>
Turn 3
System: tr4212 is arriving at 9:52 would that work for you ?
User: what time does that train depart ?
Ground Truths: <train, arrive by, 11:30>, <train, destination, cambridge>, <train, departure, ely>, <train, day, sunday>
Predicted Dialog State (DST-Span): <train, arrive by, 11:30>, <train, destination, cambridge>, <train, departure, ely>, <train, day, sunday>
Predicted Dialog State (DS-DST): <train, arrive by, 11:30>, <train, destination, cambridge>, <train, departure, ely>, <train, day, sunday>
Turn 4
System: the train leaves at 9:35 . shall i book it for you ?
User: no , thank you . i would like to see if you could find a place for me to stay that would have a 4 star rating .
Ground Truths: <train, day, sunday>, <hotel, stars, 4>, <train, destination, cambridge>, <train, arrive by, 11:30>, <train, departure, ely>
Predicted Dialog State (DST-Span): <train, day, sunday>, <hotel, stars, 4>, <train, destination, cambridge>, <train, arrive by, 11:30>, <train, departure, ely>
Predicted Dialog State (DS-DST): <train, day, sunday>, <hotel, stars, 4>, <train, destination, cambridge>, <train, arrive by, 11:30>, <train, departure, ely>
Turn 5
System: of course , there are many 4 star hotel -s to choose from . would you like me to book 1 for you ?
User: not yet , thanks . are any of them guesthouses ?
Ground Truths: <train, day, sunday>, <hotel, stars, 4>, <train, destination, cambridge>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, type, guest house>
Predicted Dialog State (DST-Span): <train, day, sunday>, <hotel, stars, 4>, <train, destination, cambridge>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, type, ;>
Predicted Dialog State (DS-DST): <train, day, sunday>, <hotel, stars, 4>, <train, destination, cambridge>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, type, guest house>
Turn 6
System: there are 18 guesthouses to choose from , do you have a preference to the area you would like to stay ?
User: i need a 4 star , and in the east with free wifi for 4 people , 5 nights . i’ll need a reference number .
Ground Truths: <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, type, guest house>
Predicted Dialog State (DST-Span): <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, ;>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, type, ;>
Predicted Dialog State (DS-DST): <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, type, guest house>
Turn 7
System: do you want that guesthouse reservation to begin on sunday ?
User: yes . i need 5 nights starting on sunday .
Ground Truths: <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <hotel, type, guest house>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, book day, sunday>
Predicted Dialog State (DST-Span): <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <hotel, type, ;>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, book day, sunday>
Predicted Dialog State (DS-DST): <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <hotel, type, guest house>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, book day, sunday>
Turn 8
System: i have confirmed your reservation at allenbell starting on sunday for 5 nights . your reference number is sltivabu . can i help with anything else ?
User: that is all i need . thank you so much for all your help .
Ground Truths: <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <hotel, type, guest house>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, book day, sunday>
Predicted Dialog State (DST-Span): <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <hotel, type, ;>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, book day, sunday>
Predicted Dialog State (DS-DST): <train, day, sunday>, <hotel, book stay, 5>, <hotel, book people, 4>, <hotel, stars, 4>, <train, destination, cambridge>, <hotel, internet, yes>, <hotel, type, guest house>, <train, arrive by, 11:30>, <train, departure, ely>, <hotel, area, east>, <hotel, book day, sunday>
Table 7: Predicted dialog states of DST-Span and DS-DST for dialogue ID MUL0729 with domains (train, hotel) on the MultiWOZ 2.1 dataset.
Turn 1
User: i am planning a trip to go to a particular restaurant , can you assist ?
Ground Truths:
Predicted Dialog State (DST-Span):
Predicted Dialog State (DS-DST):
Turn 2
System: sure , what is the name of this particular restaurant ?
User: it s called nandos and i would like to book it for monday at 15:00 . there will be 6 people .
Ground Truths: <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book time, 15:00>, <restaurant, book people, 6>
Predicted Dialog State (DST-Span): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book time, 15:00>, <restaurant, book people, 6>
Predicted Dialog State (DS-DST): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book time, 15:00>, <restaurant, book people, 6>
Turn 3
System: no problem ! i have your table reserved for 15:00 on monday . they will hold your table for 15 minutes , your reference number is hvb51vam .
User: thank you . am also looking for place -s to go in town . the attraction should be in the east
Ground Truths: <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>
Predicted Dialog State (DST-Span): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>
Predicted Dialog State (DS-DST): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>
Turn 4
System: what are you looking to do ? my system has located 10 place -s .
User: could you recommend something ?
Ground Truths: <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>
Predicted Dialog State (DST-Span): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>
Predicted Dialog State (DS-DST): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>
Turn 5
System: yes , the funky fun house is a great place .
User: great ! can you tell me what the entrance fee is please ?
Ground Truths: <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <attraction, name, funky fun house>
Predicted Dialog State (DST-Span): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <attraction, name, funky fun house>
Predicted Dialog State (DS-DST): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <attraction, name, funky fun house>
Turn 6
System: no , i am sorry . you will have to call them for the entrance fee . the phone number is 01223304705 .
User: thank you . i am also looking to book a taxi to travel between the 2 . i need it to arrive to the restaurant by the reservation time .
Ground Truths: <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <taxi, arrive by, 15:00>, <attraction, name, funky fun house>, <taxi, destination, nandos>, <taxi, departure, funky fun house>
Predicted Dialog State (DST-Span): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <taxi, arrive by, 15:00>, <attraction, name, funky fun house>, <taxi, destination, nandos>
Predicted Dialog State (DS-DST): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <taxi, arrive by, 15:00>, <attraction, name, funky fun house>, <taxi, destination, nandos>, <taxi, departure, funky fun house>
Turn 7
System: your taxi is booked and will be a white audi . the contact number is 07057575130 . how else may i help you ?
User: that s all . thank you for your help !
Ground Truths: <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <taxi, arrive by, 15:00>, <attraction, name, funky fun house>, <taxi, destination, nandos>, <taxi, departure, funky fun house>
Predicted Dialog State (DST-Span): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <taxi, arrive by, 15:00>, <attraction, name, funky fun house>, <taxi, destination, nandos>, <taxi, departure, funky fun house>
Predicted Dialog State (DS-DST): <restaurant, book day, monday>, <restaurant, name, nandos>, <restaurant, book people, 6>, <attraction, area, east>, <restaurant, book time, 15:00>, <taxi, arrive by, 15:00>, <attraction, name, funky fun house>, <taxi, destination, nandos>, <taxi, departure, funky fun house>
Table 8: Predicted dialog states of DST-Span and DS-DST for dialogue ID PMUL2428 with domains (taxi, attraction, restaurant) on the MultiWOZ 2.1 dataset.
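The turn-by-turn comparison in Tables 7 and 8 can be sketched as a set difference over <domain, slot, value> triples. The helper name diff_states is a hypothetical choice for illustration and not part of any released code; the triples are copied from turn 6 of Table 8.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (domain, slot, value)


def diff_states(gold: Set[Triple], pred: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (missing, spurious): gold triples not predicted, and predicted triples not in gold."""
    return gold - pred, pred - gold


# Turn 6 of dialogue PMUL2428 (Table 8).
gold_turn6: Set[Triple] = {
    ("restaurant", "book day", "monday"),
    ("restaurant", "name", "nandos"),
    ("restaurant", "book people", "6"),
    ("attraction", "area", "east"),
    ("restaurant", "book time", "15:00"),
    ("taxi", "arrive by", "15:00"),
    ("attraction", "name", "funky fun house"),
    ("taxi", "destination", "nandos"),
    ("taxi", "departure", "funky fun house"),
}
# DST-Span output at the same turn: the taxi departure triple is absent.
span_turn6 = gold_turn6 - {("taxi", "departure", "funky fun house")}

missing, spurious = diff_states(gold_turn6, span_turn6)
print(missing, spurious)
```

A turn counts toward joint accuracy only when both sets are empty, so the single missing triple here is enough to mark turn 6 as incorrect for DST-Span.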