Review Conversational Reading Comprehension

  • 2019-11-06 17:25:40
  • Hu Xu, Bing Liu, Lei Shu, Philip S. Yu


Inspired by conversational reading comprehension (CRC), this paper studies a novel task of leveraging reviews as a source to build an agent that can answer multi-turn questions from potential consumers of online businesses. We first build a review CRC dataset and then propose a novel task-aware pre-tuning step running between language model (e.g., BERT) pre-training and domain-specific fine-tuning. The proposed pre-tuning requires no data annotation, but can greatly enhance the performance on our end task. Experimental results show that the proposed approach is highly effective and performs competitively with the supervised approach. The dataset is available at \url{}



Hu Xu1, Bing Liu1, Lei Shu1    Philip S. Yu1,2
1Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA
2Institute for Data Science, Tsinghua University, Beijing, China
{hxu48, liub, lshu3, psyu}



1 Introduction

Seeking information to assess whether a product or service suits one’s needs is an important activity in consumer decision making. One major hindrance for online businesses is that consumers often have difficulty getting answers to their questions. With the ever-changing environment, it is very hard, if not impossible, for businesses to pre-compile an up-to-date knowledge base to answer user questions as in KB-QA (kwok2001scaling; fader2014open; yin2015neural; xu2016question). Although community question-answering (CQA) helps (mcauley2016addressing), one has to be lucky to get an existing customer to answer a question quickly. There is work on retrieving whole reviews relevant to a question (mcauley2016addressing; yu2018aware), but it is not ideal for the user to read whole reviews to fish for answers.

Table 1: An example of RCRC (best viewed in colors): a dialogue with 5 turns of customers’ questions and answer spans from a review.
A Laptop Review:
I purchased my Macbook Pro Retina from my school since I
had a student discount , but I would gladly purchase it from
Amazon for full price again if I had too . The Retina is great
, its amazingly fast when it boots up because of the SSD
storage and the clarity of the screen is amazing as well…
Turns of Questions from a Customer:
q1: how is retina display ?
q2: speed of booting up ?
q3: why ?
q4: what ’s the capacity of that ? (NO ANSWER)
q5: is the screen clear ?

Inspired by conversational reading comprehension (CRC) (reddy2018coqa; choi2018quac; xu-etal-2019-bert), we explore the possibility of turning reviews into a valuable source of knowledge of real-world experiences and using it to answer multi-turn questions from customers or users. We call this Review Conversational Reading Comprehension (RCRC). The conversational setting enables the user to go into details via more specific questions and to simplify their questions by either omitting or co-referencing information in the previous context. As shown in Table 1, the user first has an opinion question about “retina display” (an aspect) of a laptop. Then he/she carries (or omits) the question type opinion from the first question to the second question about another aspect, “boot-up speed”. Later, he/she carries the aspect of the second question, but changes the question type to opinion reason, and then co-references the aspect “SSD” from the third answer and asks for the capacity (a sub-aspect) of “SSD”. Unfortunately, there is no answer in this review. Finally, the customer asks about another aspect in the fifth question. RCRC is defined as follows.

RCRC Definition: Given a review that consists of a sequence of $n$ tokens $d=(d_1,\dots,d_n)$, a history of the past $k-1$ questions and answers as the context $C=(q_1,a_1,q_2,a_2,\dots,q_{k-1},a_{k-1})$, and the current question $q_k$, find a sequence of tokens (a textual span) $a=(d_s,\dots,d_e)$ in $d$ that answers $q_k$ based on $C$, where $1 \le s \le n$ and $s \le e \le n$, or return NO ANSWER ($s=e=0$) if the review does not contain the answer for $q_k$.
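The definition can be made concrete with a small data-structure sketch. The names `RCRCExample`, `NO_ANSWER`, and `extract_answer` are hypothetical and chosen only for illustration; the only conventions taken from the definition are the 1-indexed inclusive span and the use of $s=e=0$ for NO ANSWER.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical constant mirroring the definition's s = e = 0 convention.
NO_ANSWER = (0, 0)


@dataclass
class RCRCExample:
    review: List[str]   # d = (d_1, ..., d_n), already tokenized
    question: str       # current question q_k
    # context C: past (question, answer) pairs, oldest first
    context: List[Tuple[str, str]] = field(default_factory=list)


def extract_answer(example: RCRCExample, s: int, e: int) -> Optional[List[str]]:
    """Return the span d_s..d_e (1-indexed, inclusive), or None for NO ANSWER."""
    if (s, e) == NO_ANSWER:
        return None
    n = len(example.review)
    if not (1 <= s <= e <= n):
        raise ValueError(f"invalid span ({s}, {e}) for a review of {n} tokens")
    # Shift the 1-indexed start to Python's 0-indexed slicing.
    return example.review[s - 1:e]


example = RCRCExample(
    review="The Retina is great".split(),
    question="how is retina display ?",
)
print(extract_answer(example, 4, 4))   # ['great']
print(extract_answer(example, 0, 0))   # None
```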

Note that although RCRC focuses on one review, it can potentially be deployed on the setting of multiple reviews (e.g., all reviews for a product), where the context $C$ may contain answers from different reviews. To the best of our knowledge, there are no existing review datasets for RCRC. We first build a dataset called (RC)2 based on laptop and restaurant reviews from SemEval 2016 Task 5 (we choose this dataset to better align with existing research in sentiment analysis). Given the wide spectrum of domains in online businesses and the prohibitive cost of annotation, (RC)2 has limited training data, as in many other tasks of sentiment analysis.

As a result, the challenge is how to effectively improve the performance of RCRC. We adopt BERT (devlin2018bert) as our base model since it can be either a feature encoder or a standalone model that achieves good performance on CRC (reddy2018coqa). BERT provides task-agnostic features, which require a task-specific architecture and many supervised training examples to train (fine-tune) on an end task. As (RC)2 has limited training data, we propose a novel task-aware pre-tuning step to further bridge the gap between BERT pre-training and RCRC task-awareness. Pre-tuning requires no annotation of CRC (or RCRC) data, only QA pairs (from CQA) and reviews, both of which are largely available online. The data are general and can potentially be used in other machine reading comprehension tasks. Experimental results show that the proposed approach achieves competitive performance even compared with the supervised approach using a large-scale annotated dataset.

2 BERT and Proposed Pre-tuning

2.1 BERT

BERT is one of the key innovations (peters2018deep; howard2018universal; radford2018improving; devlin2018bert) of unsupervised learning, with the idea of learning contextualized representations that were previously only learned from supervised data. It has two novel pre-training objectives that greatly improve the learned representations: masked language model (MLM) and next sentence prediction (NSP). The former predicts a randomly masked word and the latter tries to detect whether two sides of a text are from the same document or not. A training example is formulated as $(\text{[CLS]}, x_{1:j}, \text{[SEP]}, x_{j+1:n}, \text{[SEP]})$, where [CLS] is a special token, $x_{1:n}$ is a document (with randomly masked words) split into two sides $x_{1:j}$ and $x_{j+1:n}$, and [SEP] separates those two. We leverage BERT’s architecture and focus on how to design a textual format to bring task-awareness of RCRC into a (pre-tuned) model.

2.2 Textual Format

Inspired by the DrQA system (reddy2018coqa), we formulate an input example $x$ for both RCRC fine-tuning and pre-tuning (we share the same notation for both tasks for brevity) as a composition of the context $C$, the current question $q_k$, and a review $d$:

$$x = (\text{[CLS]}\ \text{[Q]}\ q_1\ \text{[A]}\ a_1\ \dots\ \text{[Q]}\ q_{k-1}\ \text{[A]}\ a_{k-1}\ \text{[Q]}\ q_k\ \text{[SEP]}\ d_{1:n}\ \text{[SEP]}),$$

where past QA pairs $q_1,a_1,\dots,q_{k-1},a_{k-1}$ in $C$ are concatenated and separated by two tokens [Q] and [A], and then concatenated with the current question $q_k$ to form the left side of BERT; the right side is the review document. One can observe that BERT lacks a basic understanding of the RCRC task regarding both the input and output, such as the above input format and textual spans in a review. The limited training data of (RC)2 may not be sufficient to learn such a complex input and output. We propose a pre-tuning stage that can enhance the understanding of the input/output before fine-tuning on (RC)2.
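The input layout described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `build_input` is a hypothetical helper, and whitespace splitting stands in for BERT's WordPiece tokenizer. Length limits and truncation are not handled.

```python
from typing import List, Sequence, Tuple


def build_input(
    context: Sequence[Tuple[str, str]],
    question: str,
    review_tokens: Sequence[str],
) -> List[str]:
    """Assemble [CLS] [Q] q1 [A] a1 ... [Q] qk [SEP] d_1..d_n [SEP]."""
    tokens = ["[CLS]"]
    # Left side: past QA turns, each marked by [Q] and [A].
    for past_question, past_answer in context:
        tokens += ["[Q]"] + past_question.split()
        tokens += ["[A]"] + past_answer.split()
    # Current question closes the left side.
    tokens += ["[Q]"] + question.split() + ["[SEP]"]
    # Right side: the review document.
    tokens += list(review_tokens) + ["[SEP]"]
    return tokens


x = build_input(
    context=[("how is retina display ?", "great")],
    question="speed of booting up ?",
    review_tokens="its amazingly fast".split(),
)
print(" ".join(x))
```

Keeping the whole dialogue history on the left side means the model sees the review as a single uninterrupted segment, so answer positions on the right side are simple offsets from the first [SEP].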

2.3 Data Formulation for Pre-tuning

We first formulate the data for pre-tuning that aims to address the understanding of the textual format. As we have no annotated data except the limited (RC)2 data, we harvest domain QA pairs and reviews (that are largely available online, see Section 3), which are typically organized under an entity (a laptop or a restaurant). The QA pairs and reviews are combined to produce the pre-tuning examples. The process is given in Algorithm 1.

Input: 𝒬: a set of QA pairs; ℛ: a set of reviews; h_max: maximum turns in context.
Output: 𝒯: pre-tuning data.

 1  𝒯 ← {}
 2  for (q, a) ∈ 𝒬 do
 3      x ← [CLS]
 4      h ← RandInteger([0, h_max])
 5      for 1 → h do
 6          q′′, a′′ ← RandSelect(𝒬 \ (q, a))
 7          x ← x ⊕ [Q] ⊕ q′′ ⊕ [A] ⊕ a′′
 8      end
 9      x ← x ⊕ [Q] ⊕ q ⊕ [SEP]
10      r_{1:m} ← RandSelect(ℛ)
11      if RandFloat([0.0, 1.0]) > 0.5 then
12          (_, a′) ← RandSelect(𝒬 \ (q, a))
13          (u, v) ← (1, 1)
14      end
15      else
16          a′ ← a
17          (u, v) ← (|x|, |x| + |a′|)
18      end
19      l ← RandInteger([0, m])
20      d_{1:n} ← r_{0:l} ⊕ a′ ⊕ r_{l+1:m}
21      if u > 1 then
22          (u, v) ← (u + |r_{0:l}|, v + |r_{0:l}|)
23      end
24      x ← x ⊕ d_{1:n} ⊕ [SEP]
25      𝒯 ← 𝒯 + (x, (u, v))
26  end

Algorithm 1: Data Generation Algorithm

The inputs to Algorithm 1 are a set of QA pairs and a set of reviews belonging to the same entity, and the maximum number of turns in the context. The output is the pre-tuning data, which is initialized in Line 1. Each example is denoted as $(x,(u,v))$, where $x$ is the input example and $(u,v)$ indicates the boundary (starting and ending indexes) of an answer for the auxiliary objective (discussed in Section 2.4). Given a QA pair $(q,a)$ in Line 2, we first build the left side of input example $x$ in Lines 3-9. After initializing input $x$ in Line 3, we randomly determine the number of turns in the context in Line 4 and concatenate these turns of QA pairs in Lines 5-8, where $\mathcal{Q} \setminus (q,a)$ ensures the current QA pair $(q,a)$ is not chosen. In Line 9, we concatenate the current question $q$. Lines 10-23 build the right side of input example $x$ and the answer boundary. In Line 10, we randomly draw a review $r$ with $m$ sentences. To challenge the pre-tuning stage to discover the semantic relatedness between $q$ and $a$ (for the auxiliary objective), we first decide whether the right side of $x$ contains the true answer $a$ for $q$ (Line 16) or a random (negative/no) answer $a'$ (Lines 11-12). We also keep two indexes $u$ and $v$, initialized in Lines 13 and 17. Then, we insert the chosen answer into review $r$ by randomly picking one of the $m+1$ locations in Lines 19-20. This gives us $d_{1:n}$, which has $n$ tokens. If the true answer was inserted, we further update $u$ and $v$ so that they point to the chunk boundaries of the answer. Otherwise, BERT should detect no answer on the right side and point to [CLS] ($u=v=1$). Finally, examples are aggregated in Line 25. Algorithm 1 is run $k$ times to allow for enough samplings. Following BERT, we still randomly mask some words in each example $x$; this step is omitted here for brevity.
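The procedure just described can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function name `generate_pretuning_data` and the data layout (token lists for questions and answers, a review as a list of tokenized sentences) are assumptions. The sketch also assumes at least two QA pairs and non-empty answers, omits word masking, and, unlike the 1-indexed notation in the text, returns 0-indexed inclusive spans with `(0, 0)` pointing at [CLS] for the no-answer case.

```python
import random
from typing import List, Sequence, Tuple

QA = Tuple[List[str], List[str]]      # (question tokens, answer tokens)
Review = List[List[str]]              # a review as tokenized sentences
Example = Tuple[List[str], Tuple[int, int]]


def generate_pretuning_data(
    qa_pairs: Sequence[QA],
    reviews: Sequence[Review],
    h_max: int,
    rng: random.Random,
) -> List[Example]:
    data: List[Example] = []
    for i, (q, a) in enumerate(qa_pairs):
        # All QA pairs except the current one (Q \ (q, a)).
        others = list(qa_pairs[:i]) + list(qa_pairs[i + 1:])
        x = ["[CLS]"]
        # Left side: a random number of random context turns.
        for _ in range(rng.randint(0, h_max)):   # randint is inclusive
            ctx_q, ctx_a = rng.choice(others)
            x += ["[Q]"] + ctx_q + ["[A]"] + ctx_a
        x += ["[Q]"] + q + ["[SEP]"]
        # Right side: a random review with an answer inserted.
        review = rng.choice(reviews)
        negative = rng.random() > 0.5
        answer = rng.choice(others)[1] if negative else a
        slot = rng.randint(0, len(review))       # one of m + 1 locations
        prefix = [tok for sent in review[:slot] for tok in sent]
        suffix = [tok for sent in review[slot:] for tok in sent]
        if negative:
            span = (0, 0)                        # point at [CLS]
        else:
            start = len(x) + len(prefix)
            span = (start, start + len(answer) - 1)
        x += prefix + answer + suffix + ["[SEP]"]
        data.append((x, span))
    return data
```

Calling the function `k` times with the same inputs corresponds to running the algorithm `k` times: each call yields one example per QA pair, with fresh random contexts, reviews, and insertion points.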

2.4 Auxiliary Objective

Besides the input, we further adapt BERT to the output of RCRC with an auxiliary objective. This objective is designed to mimic the prediction of a textual span in RCRC: it predicts the token span of an answer randomly inserted in the review, or NO ANSWER if a randomly drawn negative answer appears instead. The implementation of both the auxiliary objective and the RCRC model is similar to BERT for SQuAD 2.0 (rajpurkar2018know), so we omit them for brevity. After pre-tuning, we fine-tune using the (RC)2 dataset to show the performance of RCRC.
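Since the implementation is omitted, the following is only a rough sketch of how span prediction with a NO ANSWER option is commonly decoded in SQuAD 2.0 style models: every position receives a start score and an end score, and the best span is compared against the score of pointing at [CLS]. The function `best_span` and its parameters `doc_start` and `max_answer_len` are hypothetical, and nothing here is taken from the authors' code.

```python
from typing import Sequence, Tuple


def best_span(
    start_scores: Sequence[float],
    end_scores: Sequence[float],
    doc_start: int = 1,
    max_answer_len: int = 30,
) -> Tuple[int, int]:
    """Pick the highest-scoring span (s, e) with s <= e, or (0, 0) for NO ANSWER.

    Position 0 is assumed to be [CLS]; candidate spans start at `doc_start`,
    which a caller would set to the first review token.
    """
    n = len(start_scores)
    null_score = start_scores[0] + end_scores[0]
    best_score, best = float("-inf"), (0, 0)
    for s in range(doc_start, n):
        # Limit span length so decoding stays cheap.
        for e in range(s, min(n, s + max_answer_len)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    # Fall back to [CLS] when no span beats the null score.
    return best if best_score > null_score else (0, 0)


print(best_span([0.1, 0.2, 2.0, 0.3], [0.1, 0.1, 0.5, 1.5]))  # (2, 3)
print(best_span([5.0, 0.0, 0.0], [5.0, 0.0, 0.0]))            # (0, 0)
```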

3 Experiments

We adopt SemEval 2016 Task 5 as the review source for RCRC (to be consistent with research in sentiment analysis), which contains two domains: laptop and restaurant. We kept the split of training and testing, and annotated dialogues on each review. Similar to the way the CoQA data was annotated (reddy2018coqa), we first let both annotators read a review; then one annotator asks questions and the other annotates the answer token spans (or no answer). The annotated data is in the format of CoQA (reddy2018coqa) to help future research, but we do not focus on generative annotation as in CoQA because businesses are sensitive to errors of generative models. To ensure questions are real-world questions, annotators are first asked to read hundreds of community questions and answers (CQA) from real customers. Since the number of testing reviews is small, we encourage annotators to produce as many dialogues as possible. Each training review is encouraged to have 1 or 2 dialogues. The statistics of the annotated (RC)2 dataset are shown in Table 2. We use 20% of the training reviews as the validation set for each domain.

Table 2: Statistics of (RC)2 Datasets.
Training                        Laptop    Restaurant
# of reviews                    445       350
# of dialogues                  506       382
# of dialogues w/ 3+ turns      375       315
# of questions                  1679      1486
% of no answers                 24.3%     24.2%
Testing                         Laptop    Restaurant
# of reviews                    79        90
# of dialogues                  170       160
# of dialogues w/ 3+ turns      148       135
# of questions                  804       803
% of no answers                 26.6%     28.0%

For the proposed pre-tuning, we collect QA pairs and reviews for these two domains. For laptop, we collect the reviews from (he2016ups) and QA pairs from (Xu2018pro) both under the laptop category of We exclude products in the test data of (RC)2. This gives us 113,728 laptop reviews and 19,104 QA pairs. For restaurant, we crawl reviews and all QA pairs from the top 60 restaurants in each U.S. city from This ends with 197,333 restaurant reviews and 49,587 QA pairs. Based on the number of QAs, Algorithm 1 is run k=10 times for laptop and k=5 times for restaurant.

To compare with the performance of a fully-supervised approach, we leverage the CoQA dataset with 7,199 documents (covering domains in Children’s Story, Mid/High School Literature, News, Wikipedia, etc.) and 108,647 turns of question/answer span annotated via crowdsourcing.

We compare the following methods:
  • DrQA is a CRC baseline that comes with the CoQA dataset.
  • DrQA+CoQA is the above baseline pre-tuned on the CoQA dataset and then fine-tuned on (RC)2, to show that even DrQA pre-trained on CoQA is sub-optimal.
  • BERT is the pre-trained BERT weights directly fine-tuned on (RC)2, serving as an ablation study on the effectiveness of pre-tuning. We choose BERT$_{\text{BASE}}$ as we cannot fit BERT$_{\text{LARGE}}$ into memory.
  • BERT+review first tunes BERT on domain reviews using the same objectives as BERT pre-training and then fine-tunes on (RC)2. We use this baseline to show that a simple domain adaptation of BERT is not sufficient.
  • BERT+CoQA first fine-tunes BERT on the supervised CoQA data and then fine-tunes on (RC)2. We use this baseline to show that even compared with using this large-scale supervised data, our pre-tuning is still very competitive.
  • BERT+Pre-tuning is the proposed approach.

We set the maximum length of BERT to 256 with the maximum length of context+question to 96 ($h_{\max}=9$ for Algorithm 1) and the batch size to 16. We perform pre-tuning for 10k steps. CoQA fine-tuning converges in 2 epochs. RCRC fine-tuning is performed for 4 epochs, and most runs converged within 3 epochs. We search the maximum number of turns in context $C$ for RCRC fine-tuning using the validation set, which results in 6 turns for laptop and 5 turns for restaurant. Results are reported as averages of 3 runs. To be consistent, we leverage the same evaluation script as CoQA, which reports turn-level Exact Match (EM) and F1 scores for all turns in all dialogues.
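To make the two metrics concrete, here is a simplified sketch of turn-level EM and token-level F1. It is not the official CoQA evaluation script: the function names are hypothetical, only lowercasing and whitespace tokenization are applied (the official script performs additional answer normalization), and NO ANSWER is represented as an empty string by assumption.

```python
from collections import Counter
from typing import Sequence, Tuple


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the lowercased token sequences are identical, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty (NO ANSWER predicted and expected) counts as a match.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def turn_level_scores(pairs: Sequence[Tuple[str, str]]) -> Tuple[float, float]:
    """Average EM and F1 over all (prediction, gold) turns, as percentages."""
    em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
    f1 = sum(f1_score(p, g) for p, g in pairs) / len(pairs)
    return 100 * em, 100 * f1
```

For example, predicting "amazingly fast" against the gold span "its amazingly fast" gives precision 1 and recall 2/3, hence an F1 of 0.8 and an EM of 0.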

Table 3: RCRC on EM (Exact Match) and F1.
Domain                      Laptop            Rest.
Methods                     EM       F1       EM       F1
DrQA                        28.5     36.6     41.6     50.3
DrQA+CoQA(supervised)       40.4     51.4     47.7     58.5
BERT                        38.57    48.67    46.87    55.07
BERT+review                 34.53    43.83    47.23    53.7
BERT+CoQA(supervised)       47.1     58.9     56.57    67.97
BERT+Pre-tuning             46.0     57.23    54.57    64.43

Result Analysis
As shown in Table 3, BERT+Pre-tuning has significant performance gains over BERT fine-tuned directly on (RC)2, by roughly 7 to 9 points in both EM and F1 across the two domains. BERT is overall better than DrQA. But directly using review documents to adapt BERT does not yield better results, as shown by BERT+review. We suspect the task of RCRC still requires a certain degree of general language understanding on the question side, and BERT+review also suffers from (catastrophic) forgetting (kirkpatrick2017overcoming) of such representations. Further, large-scale annotated CoQA data can boost the performance of both DrQA and BERT. However, our pre-tuning approach still has competitive performance and requires no annotation at all. We examine the errors of BERT+Pre-tuning and find that both the locations of spans and the span boundaries tend to have errors, indicating significant room for improvement.
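The size of the gains can be checked directly from the Table 3 numbers. The short sketch below computes absolute point differences between two methods; the dictionary `TABLE3` simply copies the reported scores, and `absolute_gain` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (EM, F1) per domain, copied from Table 3.
TABLE3: Dict[str, Dict[str, Tuple[float, float]]] = {
    "BERT": {"laptop": (38.57, 48.67), "restaurant": (46.87, 55.07)},
    "BERT+CoQA": {"laptop": (47.1, 58.9), "restaurant": (56.57, 67.97)},
    "BERT+Pre-tuning": {"laptop": (46.0, 57.23), "restaurant": (54.57, 64.43)},
}


def absolute_gain(method: str, baseline: str) -> Dict[str, Tuple[float, float]]:
    """Absolute (EM, F1) point difference of `method` over `baseline` per domain."""
    gains = {}
    for domain, (em, f1) in TABLE3[method].items():
        base_em, base_f1 = TABLE3[baseline][domain]
        gains[domain] = (round(em - base_em, 2), round(f1 - base_f1, 2))
    return gains


print(absolute_gain("BERT+Pre-tuning", "BERT"))
# {'laptop': (7.43, 8.56), 'restaurant': (7.7, 9.36)}
```

The same helper shows how far pre-tuning trails the supervised BERT+CoQA setting, for instance 1.1 EM points on laptop.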

4 Conclusions

In this paper, we aimed to build a novel review-based conversational reading agent using limited annotated data. We proposed a (language-model-like) pre-tuning method that requires no additional annotation, and empirical results showed its effectiveness.