Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports

  • 2019-11-06 18:25:00
  • Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, Curtis P. Langlotz

Abstract

Neural abstractive summarization models are able to generate summaries which have high overlap with human references. However, existing models are not optimized for factual correctness, a critical metric in real-world applications. In this work, we propose to evaluate the factual correctness of a generated summary by fact-checking it against its reference using an information extraction module. We further propose a training strategy which optimizes a neural summarization model with a factual correctness reward via reinforcement learning. We apply the proposed method to the summarization of radiology reports, where factual correctness is a key requirement. On two separate datasets collected from real hospitals, we show via both automatic and human evaluation that the proposed approach substantially improves the factual correctness and overall quality of outputs from a competitive neural summarization system.

 


Optimizing the Factual Correctness of a Summary:
A Study of Summarizing Radiology Reports

Yuhao Zhang1, Derek Merck2, Emily Bao Tsai1,
Christopher D. Manning1, Curtis P. Langlotz1
1Stanford University 2University of Florida
{yuhaozhang, ebtsai, manning, langlotz}@stanford.edu
[email protected]
Abstract

Neural abstractive summarization models are able to generate summaries which have high overlap with human references. However, existing models are not optimized for factual correctness, a critical metric in real-world applications. In this work, we develop a general framework where we evaluate the factual correctness of a generated summary by fact-checking it against its reference using an information extraction module. We further propose a training strategy which optimizes a neural summarization model with a factual correctness reward via reinforcement learning. We apply the proposed method to the summarization of radiology reports, where factual correctness is a key requirement. On two separate datasets collected from real hospitals, we show via both automatic and human evaluation that the proposed approach substantially improves the factual correctness and overall quality of outputs from a competitive neural summarization system.


1 Introduction

Background: radiographic examination of the chest. clinical history: 80 years of age, male, post-op cv surgery. comparison: procedure…
Findings: frontal radiograph of the chest demonstrates repositioning of the right atrial lead possibly into the ivc. otherwise, there is unchanged life-support hardware. a right apical pneumothorax can be seen from the image. moderate right and small left pleural effusions continue. no pulmonary edema is observed. heart size is upper limits of normal.
Human Summary: pneumothorax is seen. bilateral pleural effusions continue.
Summary A (ROUGE-L = 0.77):
no pneumothorax is observed. bilateral pleural effusions continue.
Summary B (ROUGE-L = 0.44):
pneumothorax is observed on radiograph. bilateral pleural effusions continue to be seen.
Figure 1: An example radiology report and summaries with their ROUGE-L scores. Compared to the human-written summary, Summary A has high textual overlap (i.e., ROUGE-L) but makes a factual error; Summary B has a lower ROUGE-L score but is factually correct.

Neural abstractive summarization systems aim at generating sentences which compress a document by preserving the key facts in it (nallapati2016abstractive; see2017get; chen2018fast). These systems have been shown to be useful in many real-world applications. For example, zhang2018radsum have recently shown that customized neural abstractive summarization models are able to generate high-quality radiology summary statements by summarizing textual findings written by radiologists. This task has significant clinical value because its successful application has the potential to accelerate the radiology workflow, reduce repetitive human labor and improve clinical communications (kahn2009toward).

However, while existing abstractive summarization models are optimized to generate summaries that are relevant to the context and highly overlap with human references (paulus2018a), this does not guarantee factually correct summaries, as shown in Figure 1. Therefore, maintaining factual correctness of the generated summaries remains a critical yet unsolved problem. For example, zhang2018radsum showed that about 30% of the outputs from a radiology summarization model contain factual errors or inconsistencies. This has prevented the application of the system, as factual consistency is critically important in this domain to prevent medical errors.

Existing attempts at improving factual correctness of abstractive summarization models have achieved very limited success. For example, cao2017faithful proposed to augment the attention mechanism of neural models with factual triples extracted with open information extraction systems; falke2019ranking studied using natural language inference systems to rerank generated summaries based on their factual consistencies; kryciski2019evaluating proposed to verify factual consistency of generated summaries with a weakly-supervised model. Despite these efforts, even state-of-the-art systems trained with ample data still produce summaries with a substantial number of factual errors (goodrich2019assessing; kryciski2019neural).

In this work we aim to improve factual correctness of existing neural summarization systems, with a focus on summarizing radiology reports. This task has several key properties that make it ideal for studying factual correctness in summarization models. First, clinical facts or observations present in radiology reports have less ambiguity compared to open-domain text, which allows objective comparison of facts. Second, radiology reports involve a relatively limited space of facts, which makes automatic measurement of factual correctness in the generated text approachable. Lastly, because factual correctness is key to the success of a system in this domain, improving it directly improves the system's usability.

To this end, we design a framework where an external information extraction system is used to extract information in the generated summary and produce a factual accuracy score by comparing it against the human reference summary. We further develop a training strategy where we combine a factual correctness objective, a textual overlap objective and a language model objective, and jointly optimize them via self-critical sequence training.

On two datasets of radiology reports collected from real hospitals, we show that our training strategy substantially improves the factual correctness of the summaries generated from a competitive neural summarization system. Interestingly, our experiments also show that even in the absence of a factual correctness objective, optimizing textual overlap substantially improves the factual correctness of the resulting system compared to traditional maximum likelihood training. We further show via human evaluation and analysis that our training strategy leads to summaries that have higher overall quality and correctness and are closer to the human-written ones.

Our main contributions are: (i) we propose a general framework and a training strategy for improving factual correctness of summarization models via reinforcement learning (RL); (ii) we apply the proposed strategy to the summarization of radiology reports, and empirically show that it improves the factual correctness of the generated summaries; (iii) we demonstrate via radiologist evaluation that our system is able to generate summaries with clinical validity and quality close to human-written ones. To our knowledge our work represents the first attempt at directly optimizing a neural summarization system with a factual correctness objective via RL.

2 Related Work

Neural Summarization Systems.

Neural models for text summarization can be broadly divided into extractive approaches (cheng2016neural; nallapati2016summarunner), where a system learns to select sentences from the context to form the summary; and abstractive approaches (chopra2016abstractive; nallapati2016abstractive; see2017get), where a system can generate new words and sentences to form the summary. While these models have traditionally been trained in an end-to-end manner by maximizing the likelihood of the reference summaries, RL has been shown to be useful in recent work (chen2018fast; dong2018banditsum). Specifically, paulus2018a found that directly optimizing an abstractive summarization model on the ROUGE metric via RL can improve the summary quality. Our work extends the ROUGE rewards used in existing work with a factual correctness reward to further improve the correctness of the generated summaries.

Factual Correctness in Summarization.

Our work is also closely related to recent work that studies factual correctness in summarization. cao2017faithful first proposed to improve the faithfulness of neural abstractive summarization models by attending to fact triples extracted from the context using open information extraction systems. goodrich2019assessing compared different information extraction systems to evaluate the factual accuracy of generated text. falke2019ranking studied whether existing natural language inference systems can be used to evaluate the factual correctness of generated summaries, and found models trained on existing datasets to be inadequate for this task. kryciski2019evaluating proposed to evaluate factual consistencies in the generated summaries using a weakly-supervised fact verification model.

Summarization of Radiology Reports.

Traditionally, existing work on summarizing radiology reports has been focused on the extraction of information from the reports (hripcsak2002use; hassanpour2016information). zhang2018radsum first studied the problem of automatic generation of radiology impressions by summarizing radiology findings, and showed that an augmented pointer-generator model is able to generate summaries which have high overlap with human references. macavaney2019ontology extended this model with an ontology-aware pointer-generator and showed improved summarization quality. jing2018automatic and li2019hybrid studied the problem of generating descriptions of radiology findings from medical images. While zhang2018radsum found that about 30% of the radiology summaries generated from neural models contain factual errors, methods to improve factual correctness in radiology summarization remain unstudied.

3 Task & Baseline Pointer-Generator

We start by briefly introducing the task of summarizing radiology findings. Given a passage of radiology findings represented as a sequence of tokens $\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$, with $N$ being the length of the findings, the task involves finding a sequence of tokens $\mathbf{y} = \{y_1, y_2, \ldots, y_L\}$ that best summarizes the salient and clinically significant findings in $\mathbf{x}$. In routine radiology workflow, an output sequence $\mathbf{y}$ is produced by the radiologist, which we treat as a reference summary sequence.[1]

[1] While the name “impression” is often used in clinical settings, we use “summary” and “impression” interchangeably.

To model the summarization process, we use the background-augmented pointer-generator network (zhang2018radsum) as the backbone of our method. This abstractive summarization model extends a pointer-generator model (see2017get) with a separate background section encoder and is shown to be effective in summarizing radiology notes with multiple sections. Here we briefly describe this model and refer readers to the original papers for full details.

At a high level, this model follows the encoder-decoder architecture, and first encodes the input sequence 𝐱 into hidden states with a Bi-directional Long Short-Term Memory (Bi-LSTM) network:

$$\mathbf{h} = \text{Bi-LSTM}(\mathbf{x}) \tag{1}$$

Next, conditioned on $\mathbf{h}$, the output sequence is decoded from an LSTM decoder. Formally, at the $t$-th step, given the previously generated token $y_{t-1}$ and the previous decoder state $s_{t-1}$, the decoder calculates the current state $s_t$ with:

$$s_t = \text{LSTM}(s_{t-1}, y_{t-1}). \tag{2}$$

To make the input information available at decoding time, an attention mechanism (bahdanau2014neural) is added to the decoder. The attention output and st are then used to predict the output word.

The baseline pointer-generator model by zhang2018radsum adds two augmentations to this attentional encoder-decoder model to make it suitable for summarizing radiology findings:

Copy Mechanism.

To enable the model to copy words from the input, a copy mechanism (vinyals2015pointer; see2017get) is added to calculate a generation probability at each step of decoding. This generation probability is then used to blend the original output vocabulary distribution and a copy distribution to generate the next word.
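As a rough illustration of the blending step, the sketch below mixes a vocabulary distribution with a copy distribution built from attention weights, in the style of the pointer-generator of see2017get. The function name, the dictionary representation and the toy numbers are assumptions made for this example; the actual model operates on tensors inside the decoder.

```python
def blend_distributions(p_gen, vocab_dist, attention, source_tokens):
    """Mix a vocabulary distribution with a copy distribution.

    p_gen:         generation probability in [0, 1] (hypothetical value here)
    vocab_dist:    dict mapping word -> probability under the output vocabulary
    attention:     attention weights over source positions (sum to 1)
    source_tokens: the input tokens those attention weights refer to
    """
    # Scale the vocabulary distribution by the generation probability.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Add copy mass: each source token receives (1 - p_gen) * its attention weight.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy example: "effusions" gets mass from both the vocabulary and the copy path.
source = ["small", "pleural", "effusions"]
attention = [0.2, 0.3, 0.5]
vocab_dist = {"no": 0.6, "effusions": 0.4}
final_dist = blend_distributions(0.5, vocab_dist, attention, source)
```

Because both inputs are proper distributions, the blended result also sums to one, and words that appear only in the input (such as "pleural" above) still receive non-zero probability.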

Background-guided Decoding.

As shown in Figure 1, radiology reports often consist of a background section which documents the crucial study background information (e.g., purpose of the study, patient conditions), and a findings section which documents clinical observations. While words can be copied from the findings section to form the summary, zhang2018radsum found it worked better to separately encode the background section and inject the representation into the decoding process. Specifically, the background section is encoded into a vector $\mathbf{b}$ with an attentional LSTM encoder. Then at each step of decoding, $\mathbf{b}$ is concatenated with the input word $y_{t-1}$ to calculate the new state $s_t$ as in Eq. (2).

4 Fact Checking in Summarization

Summarization models such as the one described in Section 3 are commonly trained with the teacher-forcing algorithm (williams1989learning) by maximizing the likelihood of the reference, human-written summaries. However, this training strategy results in a significant discrepancy between what the model sees during training and test time, often referred to as the exposure bias issue (ranzato2015sequence), leading to degenerate output at test time.

An alternative training strategy is to directly optimize standard metrics such as the ROUGE scores (lin2004rouge) with RL, which was shown to improve the quality of the generated summaries (paulus2018a). Nevertheless, this method still provides no guarantee that the generated summary is factually accurate and complete, since the ROUGE scores merely measure the superficial text overlap between two sequences and do not account for the factual alignment between them. To illustrate this, a reference sentence “pneumonia is seen” and a generated sentence “pneumonia is not seen” have substantial text overlap, so the generated sentence would achieve a high ROUGE score even though it conveys the opposite fact. In this section we first introduce a method to verify the factual correctness of the generated summary against the reference summary, and then describe a training strategy to directly optimize a factual correctness objective to improve summary quality.
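The pneumonia example can be checked with a small, simplified ROUGE-L computation. The sketch below assumes whitespace tokenization and a balanced F1 over the longest common subsequence (LCS); standard ROUGE toolkits may differ in tokenization and weighting, so the number is illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: balanced F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# The negated sentence shares all three reference tokens in order.
score = rouge_l_f1("pneumonia is not seen", "pneumonia is seen")
```

Here the LCS has length 3, giving precision 3/4 and recall 3/3, so the score is 6/7 (about 0.86) despite the reversed meaning.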

4.1 Evaluating Factual Correctness via Fact Extraction

A convenient way to explicitly measure the factual correctness of a generated summary against the reference is to first extract and represent the facts in a structured format. To this end, we define a fact extractor to be an information extraction (IE) module, denoted by $f$, which takes in a summary sequence $y$ and returns a structured fact vector $\mathbf{v}$:

$$\mathbf{v} = f(y) = (v_1, \ldots, v_m) \tag{3}$$

where $v_i$ is a variable that we want to measure via fact checking and $m$ is the total number of such variables. For example, in the case of summarizing radiology reports, $v_i$ can be a binary variable that describes whether an event or a disease such as pneumonia is present or not in a radiology study.

Given a fact vector $\mathbf{v}$ output by $f$ from a reference summary and $\hat{\mathbf{v}}$ from a generated summary, we further define a factual accuracy score $s$ to be the ratio of variables in $\hat{\mathbf{v}}$ which equal the corresponding variables in $\mathbf{v}$, namely:

$$s(\hat{\mathbf{v}}, \mathbf{v}) = \frac{\sum_{i=1}^{m} \mathbb{1}[v_i = \hat{v}_i]}{m} \tag{4}$$

where $s \in [0, 1]$. Note that this method requires a summary to be both precise and complete in order to achieve a high $s$ score: missing out a positive variable or falsely claiming a negative variable will be equally penalized.
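The score in Eq. (4) is straightforward to compute once both summaries have been mapped to fact vectors. The following is a minimal sketch that assumes the fact vectors are already available as equal-length sequences of binary values; the function name and the toy vectors are illustrative and not taken from the paper's code.

```python
def factual_accuracy(generated_facts, reference_facts):
    """Fraction of fact variables on which the generated summary
    agrees with the reference summary, as in Eq. (4)."""
    if len(generated_facts) != len(reference_facts):
        raise ValueError("fact vectors must have the same length")
    if not reference_facts:
        raise ValueError("fact vectors must be non-empty")
    matches = sum(
        1 for v, v_hat in zip(reference_facts, generated_facts) if v == v_hat
    )
    return matches / len(reference_facts)


# Hypothetical vectors over four variables (1 = present, 0 = absent).
reference = (1, 0, 1, 0)
generated = (1, 0, 0, 0)  # misses the third (positive) variable
s = factual_accuracy(generated, reference)
```

A missed positive and a falsely claimed positive each cost one mismatch, which reflects the equal penalty described above.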

Our general definition of the fact extractor module $f$ allows it to have different realizations for different domains. For our task of summarizing radiology findings, we make use of the open-source CheXpert radiology report labeler (irvin2019chexpert).[2] At the core, the CheXpert labeler parses the input sentences into dependency structures and runs a series of surface and syntactic rules to extract the presence status of 14 clinical observations seen in chest radiology reports.[3] The labeler was evaluated to have over 95% overall F1 when compared against oracle annotations from multiple radiologists on a large-scale radiology report dataset.

[2] https://github.com/stanfordmlgroup/chexpert-labeler

[3] For this study we used a subset of these variables and discuss the reasons in Appendix A.

4.2 Improving Factual Correctness via Policy Learning

Figure 2: Our proposed training strategy. Compared to existing work which relies only on a ROUGE reward rR, we add a factual correctness reward rC which is enabled by a fact extractor. The summarization model is updated via RL, using a combination of the NLL loss, a ROUGE-based loss and a factual correctness-based loss. For simplicity we only show a subset of the clinical variables in the fact vectors 𝐯 and 𝐯^.

The fact extractor module introduced above not only enables us to measure the factual accuracy of a generated summary, but also provides us with an opportunity to directly optimize the factual accuracy as an objective. This can be achieved by viewing our summarization model as an agent, the actions of which are to generate a sequence of words to form the summary $\hat{y}$, conditioned on the input $x$.[4] The agent then receives rewards $r(\hat{y})$ for its actions, where the rewards can be designed to measure the quality of the generated summary. Our goal is to learn an optimal policy $P_\theta(y|x)$ for the summarization model, parameterized by the network parameters $\theta$, which achieves the highest expected reward under the training data.

[4] For clarity, we drop the bold symbol and use $x$ and $y$ to represent the input and output sequences, respectively.

Formally, we train our summarization model to minimize the loss $\mathcal{L}$, the negative expectation of the reward $r(\hat{y})$ over the training data:

$$\mathcal{L}(\theta) = -\mathbb{E}_{\hat{y} \sim P_\theta(y|x)}\left[r(\hat{y})\right]. \tag{5}$$

According to the REINFORCE algorithm (williams1992simple), the gradient of this loss can be calculated as follows:

$$\nabla_\theta \mathcal{L}(\theta) = -\mathbb{E}_{\hat{y} \sim P_\theta(y|x)}\left[\nabla_\theta \log P_\theta(\hat{y}|x)\, r(\hat{y})\right]. \tag{6}$$

Note that Eq. (6) involves an expectation over all possible sampled sequences $\hat{y}$ from the policy, which is difficult to calculate during training. In practice, we can approximate this gradient over a training example with a single Monte Carlo sample and deduct a baseline reward to reduce the variance of the gradient estimation:

$$\nabla_\theta \mathcal{L}(\theta) \approx -\nabla_\theta \log P_\theta(\hat{y}^s|x)\left(r(\hat{y}^s) - \bar{r}\right), \tag{7}$$

where $\hat{y}^s$ is a sampled sequence from the network and $\bar{r}$ is a baseline reward. Practically there are many strategies for generating the baseline reward, and here we adopt the self-critical training strategy (rennie2017self), where we obtain the baseline reward $\bar{r}$ by applying the same reward function $r$ to a greedily decoded sequence $\hat{y}^g$, i.e., $\bar{r} = r(\hat{y}^g)$. We empirically find that the use of this self-critical baseline reward is key to the successful training of our summarization model.
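To make the self-critical estimate in Eq. (7) concrete, the sketch below computes the scalar surrogate loss for one training example. It assumes the per-token log-probabilities of the sampled sequence and the two rewards are already given; in a real system the log-probabilities would be differentiable tensors and the rewards would be treated as constants, so that differentiating this quantity yields the gradient in Eq. (7). All names and numbers are illustrative.

```python
def self_critical_loss(sample_token_log_probs, sample_reward, greedy_reward):
    """Surrogate loss -(r(sample) - r(greedy)) * log P(sample | x)."""
    # Log-probability of the whole sampled sequence is the sum over its tokens.
    sequence_log_prob = sum(sample_token_log_probs)
    # The greedy reward acts as the baseline that reduces variance.
    advantage = sample_reward - greedy_reward
    return -advantage * sequence_log_prob


# Hypothetical values: the sampled summary scores higher than the greedy one.
token_log_probs = [-0.5, -1.0, -0.5]
loss_better = self_critical_loss(token_log_probs, sample_reward=0.8, greedy_reward=0.6)
# Same sample, but now it scores lower than the greedy summary.
loss_worse = self_critical_loss(token_log_probs, sample_reward=0.4, greedy_reward=0.6)
```

When the sample beats the greedy output the loss is positive, so minimizing it raises the sample's log-probability; when the sample is worse the sign flips and the sample is made less likely.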

4.3 Reward Function

The policy learning strategy in Eq. (7) provides us with the flexibility to optimize arbitrary reward functions. Here we decompose our reward function into two parts:

$$r = \lambda_1 r_R + \lambda_2 r_C, \tag{8}$$

where $r_R \in [0, 1]$ is a ROUGE reward, namely the ROUGE-L score (lin2004rouge) of the predicted sequence $\hat{y}$ against the reference $y$; $r_C \in [0, 1]$ is a correctness reward, namely the factual accuracy $s$ of the predicted sequence against the reference sequence, as in Eq. (4); and $\lambda_1, \lambda_2 \in [0, 1]$ are scalar weights that control the balance between the two.

paulus2018a found that directly optimizing a reward function without the original negative log-likelihood (NLL) objective as used in teacher-forcing can hurt the readability of the generated summaries, and proposed to alleviate this problem by combining the NLL objective with the RL loss. Here we adopt the same strategy, and our final loss during training is:

$$\mathcal{L} = \mathcal{L}_{NLL} + \lambda_1 \mathcal{L}_R + \lambda_2 \mathcal{L}_C. \tag{9}$$

Our overall training strategy is illustrated in Figure 2. Note that our final loss jointly optimizes three aspects of the summaries: $\mathcal{L}_{NLL}$ serves as a conditional language model that optimizes the fluency and relevance of the generated summary, $\mathcal{L}_R$ controls the brevity of the summary and encourages summaries which have high overlap with human references, and $\mathcal{L}_C$ encourages summaries that are factually accurate when compared against human references.
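The weighting in Eq. (8) and Eq. (9) can be sketched as two small helper functions. The weight values used below are placeholders chosen for the example, not the settings used in the experiments, and the individual loss terms are assumed to have been computed elsewhere.

```python
def combined_reward(rouge_reward, correctness_reward, lambda_1, lambda_2):
    """Eq. (8): weighted sum of the ROUGE reward and the correctness reward."""
    return lambda_1 * rouge_reward + lambda_2 * correctness_reward


def total_loss(nll_loss, rouge_rl_loss, correctness_rl_loss, lambda_1, lambda_2):
    """Eq. (9): NLL loss plus the weighted ROUGE and correctness RL losses."""
    return nll_loss + lambda_1 * rouge_rl_loss + lambda_2 * correctness_rl_loss


# Placeholder weights (assumption); setting one weight to zero recovers the
# single-reward variants described in the experiments.
LAMBDA_1, LAMBDA_2 = 0.5, 0.5

reward = combined_reward(0.5, 1.0, LAMBDA_1, LAMBDA_2)
loss = total_loss(2.0, 0.4, -0.2, LAMBDA_1, LAMBDA_2)
rouge_only = total_loss(2.0, 0.4, -0.2, LAMBDA_1, 0.0)
```

Keeping the NLL term unweighted, as in Eq. (9), means the RL terms act as adjustments on top of the standard teacher-forcing objective.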

5 Experiments

We collected two real-world radiology report datasets and used them as our main training and evaluation corpora. We now describe how they were collected and the details of our experiments.

5.1 Data Collection

We collected all chest radiographic reports within a certain period of time from two hospitals: the Stanford University Hospital and the Rhode Island Hospital (RIH).

For both datasets, we ran simple preprocessing following zhang2018radsum. All reports were first tokenized with Stanford CoreNLP (manning2014stanford). We then filtered the datasets by excluding reports where (1) no findings or impression (i.e., summary) section can be found; (2) multiple findings or impression sections can be found but cannot be aligned; or (3) the findings have fewer than 10 words or the impression has fewer than 2 words. Lastly, we replaced all date and time mentions with special tokens (e.g., <DATE>).
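The length-based part of this filtering can be sketched as follows. This is a simplified illustration that assumes each report has already been split into a single findings string and a single impression string, and that counts words by whitespace rather than with CoreNLP tokens; it does not cover the alignment check in rule (2).

```python
MIN_FINDINGS_WORDS = 10   # findings with fewer words are excluded
MIN_IMPRESSION_WORDS = 2  # impressions with fewer words are excluded


def keep_report(findings, impression):
    """Return True if a report passes the missing-section and length filters."""
    # Rule (1): both sections must be present.
    if not findings or not impression:
        return False
    # Rule (3): minimum lengths, counted here by whitespace-separated words.
    if len(findings.split()) < MIN_FINDINGS_WORDS:
        return False
    if len(impression.split()) < MIN_IMPRESSION_WORDS:
        return False
    return True


# Hypothetical examples.
long_findings = "the lungs are clear without focal consolidation pleural effusion or pneumothorax"
kept = keep_report(long_findings, "no acute abnormality")
dropped_short = keep_report("lungs are clear", "no acute abnormality")
dropped_missing = keep_report(long_findings, "")
```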

To test the generalizability of the models, instead of using random stratification, we stratified each dataset over time into training, dev and test splits. We include statistics of both datasets in Table 1 and stratification details in Appendix B.

| Split | Stanford (number of examples) | RIH (number of examples) |
|---|---|---|
| Train | 89,992 (68.8%) | 84,194 (60.3%) |
| Dev | 22,031 (16.8%) | 25,966 (18.6%) |
| Test | 18,827 (14.4%) | 29,494 (21.1%) |
| Total | 130,850 | 139,654 |

Table 1: Statistics of the Stanford and RIH datasets.

| System | Stanford R-1 | Stanford R-2 | Stanford R-L | Stanford F1 | RIH R-1 | RIH R-2 | RIH R-L | RIH F1 |
|---|---|---|---|---|---|---|---|---|
| LexRank (erkan2004lexrank) | 26.8 | 16.3 | 23.6 | - | 20.6 | 10.7 | 18.3 | - |
| BanditSum (dong2018banditsum) | 32.7 | 20.9 | 29.0 | - | 26.1 | 14.0 | 23.3 | - |
| PG Baseline (zhang2018radsum) | 48.3 | 38.8 | 46.6 | 55.9 | 54.1 | 44.7 | 52.2 | 69.3 |
| PG + RL_R | 52.0 | 41.1 | 49.5 | 63.2 | 58.0 | 47.2 | 55.7 | 73.3 |
| PG + RL_C | 50.7 | 39.7 | 48.0 | 65.9 | 55.2 | 45.4 | 52.9 | 75.4 |
| PG + RL_{R+C} | 52.0 | 41.0 | 49.3 | 64.5 | 57.0 | 46.6 | 54.7 | 74.8 |

Table 2: Main results on the Stanford and the RIH datasets. R-1, R-2, R-L represent the ROUGE scores and F1 represents the factual F1 score. PG Baseline represents our baseline augmented pointer-generator model; RL_R, RL_C and RL_{R+C} represent RL training with the ROUGE reward alone, with the factual correctness reward alone and with both, respectively. All the ROUGE scores have a 95% confidence interval of at most ±0.60. F1 scores for extractive models were not evaluated for the reason discussed in Section 5.3.

5.2 Models

As we use the augmented pointer-generator network described in Section 3 as the backbone of our method, we mainly compare against it as the baseline model (PG Baseline), and use the open implementation by zhang2018radsum.

For the proposed RL-based training, we compare three variants: training with only the ROUGE reward (RL_R), with only the factual correctness reward (RL_C), or with both (RL_{R+C}). All three variants include the NLL component in the training loss as in Eq. (9). For all variants, we initialize the model with the best baseline model trained with standard teacher-forcing, and then fine-tune it on the training data with the corresponding RL loss until it reaches the best validation score.

To understand the difficulty of the task and evaluate the necessity of using abstractive summarization models, we additionally evaluate two extractive summarization methods: (1) LexRank (erkan2004lexrank), a widely-used non-neural graph-based extractive summarization algorithm; and (2) BanditSum (dong2018banditsum), an RL-based neural extractive summarization model which achieves state-of-the-art results on the CNN/Daily Mail dataset (hermann2015teaching). For both of them we use open implementations.

We include other model implementation and training details in Appendix C.

5.3 Evaluation

We use two sets of metrics to evaluate model performance at the corpus level. First, we use the standard ROUGE scores (lin2004rouge), and report the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L.

The second metric is a factual F1 score. While the factual accuracy score $s$ that we use in the reward function evaluates how factually accurate a specific summary is, comparing it at the corpus level can be misleading. To understand this, imagine the case where a clinical variable is rarely present in the corpus. A model which always generates a negative summary for it (i.e., the disease is not present) can have high accuracy, but is useless in practice. Instead, for each variable we obtain a model’s predictions over all test examples and calculate an F1 score for this variable. We then macro-average the F1 scores of all variables to obtain the overall factual F1 score of the model.
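The difference between corpus-level accuracy and the macro-averaged factual F1 can be seen in a small sketch. It assumes binary gold and predicted labels per variable (1 = present) and defines F1 as zero when there are no true positives; the variable names and label lists are made up for illustration.

```python
def f1_score(gold, pred):
    """Binary F1 for one variable over all test examples."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention assumed here when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_factual_f1(gold_by_variable, pred_by_variable):
    """Unweighted mean of the per-variable F1 scores."""
    scores = [
        f1_score(gold_by_variable[name], pred_by_variable[name])
        for name in gold_by_variable
    ]
    return sum(scores) / len(scores)


# A rare variable that the model always predicts as absent: accuracy is 0.75,
# yet its F1 is 0 because the single positive case is missed.
gold = {"pneumothorax": [0, 0, 0, 1], "pleural_effusion": [1, 1, 0, 0]}
pred = {"pneumothorax": [0, 0, 0, 0], "pleural_effusion": [1, 0, 0, 0]}
macro_f1 = macro_factual_f1(gold, pred)
```

Macro-averaging gives every variable equal weight, so a model cannot score well by ignoring rare observations.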

Note that the CheXpert labeler that we use is specifically designed to run on radiology summaries, which usually have a different style of language compared to the radiology findings section of the reports (see further analysis in Section 7). As a result, we found the labeler to be less accurate when applied to the findings section. For this reason, we were not able to calculate the factual F1 scores on the summaries generated by the two extractive summarization models.

6 Results

We first present our main results on the two collected datasets. We then present a human evaluation with board-certified radiologists in which we compare the summaries written by humans with those generated by the baseline model and our proposed model.

6.1 Main Results

Our main results on the Stanford dataset and the RIH dataset are shown in Table 2. We first notice that while the neural extractive summarization model, BanditSum, outperforms the non-neural extractive method on ROUGE scores, the pointer-generator baseline substantially outperforms both of them, suggesting that on both datasets abstractive summarization is necessary to generate summaries comparable to human-written ones.

On the Stanford dataset, training the pointer-generator model with the ROUGE reward alone (RL_R) improves all ROUGE scores, with a gain of 2.9 points on ROUGE-L. Training with the factual correctness reward alone (RL_C) yields the best overall factual F1, a substantial gain of 10% absolute over the baseline, but with consistently lower ROUGE scores than RL_R training. Combining the ROUGE and the factual correctness rewards (RL_{R+C}) achieves a balance between the two, with an overall improvement of 2.7 on ROUGE-L and 8.6% on factual F1 compared to the baseline. This indicates that RL_{R+C} training leads to both higher overlap with references and improved factual correctness.

Surprisingly, while ROUGE has been criticized for its poor correlation with human judgment of quality and insufficiency for evaluating correctness of the generated text (novikova2017metric; chaganty2018price), we find that optimizing the ROUGE reward jointly with NLL leads to substantially more factually correct summaries. This is shown by the notable gain of 7.3% factual F1 from RL_R training.

All of our findings are consistent on the RIH dataset, with RL_{R+C} achieving an overall improvement of 2.5 on ROUGE-L and 5.5% on factual F1.

| Variable | PG Baseline | RL_{R+C} | Δ |
|---|---|---|---|
| No Finding | 77.3 | 81.5 | +4.2* |
| Cardiomegaly | 29.5 | 40.4 | +10.9* |
| Airspace Opacity | 64.6 | 74.9 | +10.3* |
| Edema | 58.4 | 70.9 | +12.5* |
| Consolidation | 46.3 | 53.2 | +6.9* |
| Pneumonia | 46.7 | 46.8 | +0.2 |
| Atelectasis | 48.8 | 56.3 | +7.5* |
| Pneumothorax | 69.5 | 82.9 | +13.4* |
| Pleural Effusion | 62.0 | 73.4 | +11.4* |
| Macro Avg. | 55.9 | 64.5 | +8.6* |

Table 3: Test set factual F1 scores for each variable on the Stanford dataset. * marks statistically significant improvements with p<.01 under a bootstrap test.
Stanford Dataset
Background: radiographic examination of the chest: <date> <time> am. clinical history: <age> years of age, female, wheezing, sob. comparison: <date> at <time>. procedure comments : two views of the chest…
Findings: continuous rhythm monitoring device again seen projecting over the left heart. persistent low lung volumes with unchanged cardiomegaly. again seen is a diffuse reticular pattern with interstitial prominence demonstrated represent underlying emphysematous changes with superimposed increasing moderate pulmonary edema. small bilateral pleural effusions. persistent bibasilar opacities left greater than right which may represent infection versus atelectasis.
Human: increased moderate pulmonary edema with small bilateral pleural effusions. left greater than right basilar opacities which may represent infection versus atelectasis.
PG Baseline (s=0.33): no significant interval change.
RL_{R+C} (s=1.00): increasing moderate pulmonary edema. small bilateral pleural effusions. persistent bibasilar opacities left greater than right which may represent infection versus atelectasis.
RIH Dataset
Background: history: lobar pneumonia, unspecified organism; pneumonia of right middle lobe due to infectious organism. technique: frontal and lateral views of the chest…
Findings: lines/tubes: none. lungs: right middle lobe airspace disease seen on prior radiographs from <date> and <date> is no longer evident. bilateral lungs appear clear. pleura: there is no pleural effusion or pneumothorax. heart and mediastinum: no cardiomegaly. thoracic aorta appears calcified and mildly tortuous. bones: multilevel degenerative changes are seen throughout the thoracic spine. no wedge compression fractures are seen.
Human: no acute cardiopulmonary abnormality.
PG Baseline (s=0.75): right middle lobe airspace disease could represent atelectasis, aspiration or pneumonia.
RL_{R+C} (s=1.00): no acute cardiopulmonary abnormality.
Figure 3: Example reports and system predictions from the Stanford and RIH test splits. Human reference, PG baseline output and RL_{R+C} output are shown for each example. Factual accuracy scores (s) are also shown for the model outputs. For the Stanford example, clinical observations in the summaries are marked for clarity; for RIH, a wrongly copied observation and its occurrence in the findings are marked.

Fine-grained Correctness.

To understand how improvements in individual variables contribute to the overall improvement, we show the fine-grained factual F1 scores for all variables on the Stanford dataset in Table 3 and include results on the RIH dataset in Appendix D. We find that on both datasets, improvements in RL_{R+C} can be observed on all variables tested. We further find that, as we change the initialization across different training runs, while the overall improvement on factual F1 stays approximately unchanged, the distribution of the improvement on different variables can vary substantially. Developing a training strategy for fine-grained control over different variables will be an interesting direction for future work.

Qualitative Results.

We present two example reports along with the human reference summaries, the PG baseline outputs and RL_{R+C} model outputs in Figure 3. In the first example, while the summary from the baseline model seems generic and does not include any meaningful observation, the summary from the RL_{R+C} model aligns well with the human reference, and therefore achieves a higher factual accuracy score. In the second example, the baseline model wrongly copied an observation from the findings although the actual context is “no longer evident”, while the RL_{R+C} model correctly recognizes this and produces a better summary.

Metric Win Tie Lose
Our Model vs. PG Baseline
Fluency 7% 60% 33%
Factual Correctness 31% 55% 14%
Overall Quality 48% 24% 28%
Our Model vs. Human Reference
Fluency 17% 54% 29%
Factual Correctness 23% 49% 28%
Overall Quality 44% 17% 39%
Table 4: Results of the radiologist evaluation. The top three rows present results when comparing our RLR+C model output versus the baseline model output; the bottom three rows present results when comparing our model output versus the human-written summaries.

6.2 Human Evaluation

To study whether the improvements in the factual correctness scores lead to improvement in summarization quality under expert judgment, we run a comparative human evaluation following previous work (chen2018fast; dong2018banditsum; zhang2018radsum). We sampled 50 test examples from the Stanford dataset, and for each example we presented to two board-certified radiologists the full radiology findings along with blinded summaries from (1) the human reference, (2) the PG baseline and (3) our RLR+C model. We shuffled the three summaries such that the correspondence cannot be guessed, and asked the radiologists to compare them based on the following three metrics: (1) fluency, (2) factual correctness and completeness, and (3) overall quality. For each metric we asked the radiologists to rank the three summaries, with ties allowed. After the evaluation, we converted each ranking into two binary comparisons: (1) our model versus the baseline model, and (2) our model versus human reference.
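The conversion from rankings with ties to win/tie/lose percentages can be sketched as follows. The data layout (one dictionary of ranks per judgment, lower rank meaning better) and all names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def compare(ranking: Dict[str, int], ours: str, other: str) -> str:
    """Outcome for `ours` against `other`; lower rank is better, equal ranks tie."""
    if ranking[ours] < ranking[other]:
        return "win"
    if ranking[ours] > ranking[other]:
        return "lose"
    return "tie"


def tally(rankings: List[Dict[str, int]], ours: str, other: str) -> Dict[str, float]:
    """Percentage of wins, ties and losses over all rankings."""
    counts = Counter(compare(r, ours, other) for r in rankings)
    total = len(rankings)
    return {key: 100.0 * counts[key] / total for key in ("win", "tie", "lose")}


# Hypothetical rankings for one metric, one entry per (example, radiologist).
rankings = [
    {"human": 1, "baseline": 2, "ours": 1},
    {"human": 2, "baseline": 3, "ours": 1},
    {"human": 1, "baseline": 1, "ours": 2},
    {"human": 1, "baseline": 2, "ours": 2},
]
print(tally(rankings, "ours", "baseline"))
print(tally(rankings, "ours", "human"))
```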

The results are shown in Table 4. Comparing our model against the baseline model, we find that: (1) in terms of fluency our model is less preferred, although a majority of the results (60%) are ties; (2) our model wins more on factual correctness and overall quality. Comparing our model against human references, we find that: (1) human wins more on fluency; (2) factual correctness results are close, with 72% of our model outputs being at least as good as human; (3) surprisingly, in terms of overall quality our model was preferred more by the radiologists than human references.

7 Analysis & Discussion

Fluency and Style of Summaries.

Our human evaluation results in Section 6.2 suggest that in terms of fluency our model output is less preferred than both the human reference and the baseline model output. To further understand the fluency and style of generations from different models at a larger scale, we trained a neural language model (LM) for radiology summaries following previous work in summarization (liu2018generating). Intuitively, radiology summaries which are more fluent and more consistent with human-written summaries in style should achieve a lower perplexity under this in-domain LM, and vice versa. To this end, we collected all human-written summaries from the training and dev sets of the Stanford dataset and the RIH dataset, which in total gives us about 222k summaries. We then trained a strong Mixture of Softmaxes LM (yang2017breaking) on this corpus, and evaluated the perplexity of test set outputs for all models.
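For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes that the LM returns a natural-log probability for every token and that the average is taken over all tokens in the evaluated set; whether the reported numbers are aggregated in exactly this way is an assumption, and the function name is hypothetical.

```python
import math
from typing import List


def perplexity(token_log_probs: List[List[float]]) -> float:
    """Corpus-level perplexity from per-token natural-log probabilities.

    `token_log_probs` holds one list of log-probabilities per summary.
    """
    total_log_prob = sum(sum(summary) for summary in token_log_probs)
    total_tokens = sum(len(summary) for summary in token_log_probs)
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(-total_log_prob / total_tokens)


# Toy check: if every token has probability 1/4, the perplexity is 4.
toy = [[math.log(0.25)] * 3, [math.log(0.25)] * 2]
print(perplexity(toy))
```

A lower value means the LM finds the summaries more predictable, which is why generic, frequently repeated sentences can score below human-written text.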

The results are shown in Table 5. We find that while extractive models are able to generate summaries that have non-trivial overlap with human references, their perplexity scores tend to be much higher than those of human references. We conjecture that this is because radiologists are trained to write summaries in more compressed language than they use for the findings; sentences directly extracted from the findings therefore tend to be more verbose than needed.

System Stanford pplx. RIH pplx.
Human 6.7 5.5
LexRank 10.8 36.9
BanditSum 9.9 40.9
PG Baseline 4.8 3.8
PG + RLR+C 6.5 4.8
Table 5: Perplexity scores as evaluated by the trained radiology impression LM on the test set human references and model predictions.

We further observe that our baseline model trained with teacher-forcing achieves even lower perplexity than human references, and that the model trained with our proposed method has a perplexity score much closer to that of human references. We hypothesize that this is because models trained with teacher-forcing are prone to generic generations (therefore also leading to lower factual correctness), and training with the proposed rewards alleviates this issue, leading to summaries more consistent with human-written ones in style. For example, we find that “no significant interval change” is a very frequent generation from the baseline model, regardless of the actual findings in the input. On the Stanford dev set, this sentence shows up in 34% of the summaries generated by the baseline, while the numbers for RLR+C and human references are only 24% and 17%, respectively. This hypothesis is further supported when we plot the distribution of the top 10 most frequent trigrams from different models in Figure 4: while the output from the baseline model heavily reuses the few most frequent trigrams, our RLR+C model tends to produce more diverse summaries which are closer to human references. The same trends are observed for 4-grams and 5-grams.

[Bar chart. x-axis: top 10 trigrams (most frequent on the left); y-axis: ratio in outputs (%).]

Trigram rank Human RLR+C PG Baseline
1 1.21 2.78 3.60
2 1.17 2.71 3.60
3 1.16 1.70 2.26
4 1.10 1.69 2.25
5 0.45 0.54 0.66
6 0.42 0.50 0.49
7 0.36 0.47 0.47
8 0.36 0.43 0.43
9 0.35 0.40 0.42
10 0.35 0.39 0.41

Figure 4: Distributions of the top 10 most frequent trigrams from model outputs on the Stanford test set.
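A distribution of this kind can be computed with a simple n-gram count. The sketch below assumes whitespace tokenization and defines the ratio as the share of one trigram among all trigram occurrences in a system's outputs; both choices, as well as the function name, are assumptions for illustration.

```python
from collections import Counter
from typing import List, Tuple


def top_ngram_ratios(summaries: List[str], n: int = 3,
                     k: int = 10) -> List[Tuple[Tuple[str, ...], float]]:
    """Top-k n-grams with their share (%) of all n-gram occurrences."""
    counts: Counter = Counter()
    for summary in summaries:
        tokens = summary.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(gram, 100.0 * count / total) for gram, count in counts.most_common(k)]


# Toy outputs (hypothetical).
outputs = [
    "no significant interval change",
    "no significant interval change",
    "no acute cardiopulmonary abnormality",
]
for gram, ratio in top_ngram_ratios(outputs, n=3, k=2):
    print(" ".join(gram), f"{ratio:.1f}%")
```

Setting `n` to 4 or 5 gives the corresponding 4-gram and 5-gram distributions.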

Limitations.

While we showed the success of our proposed method on improving the factual correctness of a radiology summarization model, we also recognize several limitations of our work. First, our proposed training strategy relies on an external IE module. While this IE module is relatively easy to implement for a domain with a limited space of facts, how to generalize this method to open-domain summarization remains unsolved. Second, our study was based on a rule-based IE system, and the use of a more robust statistical IE model can potentially improve the results. Third, we mainly focus on key factual errors which will result in a flip of the binary outcome of an event (e.g., presence of disease), whereas factual errors in generated summaries can occur in other forms such as wrong adjectives or coreference errors (kryciski2019neural). We leave the study of these problems to future work.

8 Conclusion

In this work we presented a general framework and a training strategy to improve the factual correctness of neural abstractive summarization models. We applied this approach to the summarization of radiology reports, and showed its success via both automatic and human evaluation on two separate datasets collected from real hospitals.

Our general takeaways include: (1) in a domain with a limited space of facts such as the radiology reports, a carefully implemented IE system can be used to improve the factual correctness of neural summarization models via RL; (2) even in the absence of a reliable IE system, optimizing the ROUGE metrics via RL can substantially improve the factual correctness of the generated summaries.

We hope that our work could draw the community’s attention to the factual correctness issue of abstractive summarization models and inspire future work in this direction.

Acknowledgments

We thank Peng Qi and Urvashi Khandelwal for their helpful suggestions, and Dr. Jonathan Movson for obtaining the RIH data.

References

Appendix A Clinical Variables Inclusion Criteria

While the CheXpert labeler that we use is able to extract status for 14 clinical variables, we found that several variables are very rarely represented in our corpora and therefore using all of them makes the calculation of the factual F1 score very unstable. For example, we found that training the same model using different random initializations would result in widely varying F1 scores for these variables. For this reason, for both datasets we removed from the factual F1 calculation all variables which have fewer than 3% positive occurrences on the validation set. We further removed the variables “Pleural Other” and “Support Devices” due to their ambiguity. This process results in a total of 9 variables for the Stanford dataset and 8 for the RIH dataset.
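The inclusion criteria amount to a simple filter over validation-set labels. The sketch below assumes binary labels per variable (1 for positive) and uses hypothetical function names and toy data.

```python
from typing import Dict, FrozenSet, List

# Variables excluded regardless of frequency, as stated in the text.
AMBIGUOUS: FrozenSet[str] = frozenset({"Pleural Other", "Support Devices"})


def select_variables(dev_labels: Dict[str, List[int]],
                     min_positive_rate: float = 0.03,
                     excluded: FrozenSet[str] = AMBIGUOUS) -> List[str]:
    """Keep variables with at least `min_positive_rate` positives on the dev set."""
    kept = []
    for variable, labels in dev_labels.items():
        if variable in excluded or not labels:
            continue
        positive_rate = sum(1 for label in labels if label == 1) / len(labels)
        if positive_rate >= min_positive_rate:
            kept.append(variable)
    return kept


# Toy dev-set labels (hypothetical): 100 summaries per variable.
toy_dev = {
    "Edema": [1] * 10 + [0] * 90,            # 10% positive, kept
    "Fracture": [1] * 2 + [0] * 98,          # 2% positive, dropped
    "Support Devices": [1] * 30 + [0] * 70,  # frequent but excluded as ambiguous
}
print(select_variables(toy_dev))
```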

Additionally, apart from the positive and negative status, the CheXpert labeler is also able to generate an uncertain status for a variable, capturing observations with uncertainty, such as in the sentence “pneumonia is likely represented”. While we can modify the factual accuracy score to take uncertainty into account, for simplicity in this work we do not make the distinction between a positive status and an uncertain status.

Appendix B Dataset Stratification

For both the Stanford and the RIH datasets, we stratified them over time into training, dev and test splits. We show the time coverage of each split in Table 6.

Time Coverage
Split Stanford RIH
Train 2009/01 - 2014/04 2017/11 - 2018/06
Dev 2014/05 - 2014/08 2018/07 - 2018/09
Test 2014/09 - 2014/12 2018/10 - 2018/12
Table 6: Time coverage of different splits in the Stanford and RIH datasets.
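A time-based split of this kind can be sketched as a lookup over month ranges. The ranges below are the Stanford ones from Table 6; treating both endpoints as inclusive calendar months is an assumption, and the function name is hypothetical.

```python
from datetime import date
from typing import Dict, Optional, Tuple

YearMonth = Tuple[int, int]

# Stanford split boundaries from Table 6 as (year, month) pairs.
STANFORD_SPLITS: Dict[str, Tuple[YearMonth, YearMonth]] = {
    "train": ((2009, 1), (2014, 4)),
    "dev": ((2014, 5), (2014, 8)),
    "test": ((2014, 9), (2014, 12)),
}


def assign_split(report_date: date,
                 splits: Dict[str, Tuple[YearMonth, YearMonth]] = STANFORD_SPLITS
                 ) -> Optional[str]:
    """Return the split whose month range contains the report date, else None."""
    key = (report_date.year, report_date.month)
    for name, (start, end) in splits.items():
        if start <= key <= end:
            return name
    return None


print(assign_split(date(2012, 6, 15)))   # train
print(assign_split(date(2014, 10, 1)))   # test
```

Splitting by time rather than at random keeps the test reports strictly later than the training reports.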
Variable PG Baseline RLR+C Δ
No Finding 91.0 92.0 +1.0*
Cardiomegaly 21.1 33.8 +12.7*
Airspace Opacity 80.4 83.5 +3.1*
Edema 73.4 80.2 +6.8*
Pneumonia 63.5 69.2 +5.7*
Atelectasis 60.5 66.5 +6.0*
Pneumothorax 89.7 93.2 +3.5*
Pleural Effusion 74.3 79.9 +5.6*
Macro Avg. 69.3 74.8 +5.5*
Table 7: Test set performance for each variable on the RIH dataset. All numbers are F1 scores. * marks statistically significant improvements with p<.01 under a bootstrap test.
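The exact bootstrap procedure behind the significance markers is not spelled out here. The sketch below shows one common variant, a paired bootstrap over test examples, in which the p-value is estimated as the fraction of resamples where the new system fails to beat the baseline. The function names, the resample count and the toy data are assumptions.

```python
import random
from typing import Callable, List


def f1(ref: List[int], pred: List[int]) -> float:
    """Positive-class F1 for one variable."""
    tp = sum(1 for r, p in zip(ref, pred) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(ref, pred) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(ref, pred) if r == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def paired_bootstrap_p_value(refs: List[int], base_preds: List[int],
                             new_preds: List[int],
                             metric: Callable[[List[int], List[int]], float] = f1,
                             n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of resamples in which the new system is not better."""
    rng = random.Random(seed)
    n = len(refs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        delta = (metric(sample_refs, [new_preds[i] for i in idx])
                 - metric(sample_refs, [base_preds[i] for i in idx]))
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy data (hypothetical): the new system is perfect, the baseline never predicts 1.
toy_refs = [1, 0] * 50
p_value = paired_bootstrap_p_value(toy_refs, [0] * 100, toy_refs)
print(p_value, p_value < 0.01)
```

Because F1 is not an average of per-example scores, the metric is recomputed on each resample rather than derived from per-example differences.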

Appendix C Model Implementation and Training Details

For the baseline background-augmented pointer-generator model, we use its open implementation (https://github.com/yuhaozhang/summarize-radiology-findings). We use a 2-layer LSTM as the findings encoder, a 1-layer LSTM as the background encoder, and a 1-layer LSTM as the decoder. For all LSTMs we use a hidden size of 200. For the embedding layer we use 100-dimensional GloVe vectors (pennington2014glove) which we pretrained on about 4 million radiology reports. We apply dropout (srivastava2014dropout) with p=0.5 to the embeddings. At decoding time, we use the standard beam search with a beam size of 5 and a maximum decoding length of 50.
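For convenience, the hyperparameters above can be collected in a single configuration object. The sketch below is only a summary of the stated values; the class and field names are hypothetical and do not correspond to options of the released implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarizerConfig:
    """Model and decoding hyperparameters as stated in the text."""
    findings_encoder_layers: int = 2
    background_encoder_layers: int = 1
    decoder_layers: int = 1
    hidden_size: int = 200          # shared by all LSTMs
    embedding_dim: int = 100        # GloVe vectors pretrained on radiology reports
    embedding_dropout: float = 0.5
    beam_size: int = 5
    max_decode_length: int = 50


config = SummarizerConfig()
print(config)
```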

For the training and finetuning of the models, we use the Adam optimizer (kingma2014adam) with an initial learning rate of 1e-3. We use a batch size of 64 and clip gradients to a maximum norm of 5. During training we evaluate the model on the dev set every 500 steps and decay the learning rate by a factor of 0.5 whenever the validation score does not increase after 2500 steps. Since we want the model outputs to have both high overlap with the human references and high factual correctness, for training we always use the average of the dev ROUGE score and the dev factual F1 score as the stopping criterion. We tune the scalar weights in the loss function on the dev set, and use weights of 0.03, 0.97 and 0.97 for NLL, R and C, respectively.
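The training schedule can be sketched in plain Python as below. Two points are assumptions rather than statements of the actual implementation: the loss is taken to be a weighted sum of the three terms, and "does not increase after 2500 steps" is read as five consecutive dev evaluations (at 500 steps each) without a new best score. All names are hypothetical.

```python
def combined_loss(nll: float, rouge_loss: float, fact_loss: float,
                  w_nll: float = 0.03, w_r: float = 0.97, w_c: float = 0.97) -> float:
    """Weighted sum of the NLL, ROUGE (R) and factual correctness (C) loss terms."""
    return w_nll * nll + w_r * rouge_loss + w_c * fact_loss


def dev_score(rouge: float, factual_f1: float) -> float:
    """Model selection score: average of dev ROUGE and dev factual F1."""
    return (rouge + factual_f1) / 2


class PlateauDecay:
    """Halve the learning rate when the dev score stops improving."""

    def __init__(self, lr: float = 1e-3, factor: float = 0.5,
                 patience_steps: int = 2500, eval_every: int = 500) -> None:
        self.lr = lr
        self.factor = factor
        self.patience_evals = patience_steps // eval_every
        self.best = float("-inf")
        self.bad_evals = 0

    def update(self, score: float) -> float:
        """Record one dev evaluation and return the learning rate to use next."""
        if score > self.best:
            self.best = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience_evals:
                self.lr *= self.factor
                self.bad_evals = 0
        return self.lr


scheduler = PlateauDecay()
scheduler.update(dev_score(0.40, 0.60))   # first evaluation sets the best score
for _ in range(5):                        # five evaluations without improvement
    current_lr = scheduler.update(0.45)
print(current_lr)
```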

For the extractive LexRank and BanditSum models, we use their open implementations (https://github.com/miso-belica/sumy; https://github.com/yuedongP/BanditSum). For the BanditSum extractive summarization model, we use default values for all hyperparameters as in dong2018banditsum. For both models we select the 3 highest-scoring sentences to form the summary.

For ROUGE evaluation, we use the Python ROUGE implementation released by Google Research (https://github.com/google-research/google-research/tree/master/rouge). We empirically find it to provide very close results to the original Perl ROUGE implementation by lin2004rouge.

Appendix D Fine-grained Correctness Results on the RIH Dataset

We show the fine-grained factual F1 scores for all variables on the RIH dataset in Table 7.