MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs
Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient’s thorax, but requiring specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. However, a key challenge in the development of these techniques is the lack of sufficient data. Here we describe MIMIC-CXR-JPG v2.0.0, a large dataset of 377,110 chest x-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 - 2016. Images are provided with 14 labels derived from two natural language processing tools applied to the corresponding free-text radiology reports. MIMIC-CXR-JPG is derived entirely from the MIMIC-CXR database, and aims to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. All images have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in medical computer vision.
Alistair E. W. Johnson1,
Tom J. Pollard1,
Nathaniel R. Greenbaum3,
Matthew P. Lungren4,
Yifan Peng6, Zhiyong Lu6, Roger G. Mark1, Seth J. Berkowitz2, Steven Horng3
1 Institute of Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
2 Department of Radiology, Beth Israel Deaconess Medical Center, Boston, MA, USA
3 Department of Emergency Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
4 Department of Radiology, Stanford University, Palo Alto, CA, USA
5 Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
6 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, USA
Correspondence to: [email protected]
Keywords: healthcare, radiology, computer vision, natural language processing
1 Introduction
Chest radiography is a common imaging modality used to assess the thorax and the most common medical imaging study in the world. Chest radiographs are used to identify acute and chronic cardiopulmonary conditions, to verify that devices such as pacemakers, central lines, and tubes are correctly positioned, and to assist in related medical workups. In the U.S., the number of radiologists as a percentage of the physician workforce is decreasing, and the geographic distribution of radiologists favors larger, more urban counties. Delays and backlogs in timely medical imaging interpretation have demonstrably reduced care quality in such large health organizations as the U.K. National Health Service and the U.S. Department of Veterans Affairs. The situation is even worse in resource-poor areas, where radiology services are extremely scarce. As of 2015, only 11 radiologists served the 12 million people of Rwanda, while the entire country of Liberia, with a population of four million, had only two practicing radiologists. Accurate automated analysis of radiographs has the potential to improve the efficiency of radiologist workflow and extend expertise to under-served regions.
The combination of burgeoning datasets with increasingly sophisticated algorithms has resulted in a number of significant advances in other application areas of computer vision [5, 19]. A key requirement in the application of these advances to automated chest radiograph analysis is sufficient data. Over time, progressively larger databases have been made available. The Japanese Society of Radiological Technology (JSRT) Database contains 247 images with labels of chest nodules as confirmed by subsequent computed tomography (CT). Notably, the dataset is provided with annotations segmenting the lungs and heart. The Open-I Indiana University Chest X-ray dataset contains 8,121 images associated with 3,996 de-identified radiology reports. More recently, the NIH released ChestX-ray14 (originally ChestX-ray8), a collection of 112,120 frontal chest radiographs from 30,805 distinct patients with 14 binary labels indicating the presence or absence of pathology.
The MIMIC-CXR database aimed to galvanize research around automated analysis of chest radiographs. The chest radiographs in MIMIC-CXR are published in DICOM format, which is commonly used in clinical practice. DICOM is a well-defined binary file format which stores a large amount of metadata alongside the pixel values of the image. Unfortunately, due to the complexity of the application domain (radiology), the DICOM file format can be difficult to comprehend, creating an undesirable barrier for those traditionally outside of the medical domain. Outside of radiology, digital images tend to be stored using one of a number of more common general purpose formats. One particularly common format, JPEG, achieves significant savings in image storage size using a lossy compression algorithm. While the loss of information is undesirable, the benefits of a reduced image storage size are many, and so the JPEG format remains popular among computer vision researchers.
The primary goal of the MIMIC-CXR-JPG database is to provide a standard reference for JPEG images derived from the DICOM files. As DICOMs contain higher pixel depth than can be perceived by the human eye, a design decision must be made in converting the 16-bit depth raw images into 8-bit depth images in JPEG format. Furthermore, a number of image pixel normalization strategies are employed in computer vision, and providing the most common approach as a reference database saves researchers time and makes it easier to compare derivative works. Images are also provided with one or more labels derived from the corresponding free-text radiology report using open source labelers [12, 6]. While other researchers can derive structured labels from the free-text radiology reports in MIMIC-CXR, providing labels here ensures their derivation is consistent.
MIMIC-CXR-JPG is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. Randomly generated identifiers are used to group distinct reports and patients.
2 Chest radiographs
Chest radiographs were sourced from the hospital picture archiving and communication system (PACS) in Digital Imaging and Communications in Medicine (DICOM) format. All studies for patients admitted to the emergency department between 2011 and 2016 were queried. Images were linked to corresponding radiology reports using the hospital’s radiology information system.
Images sometimes contain “burned in” annotations: areas where pixel values have been modified after image acquisition in order to display text. Annotations contain relevant information including image orientation, anatomical position of the subject, and timestamp of image capture. The resulting image, with annotations encoded within the pixels themselves, is then transferred from the modality to PACS. Since the annotations are applied at the modality, it is impossible to recover the original image without annotations. As all patient PHI must be removed to satisfy HIPAA Safe Harbor, images were de-identified using a custom algorithm which removed dates and patient identifiers, but retained radiologically relevant information such as orientation. The algorithm applied an ensemble of image preprocessing and optical character recognition approaches to detect text within an image. Text was identified by its significant contrast with the background and by its consistent pixel value within an image. Suspected PHI was removed by setting all pixel values in a bounding box encompassing the PHI to black.
Subsequent to de-identification, we manually reviewed 6,900 radiographs for PHI. Each image was reviewed by two independent annotators. 180 images were identified for a secondary consensus review, none of which ultimately had PHI. The most common causes for annotators to request consensus review were: (1) existence of a support device such as a pacemaker, (2) text identifying in-hospital location (e.g. “MICU”), and (3) obscure text relating to radiograph technique (e.g. “prt rr slot 11”).
After de-identification, images were exported in the JPEG standard format. First, the image pixels were extracted from the DICOM file using the pydicom library. Pixel values were normalized to the range [0, 255] by subtracting the lowest value in the image, dividing by the highest value in the shifted image (which maps the data to the unit interval), scaling to 8-bit depth (0-255), truncating values, and converting the result to an unsigned integer. The DICOM field PhotometricInterpretation was used to determine whether the pixel values were inverted, and if necessary image intensity values were inverted such that air in the image appears white (highest pixel value), while the outside of the patient’s body appears black (lowest pixel value), ensuring the image transitions from dark to bright as pixel value increases. The OpenCV library was then used to histogram equalize the image with the intention of enhancing contrast. Histogram equalization involves shifting pixel values towards 0 or towards 255 such that all pixel values 0 through 255 have approximately equal frequency. Images were then written out as compressed JPEG files using OpenCV with a quality factor of 95.
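The normalization and equalization steps above can be illustrated with a small dependency-free sketch. It is not the authors' code: the source pipeline used pydicom and OpenCV, whereas the functions below (normalize_to_uint8 and equalize_histogram, both hypothetical names) work on plain nested lists of integer pixel values, the inverted flag stands in for the PhotometricInterpretation check, and the equalization is the textbook cumulative-histogram form, which may differ in detail from OpenCV's implementation.

```python
def normalize_to_uint8(pixels, inverted=False):
    """Min-max scale a 2D list of integer pixel values to integers in [0, 255]."""
    flat = [v for row in pixels for v in row]
    lo = min(flat)
    span = max(flat) - lo  # highest value of the shifted image
    out = []
    for row in pixels:
        new_row = []
        for v in row:
            # Shift, scale to 0-255, and truncate via integer division.
            scaled = (v - lo) * 255 // span if span else 0
            # Flip intensities when the source stores them inverted.
            new_row.append(255 - scaled if inverted else scaled)
        out.append(new_row)
    return out


def equalize_histogram(image):
    """Textbook CDF-based histogram equalization of an 8-bit 2D list."""
    flat = [v for row in image for v in row]
    hist = [0] * 256
    for v in flat:
        hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in image]
    # Look-up table mapping each old intensity to its equalized intensity.
    lut = [max(0, round((c - cdf_min) * 255 / (n - cdf_min))) for c in cdf]
    return [[lut[v] for v in row] for row in image]


raw = [[1000, 2000], [3000, 5000]]  # made-up 16-bit style values
eight_bit = normalize_to_uint8(raw)
equalized = equalize_histogram(eight_bit)
```

In this toy case the four distinct intensities are spread evenly over the 0-255 range after equalization, which is the contrast-enhancing effect the text describes.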
Note that, aside from de-identification and conversion to JPEG, no filtering or processing of the images was performed. Figure 1 provides a comparison of an image read directly from the DICOM and the histogram equalized JPEG format file. The default parameters were used to display the DICOM image.
3 Labeling of the reports
Radiology reports at the source hospital are semi-structured, with radiologists documenting their interpretations in titled sections. The structure of these reports is generally consistent through the use of standardized documentation templates, though it can drift over time as the templates change. There can also be some inter-reporter variability, as the structure of the reports is not enforced by the user interface and can be overridden by the user. The two primary sections of interest are findings, a natural language description of the important aspects in the image, and impression, a short summary of the most immediately relevant findings. Labels for the images were derived from either the impression section, the findings section (if impression was not present), or the final section of the report (if neither impression nor findings sections were present). Of the total 227,835 reports, 189,561 (83.2%) had an impression section, 27,684 (12.2%) had a findings section, and 10,514 (4.6%) had an equivalent section not explicitly labeled as findings or impression. 8 of the reports did not have text for labeling.
Labels were derived using two open source labeler tools: NegBio and CheXpert [12, 6]. NegBio is an open-source rule-based tool for negation and uncertainty detection in radiology reports. NegBio takes as input a sentence with pre-tagged mentions of medical findings, and determines whether a specific finding is negative or uncertain. More detail is provided in the NegBio article. CheXpert is a rule-based classifier which proceeds in three stages: (1) extraction, (2) classification, and (3) aggregation. In the extraction stage, all mentions of a label are identified, including alternate spellings, synonyms, and abbreviations (e.g. for pneumothorax, the words “pneumothoraces” and “ptx” would also be captured). Mentions are then classified as positive, uncertain, or negative using local context. Finally, aggregation is necessary as there may be multiple mentions of a label. More detail is provided in the CheXpert article.
In the labeling of reports for MIMIC-CXR-JPG, the mention patterns defined in CheXpert were used for NegBio. As such, the categories overlap with the CheXpert dataset, but not the ChestX-ray14 dataset. Example reports with labels are shown in Table 1. Table 2 shows the frequency of various labels in the reports in the majority subset of the images. A fourth category, “Disagreement”, highlights instances where the CheXpert and NegBio tools disagreed on the label.
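The section fallback described above (impression, then findings, then the final section) might be sketched as follows. This is only an illustration under stated assumptions: the header pattern (an upper-case title followed by a colon at the start of a line) and the function names split_sections and text_for_labeling are hypothetical, and the actual report parsing used for the dataset may differ.

```python
import re

# Hypothetical header pattern: an upper-case title followed by a colon.
SECTION_HEADER = re.compile(r"^[ \t]*([A-Z][A-Z \t]*):", re.MULTILINE)


def split_sections(report):
    """Return (lower-cased title, text) pairs in the order they appear."""
    matches = list(SECTION_HEADER.finditer(report))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections.append((m.group(1).strip().lower(), report[m.end():end].strip()))
    return sections


def text_for_labeling(report):
    """Pick impression, else findings, else the final section; None if no text."""
    sections = split_sections(report)
    by_title = dict(sections)
    for title in ("impression", "findings"):
        if by_title.get(title):
            return title, by_title[title]
    if sections and sections[-1][1]:
        return "other", sections[-1][1]
    return None


example = "FINDINGS: Lungs are clear.\nIMPRESSION: No acute cardiopulmonary process."
chosen = text_for_labeling(example)
```

Returning the section name alongside the text makes it easy to tally how many reports fell into each of the three cases.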
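To make the three stages concrete, here is a deliberately tiny toy labeler for a single finding. It is not the CheXpert or NegBio implementation: the phrase lists are invented for illustration, classification looks only at cue words in the same sentence, and the aggregation priority (positive over uncertain over negative) is an assumption made for this sketch.

```python
import re

# Toy phrase lists; the real tools use far larger curated rule sets.
MENTIONS = {"Pneumothorax": {"pneumothorax", "pneumothoraces", "ptx"}}
NEGATION_CUES = {"no", "without"}
UNCERTAINTY_CUES = {"possible", "may", "could", "cannot"}


def tokens(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def extract(sentence, label):
    """Stage 1: does the sentence mention the label under any known spelling?"""
    return any(word in MENTIONS[label] for word in tokens(sentence))


def classify(sentence):
    """Stage 2: classify a mention from cue words in the same sentence."""
    words = set(tokens(sentence))
    if words & UNCERTAINTY_CUES:
        return "uncertain"
    if words & NEGATION_CUES:
        return "negative"
    return "positive"


def aggregate(classes):
    """Stage 3: collapse per-mention classes into one report-level label."""
    for level in ("positive", "uncertain", "negative"):  # assumed priority
        if level in classes:
            return level
    return None  # blank: the label is never mentioned


def label_report(report, label):
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", report) if s]
    return aggregate([classify(s) for s in sentences if extract(s, label)])


result = label_report("Small left pneumothoraces. No ptx on the right.", "Pneumothorax")
```

The example report contains one positive and one negated mention, which is exactly the situation the aggregation stage exists to resolve.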
Table 1: Example reports with labels.
|Section||Report text||Labels|
|Impression||No evidence of acute cardiopulmonary process.||No Finding|
|Findings||The left lung is relatively well aerated and clear. The right hemithorax is markedly opacified with volume loss, circumferential pleural thickening and pleural fluid with near complete opacification of the right lung with right basal pleural catheter noted. Hydropneumothorax previously seen is not as well evaluated on this not fully upright film. Cardiac contours are somewhat obscured but unremarkable.||No Cardiomegaly; No Enlarged Cardiomediastinum|
|Other||Cardiac size is top normal. Bibasilar opacities, larger on the left side, could be due to atelectasis but superimposed infection cannot be excluded. If any, there is a small right pleural effusion. There is elevation of the right hemidiaphragm. There is mild vascular congestion.||No Cardiomegaly|
Table 2: Frequency of labels in the reports.
|Label||Positive||Negative||Uncertain||Disagreement|
|Atelectasis||45,088 (19.8%)||937 (0.4%)||9,897 (4.3%)||1,744 (0.8%)|
|Cardiomegaly||39,094 (17.2%)||15,860 (7.0%)||5,924 (2.6%)||5,924 (2.6%)|
|Consolidation||10,487 (4.6%)||7,939 (3.5%)||3,022 (1.3%)||1,628 (0.7%)|
|Edema||26,455 (11.6%)||25,246 (11.1%)||11,781 (5.2%)||2,351 (1.0%)|
|Enlarged Cardiomediastinum||7,004 (3.1%)||5,271 (2.3%)||9,307 (4.1%)||255 (0.1%)|
|Fracture||3,768 (1.7%)||880 (0.4%)||299 (0.1%)||884 (0.4%)|
|Lung Lesion||6,129 (2.7%)||842 (0.4%)||1,020 (0.4%)||296 (0.1%)|
|Lung Opacity||50,916 (22.3%)||2,868 (1.3%)||2,110 (0.9%)||2,531 (1.1%)|
|No Finding||75,163 (33.0%)||-||-||3,906 (1.7%)|
|Pleural Effusion||53,188 (23.3%)||27,072 (11.9%)||5,345 (2.3%)||1,667 (0.7%)|
|Pleural Other||1,961 (0.9%)||120 (0.1%)||728 (0.3%)||93 (0.0%)|
|Pneumonia||15,769 (6.9%)||24,205 (10.6%)||17,789 (7.8%)||1,422 (0.6%)|
|Pneumothorax||9,317 (4.1%)||42,335 (18.6%)||868 (0.4%)||1,328 (0.6%)|
|Support Devices||65,637 (28.8%)||3,070 (1.3%)||96 (0.0%)||1,831 (0.8%)|
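As a rough illustration of how a “Disagreement” count for one finding could be tallied from the output of two labelers, consider the sketch below. The function names are hypothetical, and counting a class only when both tools agree is an assumption of this sketch rather than a documented detail of how Table 2 was built. The formatting helper uses the study count quoted in the abstract (227,827) as the denominator, which is likewise an assumption.

```python
from collections import Counter


def tally(chexpert_labels, negbio_labels):
    """Count agreed classes and disagreements across reports for one finding."""
    counts = Counter()
    for a, b in zip(chexpert_labels, negbio_labels):
        if a != b:
            counts["disagreement"] += 1
        elif a is not None:  # both blank means the finding is not mentioned
            counts[a] += 1
    return counts


def as_percent(count, n_reports):
    """Format a count in the 'N (P%)' style used in the table."""
    return f"{count:,} ({100 * count / n_reports:.1f}%)"


counts = tally(
    ["positive", "negative", None, "uncertain"],
    ["positive", "uncertain", None, "uncertain"],
)
cell = as_percent(45088, 227827)
```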
4 Training, validation, and test sets
To ensure consistent evaluation of models, we have organized the data into training, validation, and test sets. The test set contains all studies for patients who had at least one report labelled in our manual review. We are not publicly releasing the test set. The validation set contains a random set of 500 patients and all of their associated studies. This set is made publicly available in a separate ‘valid’ folder. Finally, all remaining studies are made available in the training set. Table 3 provides summary information for studies in the three datasets. Note the enrichment of findings in the test set caused by the stratified sampling done to ensure sufficient coverage of all pathologies.
Table 3: Summary of the training, validation, and test sets.
|Statistic||Train||Validation||Test|
|Number of images||368960||2991||5159|
|Frontal images||248020 (67.2%)||2041 (68.2%)||3653 (70.8%)|
|Lateral images||120795 (32.7%)||949 (31.7%)||1502 (29.1%)|
|Other images||145 (0.0%)||1 (0.0%)||4 (0.1%)|
|Number of studies||222758||1808||3269|
|Studies with a finding||170420 (76.5%)||1394 (77.1%)||2912 (89.1%)|
|Number of patients||64586||500||293|
|Patients with a finding||44157 (68.4%)||344 (68.8%)||288 (98.3%)|
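The patient-level split described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function name, the fixed seed, and the use of Python's random module are assumptions. The key property it demonstrates is that assignment happens per patient, so all studies for a patient land in the same set.

```python
import random


def split_patients(patient_ids, reviewed_patient_ids, n_validate=500, seed=0):
    """Assign each patient (and therefore all of their studies) to one set."""
    everyone = set(patient_ids)
    # Patients with at least one manually reviewed report form the test set.
    test = set(reviewed_patient_ids) & everyone
    remaining = sorted(everyone - test)
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    validate = set(rng.sample(remaining, n_validate))
    assignment = {}
    for pid in everyone:
        if pid in test:
            assignment[pid] = "test"
        elif pid in validate:
            assignment[pid] = "validate"
        else:
            assignment[pid] = "train"
    return assignment


# Made-up identifiers: 1,000 patients, three of whom had a reviewed report.
assignment = split_patients(range(1000), {1, 2, 3}, n_validate=10)
sizes = {name: sum(1 for s in assignment.values() if s == name)
         for name in ("train", "validate", "test")}
```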
5 Validation of labels
A random set of reports was selected for validation. Stratified sampling was used to ensure adequate capture of the various pathologies. A total of 687 reports were reviewed by a board certified radiologist with 8 years of experience (ML) and manually labeled according to the 14 categories in CheXpert. The labeling process followed guidelines set forth by the authors of the CheXpert labeler and described therein.
The two label algorithms were evaluated in three tasks: mention extraction, negation detection, and uncertainty detection. For the mention extraction task, any assigned label (positive, negative, or uncertain) is considered a positive prediction, while blank (no mention) is considered a negative prediction. For negation detection, negated labels are positive while all other labels are negative. Finally, for uncertainty detection, uncertain labels are positive while all other labels are negative. The harmonic mean of the sensitivity and positive predictive value, referred to as the F1 score, was calculated for each group independently. Table 4 lists the performance for the mentioning of a label, Table 5 lists the performance for uncertainty classification, and Table 6 lists the performance for negation classification.
|Finding||Precision||Recall||F1||Number of positive cases|
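The three evaluation tasks described in this section differ only in how a label is reduced to a binary outcome before precision (positive predictive value), recall (sensitivity), and F1 are computed. The sketch below shows that reduction for a single finding; the function names and the toy label lists are hypothetical, and blank (no mention) is represented as None.

```python
def binarize(label, task):
    """Map a label (positive/negative/uncertain/None) to a binary outcome."""
    if task == "mention":
        return label is not None
    if task == "negation":
        return label == "negative"
    if task == "uncertainty":
        return label == "uncertain"
    raise ValueError(f"unknown task: {task}")


def precision_recall_f1(predicted, reference, task):
    pairs = [(binarize(p, task), binarize(r, task))
             for p, r in zip(predicted, reference)]
    tp = sum(p and r for p, r in pairs)
    fp = sum(p and not r for p, r in pairs)
    fn = sum(r and not p for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for one finding across five reports (made up for illustration).
predicted = ["positive", "negative", None, "uncertain", "negative"]
reference = ["positive", "negative", "negative", None, "uncertain"]
mention_scores = precision_recall_f1(predicted, reference, "mention")
```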
6 Data availability
All data are made available on PhysioNet (https://www.physionet.org/content/mimic-cxr-jpg/) [7, 4]. Use of the dataset is free to all researchers after signing of a data use agreement which stipulates, among other items, that (1) the user will not share the data, (2) the user will make no attempt to reidentify individuals, and (3) any publication which makes use of the data will also make the relevant code available.
MIMIC-CXR-JPG is wholly derived from MIMIC-CXR (https://www.physionet.org/content/mimic-cxr/). The source data, MIMIC-CXR, contains the same images in DICOM format with the free-text radiology reports which were the source of the labels. Due to the sensitivity of this dataset, access will require completion of a training course in human subjects research, as is the process for MIMIC-III and eICU-CRD.
Code used to generate MIMIC-CXR-JPG and the summaries in this paper has been made publicly available (https://github.com/MIT-LCP/mimic-cxr/).
MIMIC-CXR-JPG is a large, publicly available dataset of chest radiographs from over 220,000 studies performed at the BIDMC. The dataset contains labels for a number of common pathologies and will provide a benchmark for a number of medically relevant computer vision tasks.
We would like to acknowledge the Stanford Machine Learning Group and the Stanford AIMI center for their help in running the CheXpert labeler and for their insight into the work; in particular we would like to thank Jeremy Irvin and Pranav Rajpurkar. We would also like to acknowledge the BIDMC for their continued collaboration.
This work was supported by grant NIH-R01-EB017205 from the National Institutes of Health. This work was also supported by the Intramural Research Programs and grant K99LM013001 of the NIH National Library of Medicine.
The MIT Laboratory for Computational Physiology received funding from Philips Healthcare to create the database described in this paper.
[1] (2015) Diagnostic radiology in Liberia: a country report. Journal of Global Radiology 1 (2), pp. 6.
[2] (2017) Improving patient safety: avoiding unread imaging exams in the national VA enterprise electronic health record. Journal of Digital Imaging 30 (3), pp. 309–313.
[3] (2015) Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23 (2), pp. 304–310.
[4] (2000) PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220.
[5] (2009) The unreasonable effectiveness of data. IEEE Intelligent Systems 24 (2), pp. 8–12.
[6] (2019) CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Thirty-Third AAAI Conference on Artificial Intelligence.
[7] (2019) MIMIC-CXR-JPG Database. PhysioNet.
[8] (2019) MIMIC-CXR Database. PhysioNet.
[9] (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035.
[10] (2019) The MIMIC-CXR Code Repository v2.0.0. Zenodo.
[11] (2019-07) Pydicom v1.3.0. Zenodo.
[12] (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings 2017, pp. 188.
[13] (2018) The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data 5.
[14] (2017) Radiologist shortage leaves patient care at risk, warns royal college. BMJ: British Medical Journal (Online) 359.
[15] (2015) The US radiologist workforce: an analysis of temporal and geographic variation by using large national datasets. Radiology 279 (1), pp. 175–184.
[16] (2018) A county-level analysis of the US radiologist workforce: physician supply and subspecialty characteristics. Journal of the American College of Radiology 15 (4), pp. 601–606.
[17] (2015) Imaging in the land of 1000 hills: Rwanda radiology country report. Journal of Global Radiology 1 (1), pp. 5.
[18] (2000) Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. American Journal of Roentgenology 174 (1), pp. 71–74.
[19] (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 843–852.
[20] (2017) ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 3462–3471.