Semi-Supervised Natural Language Approach for Fine-Grained Classification of Medical Reports

Abstract

Although machine learning has become a powerful tool to augment doctors in clinical analysis, the immense amount of labeled data needed to train supervised learning approaches makes each development task time and resource intensive. The vast majority of dense clinical information is stored in written reports detailing pertinent patient information. Natural language data is difficult to use for standard model development because of the complex nature of the modality. In this research, a model pipeline was developed that uses an unsupervised approach to train an encoder language model, a recurrent network, to generate document encodings. These encodings are then passed as features into a decoder-classifier model that requires orders of magnitude less labeled data than previous approaches to differentiate accurately between fine-grained disease classes. The language model was trained on unlabeled radiology reports from the Massachusetts General Hospital Radiology Department (n=218,159) and terminated with a loss of 1.62. The classification models were trained on three labeled datasets of reported head CT studies, covering large vessel occlusion (n=1403), acute ischemic stroke (n=331), and intracranial hemorrhage (n=4350), to identify a variety of findings directly from the radiology report data, resulting in AUCs of 0.98, 0.95, and 0.99, respectively. The output encodings can also be used in conjunction with imaging data to create models that process multiple modalities. The ability to automatically extract relevant features from textual data allows for faster model development and integration of the textual modality, making clinical reports a more viable input for more encompassing and accurate deep learning models.
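
The two-stage structure described in the abstract (a frozen encoder that turns each report into a fixed-length vector, followed by a small classifier trained on a few labeled reports and scored by AUC) can be illustrated with a minimal sketch. This is not the authors' implementation: `encode` is a hypothetical stand-in that uses hashed word counts instead of a trained recurrent language model, the logistic-regression classifier is an assumed choice, and the example reports and labels are invented purely for illustration.

```python
import math
import zlib

DIM = 64  # size of the stand-in document encoding (an assumption)


def encode(report):
    """Hypothetical stand-in for the frozen encoder: L2-normalised hashed word counts.

    In the paper this role is played by a recurrent language model trained
    on unlabeled reports; here it is replaced by a fixed, untrained mapping.
    """
    vec = [0.0] * DIM
    for token in report.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression on the fixed encodings by batch gradient descent."""
    weights, bias = [0.0] * DIM, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented toy reports and labels, for illustration only (1 = hemorrhage reported).
reports = [
    "acute intracranial hemorrhage in the left frontal lobe",
    "no acute intracranial abnormality",
    "large subdural hemorrhage with midline shift",
    "unremarkable head ct without acute findings",
]
labels = [1, 0, 1, 0]

features = [encode(r) for r in reports]          # stage 1: frozen encoder
weights, bias = train_classifier(features, labels)  # stage 2: small classifier
scores = [predict(weights, bias, x) for x in features]
train_auc = auc(scores, labels)
```

The point of the split is that only `train_classifier` sees labels; the encoder is fixed beforehand, which is why the labeled set can be small. An AUC computed on the training reports, as here, only checks that the code runs and says nothing about generalization.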
