Learning structures of the French clinical language: development and validation of word embedding models using 21 million clinical reports from electronic health records

Abstract

Background: Clinical studies using real-world data may benefit from exploiting clinical reports, a particularly rich albeit unstructured medium. To that end, natural language processing (NLP) can extract relevant information. Methods based on transfer learning using pre-trained language models have achieved state-of-the-art results in most NLP applications; however, publicly available models lack exposure to speciality languages, especially in the medical field.

Objective: We aimed to evaluate the impact of adapting a language model to French clinical reports on downstream medical NLP tasks.

Methods: We leveraged a corpus of 21M clinical reports collected from August 2017 to July 2021 at the Greater Paris University Hospitals (APHP) to produce two CamemBERT architectures on speciality language: one retrained from scratch and the other using CamemBERT as its initialisation. We used two French annotated medical datasets to compare our language models to the original CamemBERT network, evaluating the statistical significance of improvement with the Wilcoxon test.

Results: Our models pretrained on clinical reports increased the average F1-score on APMed (an APHP-specific task) by 3 percentage points to 91%, a statistically significant improvement. They also achieved performance comparable to the original CamemBERT on QUAERO. These results hold true for the fine-tuned and from-scratch versions alike, starting from very few pre-training samples.

Conclusions: We confirm previous literature showing that adapting generalist pre-trained language models such as CamemBERT on speciality corpora improves their performance for downstream clinical NLP tasks. Our results suggest that retraining from scratch does not induce a statistically significant performance gain compared to fine-tuning.
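The abstract names the Wilcoxon test but does not say which variant was used or how the samples were paired. As a minimal sketch, assuming the paired signed-rank variant applied to F1-scores from repeated runs of a baseline and an adapted model, the comparison could look like the following. The function name `wilcoxon_signed_rank` and the scores are hypothetical and made up for illustration; they are not results from the paper.

```python
from itertools import product


def wilcoxon_signed_rank(xs, ys):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks. All 2**n sign patterns
    are enumerated, so this is only suitable for small n.
    """
    diffs = [y - x for x, y in zip(xs, ys) if y != x]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    # Test statistic: sum of ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null hypothesis every sign pattern is equally likely.
    # Count the patterns at least as far from the mean as the observation.
    mean = sum(ranks) / 2
    observed = abs(w_plus - mean)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - mean) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical F1-scores (in percent) over six paired runs.
baseline_f1 = [88, 87, 89, 88, 86, 90]
adapted_f1 = [91, 89, 93, 93, 92, 97]

w_plus, p_value = wilcoxon_signed_rank(baseline_f1, adapted_f1)
print(w_plus, p_value)  # prints: 21.0 0.03125
```

With six pairs that all favour the adapted model, the two-sided p-value is 2/64, the smallest value the exact test can produce at that sample size. Reaching a stricter threshold would require more paired runs.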
