Abstract
In this study, we investigate the potential of Large Language Models to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains. Drawing on the wealth of the UMLS knowledge graph and harnessing cutting-edge Large Language Models, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of three steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase. Through rigorous evaluations via the extensive BioLORD testing suite and diverse downstream tasks, we demonstrate consistent and substantial performance improvements over the previous state of the art (e.g., +2 pts on MedSTS, +2.5 pts on MedNLI-S, +6.1 pts on EHR-Rel-B). Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages. Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 become a valuable tool for future biomedical applications.