BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Abstract

Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in machine learning, extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, as deep learning models require a large amount of training data, applying deep learning to biomedical text mining is often unsuccessful due to the lack of training data in biomedical fields. Recent research on training contextualized language representation models on text corpora sheds light on the possibility of leveraging a large number of unannotated biomedical text corpora. We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. Based on the BERT architecture, BioBERT effectively transfers the knowledge from a large amount of biomedical texts to biomedical text mining models with minimal task-specific architecture modifications. While BERT shows competitive performance with previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (1.86% absolute improvement), biomedical relation extraction (3.33% absolute improvement), and biomedical question answering (9.61% absolute improvement). We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
