BioMegatron: Larger Biomedical Domain Language Model

Abstract

There has been an influx of biomedical domain-specific language models, showing language models pre-trained on biomedical text perform better on biomedical domain benchmarks than those trained on general domain text corpora such as Wikipedia and Books. Yet, most works do not study the factors affecting each domain language application deeply. Additionally, the study of model size on domain-specific models has been mostly missing. We empirically study and evaluate several factors that can affect performance on domain language applications, such as the sub-word vocabulary set, model size, pre-training corpus, and domain transfer. We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus, contributing to our understanding of domain language model applications. We demonstrate noticeable improvements over the previous state-of-the-art (SOTA) on standard biomedical NLP benchmarks of named entity recognition, relation extraction, and question answering. Model checkpoints and code are available at [ngc.nvidia.com] and [github.com/NVIDIA/NeMo].
