Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.