Abstract
Entity recognition is a critical first step for a number of clinical NLP applications, such as entity linking and relation extraction. We present the first attempt to apply state-of-the-art entity recognition approaches to a newly released dataset, MedMentions. This dataset contains over 4000 biomedical abstracts, annotated for UMLS semantic types. In comparison to existing datasets, MedMentions contains a far greater number of entity types, and thus represents a more challenging but more realistic scenario. We explore a number of relevant dimensions, including the use of contextual versus non-contextual word embeddings, general versus domain-specific unsupervised pre-training, and different deep learning architectures. We contrast our results against the well-known i2b2 2010 entity recognition dataset, and propose a new method to combine general and domain-specific information. While producing a state-of-the-art result for the i2b2 2010 task (F1 = 0.90), our results on MedMentions are significantly lower (F1 = 0.63), suggesting there is still plenty of opportunity for improvement on this new data.