CBAG: Conditional Biomedical Abstract Generation

Abstract

Biomedical research papers use significantly different language and jargon when compared to typical English text, which reduces the utility of pre-trained NLP models in this domain. Meanwhile, Medline, a database of biomedical abstracts, introduces nearly a million new documents per year. Applications that could benefit from understanding this wealth of publicly available information, such as scientific writing assistants, chat-bots, or descriptive hypothesis generation systems, require new domain-centered approaches. A conditional language model, one that learns the probability of words given some a priori criteria, is a fundamental building block in many such applications. We propose a transformer-based conditional language model with a shallow encoder "condition" stack, and a deep "language model" stack of multi-headed attention blocks. The condition stack encodes metadata used to alter the output probability distribution of the language model stack. We sample this distribution in order to generate biomedical abstracts given only a proposed title, an intended publication year, and a set of keywords. Using typical natural language generation metrics, we demonstrate that this proposed approach is more capable of producing non-trivial relevant entities within the abstract body than the 1.5B parameter GPT-2 language model.
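
As a rough illustration of what "the probability of words given some a priori criteria" means, a conditional language model is commonly written as an autoregressive factorization. The notation here is assumed for illustration and is not taken from the paper: x_1, ..., x_T are the tokens of the abstract, and c stands for the encoded condition (title, publication year, and keywords).

```latex
% Assumed notation: x_t = t-th abstract token, c = encoded condition
% (title, year, keywords). A standard autoregressive factorization,
% shown as a sketch and not necessarily the paper's exact formulation.
\[
  p(x_1, \dots, x_T \mid c) \;=\; \prod_{t=1}^{T} p\!\left(x_t \mid x_1, \dots, x_{t-1},\, c\right)
\]
```

Under this reading, generating an abstract amounts to sampling each x_t in turn from the distribution on the right-hand side, with c held fixed.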
