A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design

Abstract

AI-driven discovery can greatly reduce design time and enhance the effectiveness of new therapeutics. Models that use simulators explore broad design spaces but risk violating implicit constraints because they lack experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60% of the proposed molecules had a high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts and appropriate entity representations (i.e., SMILES or RefSeq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutics Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform the larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at https://huggingface.co/datasets/medexanon/Medex, and will provide expanded versions as available literature grows.
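The abstract does not spell out how a learned prior enters the optimization loop. As a minimal sketch of one possibility, assume a scalar objective and a classifier that returns a risk probability (for example, predicted mutagenicity) for each candidate SMILES string: candidates are first filtered by predicted risk and then ranked by objective. The names `constrained_top_k`, `objective`, `risk`, and `max_risk`, and all toy scores below, are hypothetical illustrations and are not part of the released dataset or models.

```python
from typing import Callable, Iterable, List, Tuple


def constrained_top_k(
    candidates: Iterable[str],
    objective: Callable[[str], float],
    risk: Callable[[str], float],
    max_risk: float = 0.5,
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k best-scoring candidates whose predicted risk is <= max_risk."""
    # Keep only candidates that satisfy the risk constraint.
    feasible = [(c, objective(c)) for c in candidates if risk(c) <= max_risk]
    # Rank the survivors by objective value, best first.
    feasible.sort(key=lambda pair: pair[1], reverse=True)
    return feasible[:k]


# Toy stand-ins for an optimizer's objective and a learned risk model.
# These numbers are made up for illustration; they are not real predictions.
toy_objective = {"CCO": 0.40, "c1ccccc1N": 0.90, "CC(=O)O": 0.70}
toy_risk = {"CCO": 0.05, "c1ccccc1N": 0.80, "CC(=O)O": 0.10}

best = constrained_top_k(
    toy_objective,                      # iterating a dict yields its keys
    objective=lambda s: toy_objective[s],
    risk=lambda s: toy_risk[s],
    max_risk=0.5,
    k=2,
)
unconstrained = constrained_top_k(
    toy_objective,
    objective=lambda s: toy_objective[s],
    risk=lambda s: toy_risk[s],
    max_risk=1.0,
    k=1,
)
print(best)
print(unconstrained)
```

In this toy setting the highest-scoring candidate is excluded once the risk threshold is applied, which mirrors the trade-off described above: the constrained proposals are safer at some cost in objective value. A hard threshold is only one design choice; a penalty term added to the objective would be another.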
