AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text

Abstract

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of coronavirus research. We report here on a project that leverages both human and artificial intelligence to detect references to drug-like molecules in free text. We engage non-expert humans to create a corpus of labeled text, use this labeled corpus to train a named entity recognition model, and employ the trained model to extract 10912 drug-like molecules from the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198875 papers. Performance analyses show that our automated extraction model can achieve performance on par with that of non-expert humans.
