Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics

  • 2021-02-18 18:50:31
  • Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, Marinka Zitnik
Machine learning for therapeutics is an emerging field with incredible opportunities for innovation and expansion. Despite the initial success, many key challenges remain open. Here, we introduce Therapeutics Data Commons (TDC), the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics. At its core, TDC is a collection of curated datasets and learning tasks that can translate algorithmic innovation into biomedical and clinical implementation. To date, TDC includes 66 machine learning-ready datasets from 22 learning tasks, spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All datasets and learning tasks are integrated and accessible via an open-source library. We envision that TDC can facilitate algorithmic and scientific advances and accelerate development, validation, and transition into production and clinical implementation. TDC is a continuous, open-source initiative, and we invite contributions from the research community. TDC is publicly available at


