Abstract
Motivation: Computational models that accurately identify high-affinity protein-chemical pairs can accelerate drug discovery pipelines. These models, trained on available protein-chemical interaction datasets, can be used to predict the binding affinity of an input protein-chemical pair. However, the training datasets may contain surface patterns, called dataset biases, which cause models to memorize dataset-specific biomolecule properties, instead of learning binding mechanisms. As a result, the prediction performance of models drops for unseen biomolecules. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve affinity prediction for novel biomolecules. DebiasedDTA uses ensemble learning and sample weight adaptation to identify and avoid biases and is applicable to most DTA prediction models.

Results: The results show that DebiasedDTA can boost models while predicting the interactions between unseen biomolecules. In addition, prediction performance for seen biomolecules also improves. The experiments also show that DebiasedDTA can augment DTA prediction models of different input and model structures and is able to avoid biases of different sources. The investigations of predictions reveal that model debiasing can diminish the importance of misleading features and can enable models to learn more from the proteins. DebiasedDTA is published as an open-source Python package to enable debiasing custom DTA prediction models with only two lines of code.