ProteinNet: a standardized data set for machine learning of protein structure

Abstract

Rapid progress in deep learning has spurred its application to bioinformatics problems including protein structure prediction and design. In classic machine learning problems like computer vision, progress has been driven by standardized data sets that facilitate fair assessment of new methods and lower the barrier to entry for non-domain experts. While data sets of protein sequence and structure exist, they lack certain components critical for machine learning, including high-quality multiple sequence alignments and insulated training / validation splits that account for deep but only weakly detectable homology across protein space. We have created the ProteinNet series of data sets to provide a standardized mechanism for training and assessing data-driven models of protein sequence-structure relationships. ProteinNet integrates sequence, structure, and evolutionary information in programmatically accessible file formats tailored for machine learning frameworks. Multiple sequence alignments of all structurally characterized proteins were created using substantial high-performance computing resources. Standardized data splits were also generated to emulate the difficulty of past CASP (Critical Assessment of protein Structure Prediction) experiments by resetting protein sequence and structure space to the historical states that preceded six prior CASPs. Utilizing sensitive evolution-based distance metrics to segregate distantly related proteins, we have additionally created validation sets distinct from the official CASP sets that faithfully mimic their difficulty. ProteinNet thus represents a comprehensive and accessible resource for training and assessing machine-learned models of protein structure.
