ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Abstract

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112-times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8-states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. The official GitHub repository: https://github.com/agemagician/ProtTrans
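
The Q3, Q8, Q10 and Q2 figures above are accuracies over 3, 8, 10 and 2 classes, expressed as percentages. As a minimal sketch of how such a score can be computed, assuming it is the plain fraction of positions (or proteins) whose predicted class matches the observed class: the function name `q_accuracy` and the example strings are hypothetical and are not taken from the ProtTrans code base, and the evaluation in the paper may aggregate over data sets differently.

```python
from typing import Sequence


def q_accuracy(predicted: Sequence, observed: Sequence) -> float:
    """Return the percentage of positions where predicted == observed.

    Works for per-residue labels (e.g. a 3-state secondary structure
    string) as well as per-protein labels (e.g. a list of compartments).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if len(observed) == 0:
        raise ValueError("inputs must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 3-state example: H = helix, E = strand, C = other.
observed = "CCHHHHEEEC"
predicted = "CCHHHCEEEC"  # one residue (position 6) is mispredicted
print(q_accuracy(predicted, observed))  # 9 of 10 residues match -> 90.0
```

The same function applies unchanged to the two-class membrane-bound versus water-soluble task by passing one label per protein instead of one label per residue.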
