SmilesT5: Domain-specific pretraining for molecular language models

Abstract

Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance in six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that data and computational efficiency can be improved by using these domain-specific pretraining tasks. Finally, the pretrained embeddings from the model can be used as fixed inputs into a downstream machine learning classifier and yield comparable performance to fine-tuning but with much lower computational overhead.
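
As an illustration of masked language modelling applied to SMILES, the sketch below tokenizes a SMILES string and replaces a random subset of tokens with T5-style sentinel tokens, producing an input/target pair for text-to-text training. It is a generic single-token masking example, not the domain-specific pretraining tasks presented in this work; the regex tokenizer, the `<extra_id_N>` sentinel naming, the 15% mask rate and all function names are assumptions made for illustration.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption, not the paper's tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures,
# then any remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace random tokens with sentinels; return (inputs, targets).

    Each masked token gets its own sentinel. Real T5 span corruption
    also merges adjacent masked tokens into one span; that step is
    omitted here to keep the sketch short.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    inputs = list(tokens)
    targets = []
    for i, pos in enumerate(positions):
        sentinel = f"<extra_id_{i}>"
        inputs[pos] = sentinel
        targets.extend([sentinel, tokens[pos]])
    return inputs, targets


def unmask(inputs, targets):
    """Invert mask_tokens: restore each token in place of its sentinel."""
    fill = dict(zip(targets[0::2], targets[1::2]))
    return [fill.get(tok, tok) for tok in inputs]


if __name__ == "__main__":
    aspirin = tokenize("CC(=O)Oc1ccccc1C(=O)O")
    model_input, model_target = mask_tokens(aspirin)
    print("input :", " ".join(model_input))
    print("target:", " ".join(model_target))
```

The model would be trained to generate the target sequence from the input sequence; `unmask` is included only to show that the pair is lossless with respect to the original string.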
