Molecular Language Model as Multi-task Generator

Abstract

Molecule generation with desired properties has grown immensely in popularity by disruptively changing the way scientists design molecular structures and providing support for chemical and materials design. However, despite the promising outcome, previous machine learning-based deep generative models suffer from a reliance on complex, task-specific fine-tuning, limited-dimensional latent spaces, or the quality of expert rules. In this work, we propose MolGen, a pre-trained molecular language model that effectively learns and shares knowledge across multiple generation tasks and domains. Specifically, we pre-train MolGen with the chemical language SELFIES on more than 100 million unlabelled molecules. We further propose multi-task molecular prefix tuning across several molecular generation tasks and different molecular domains (synthetic & natural products) with a self-feedback mechanism. Extensive experiments show that MolGen obtains superior performance on well-known molecular generation benchmark datasets. Further analysis shows that MolGen can accurately capture the distribution of molecules, implicitly learn their structural characteristics, and efficiently explore the chemical space with the guidance of multi-task molecular prefix tuning. Code, datasets, and the pre-trained model will be available at https://github.com/zjunlp/MolGen.
