Translation between Molecules and Natural Language

Abstract

Joint representations between images and text have been deeply investigated in the literature. In computer vision, the benefits of incorporating natural language have become clear for enabling semantic-level control of images. In this work, we present $\textbf{MolT5}$, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. $\textbf{MolT5}$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Furthermore, since $\textbf{MolT5}$ pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Additionally, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. By interfacing molecules with natural language, we enable a higher semantic level of control over molecule discovery and understanding, a critical task for scientific domains such as drug discovery and material design. Our results show that $\textbf{MolT5}$-based models are able to generate outputs, both molecule and text, which in many cases are high quality and match the input modality. On molecule generation, our best model achieves 30% exact matching test accuracy (i.e., it generates the correct structure for about one-third of the captions in our held-out test set).
