Abstract
Joint representations between images and text have been deeply investigated in the literature. In computer vision, the benefits of incorporating natural language have become clear for enabling semantic-level control of images. In this work, we present $\textbf{MolT5}$, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. $\textbf{MolT5}$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Furthermore, since $\textbf{MolT5}$ pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Additionally, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. By interfacing molecules with natural language, we enable a higher semantic level of control over molecule discovery and understanding, a critical task for scientific domains such as drug discovery and material design. Our results show that $\textbf{MolT5}$-based models are able to generate outputs, both molecule and text, which in many cases are high quality and match the input modality. On molecule generation, our best model achieves 30% exact matching test accuracy (i.e., it generates the correct structure for about one-third of the captions in our held-out test set).