Abstract
Molecule-and-text cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representations, thereby improving performance in various scientific fields. However, most approaches align the two modalities globally, which may fail to capture fine-grained information, such as molecule-and-text fragments and stereoisomeric nuances, that is crucial for downstream tasks. Furthermore, such fine-grained information cannot be modeled with a similar global alignment strategy, because existing datasets lack annotations of the fine-grained fragments. In this paper, we propose Atomas, a hierarchical molecular representation learning framework that jointly learns representations from SMILES strings and text. We design a Hierarchical Adaptive Alignment model to automatically learn the fine-grained fragment correspondence between the two modalities and align these representations at three semantic levels. Atomas's end-to-end training framework supports both understanding and generating molecules, enabling a wider range of downstream tasks. Atomas achieves superior performance across 12 tasks on 11 datasets, outperforming 11 baseline models, which highlights the effectiveness and versatility of our method. Scaling experiments further demonstrate Atomas's robustness and scalability. Moreover, visualization and qualitative analysis, validated by human experts, confirm the chemical relevance of our approach. Code is released at https://github.com/yikunpku/Atomas.