Two-Stage Pretraining for Molecular Property Prediction in the Wild

Abstract

Molecular deep learning models have achieved remarkable success in property prediction, but they often require large amounts of labeled data. The challenge is that, in real-world applications, labels are extremely scarce, as obtaining them through laboratory experimentation is both expensive and time-consuming. In this work, we introduce MoleVers, a versatile pretrained molecular model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated labels are scarce. MoleVers employs a two-stage pretraining strategy. In the first stage, it learns molecular representations from unlabeled data through masked atom prediction and extreme denoising, a novel task enabled by our newly introduced branching encoder architecture and dynamic noise scale sampling. In the second stage, the model refines these representations through predictions of auxiliary properties derived from computational methods, such as the density functional theory or large language models. Evaluation on 22 small, experimentally-validated datasets demonstrates that MoleVers achieves state-of-the-art performance, highlighting the effectiveness of its two-stage framework in producing generalizable molecular representations for diverse downstream properties.
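
To make the first-stage objectives more concrete, the following is a minimal, illustrative sketch of how a single training example for masked atom prediction and coordinate denoising might be constructed. It is not the MoleVers implementation: the abstract does not specify the masking ratio, the noise distribution, how noise scales are sampled, or the branching encoder itself. The function name `corrupt_molecule`, the `[MASK]` placeholder, the default `mask_ratio`, and the choice to draw one noise scale per molecule uniformly from `[0, max_noise_scale]` are all assumptions made only for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder symbol for a hidden atom type


def corrupt_molecule(atoms, coords, mask_ratio=0.15, max_noise_scale=1.0, rng=None):
    """Build one pretraining example from an unlabeled molecule (illustrative only).

    atoms  : list of atom-type strings, e.g. ["C", "O", "H"]
    coords : list of [x, y, z] positions, one per atom
    Returns the corrupted inputs plus the targets a model would be asked to recover.
    """
    rng = rng or random.Random()
    n = len(atoms)

    # Masked atom prediction: hide a random subset of atom types.
    n_mask = min(n, max(1, round(mask_ratio * n)))
    masked_idx = sorted(rng.sample(range(n), n_mask))
    masked_set = set(masked_idx)
    masked_atoms = [MASK_TOKEN if i in masked_set else a for i, a in enumerate(atoms)]

    # Dynamic noise scale (assumption): draw a fresh scale for every example
    # instead of using one fixed standard deviation.
    noise_scale = rng.uniform(0.0, max_noise_scale)

    # Denoising: perturb each coordinate with Gaussian noise at that scale.
    noise = [[rng.gauss(0.0, noise_scale) for _ in xyz] for xyz in coords]
    noisy_coords = [[c + e for c, e in zip(xyz, eps)] for xyz, eps in zip(coords, noise)]

    return {
        "masked_atoms": masked_atoms,                     # model input
        "noisy_coords": noisy_coords,                     # model input
        "masked_idx": masked_idx,                         # positions to predict
        "atom_targets": [atoms[i] for i in masked_idx],   # masked-atom labels
        "noise_targets": noise,                           # denoising labels
        "noise_scale": noise_scale,
    }
```

In a sketch like this, a model would receive `masked_atoms` and `noisy_coords` and be trained to recover `atom_targets` and `noise_targets`; sampling a new `noise_scale` for each example exposes it to both small and large perturbations.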
