Abstract
Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional $\textit{de novo}$ generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on $\textit{de novo}$ molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.