Abstract
Designing de-novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de-novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.