Abstract
Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) top performance across different domains, matching or surpassing state-of-the-art specialist models. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.