A Diffusion Model to Shrink Proteins While Maintaining Their Function

  • 2025-11-10 18:46:24
  • Ethan Baron, Alan N. Amin, Ruben Weitzman, Debora Marks, Andrew Gordon Wilson

Abstract

Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body, because their sequences are too long. Shortening these sequences typically involves costly, time-consuming experimental campaigns. Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature. Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete. To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature. To do so, SCISOR trains a de-noiser to reverse a forward noising process that adds random insertions to natural sequences. As a generative model, SCISOR fits evolutionary sequence data competitively with previous large models. In evaluation, SCISOR achieves state-of-the-art predictions of the functional effects of deletions on ProteinGym. Finally, we use the SCISOR de-noiser to shrink long protein sequences, and show that its suggested deletions result in significantly more realistic proteins and more often preserve functional motifs than previous models of evolutionary sequences.
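
To make the insertion/deletion pairing concrete, here is a minimal Python sketch of a forward process that adds random insertions, together with the ideal reverse step that deletes them. It is an illustration under stated assumptions, not the SCISOR implementation: the function names, the uniform choice of insertion position and letter, and the example sequence are all hypothetical, and the real de-noiser has to predict which letters to delete rather than being handed the mask.

```python
import random

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def add_random_insertions(seq, n_insertions, rng):
    """Toy forward noising step (assumed uniform over positions and letters).

    Returns the noised sequence and a boolean mask that is True at every
    position holding an inserted letter.
    """
    letters = list(seq)
    inserted = [False] * len(letters)
    for _ in range(n_insertions):
        pos = rng.randint(0, len(letters))  # randint is inclusive at both ends
        letters.insert(pos, rng.choice(AMINO_ACIDS))
        inserted.insert(pos, True)  # keep the mask aligned with the letters
    return "".join(letters), inserted


def delete_marked(noised, inserted):
    """Ideal reverse step: drop exactly the letters flagged as insertions."""
    return "".join(c for c, is_ins in zip(noised, inserted) if not is_ins)


rng = random.Random(0)
original = "MKTAYIAKQR"  # hypothetical example sequence
noised, mask = add_random_insertions(original, 5, rng)
recovered = delete_marked(noised, mask)
```

In this toy setting the mask is known, so deleting the flagged letters always recovers the original sequence. A learned de-noiser only sees the noised sequence and must infer which letters look like insertions, which is what makes it usable for proposing deletions in natural proteins.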

 
