Evolutionary Profiles for Protein Fitness Prediction

  • 2025-10-08 17:46:02
  • Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen

Abstract

Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The code will be made publicly available at https://github.com/aim-uofa/EvoIF.
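To make the notion of log-odds scoring concrete, the following is a minimal sketch, not EvoIF itself. It assumes a simple within-family profile built from column frequencies of already-aligned homologs with a uniform pseudocount, and scores a mutant as the sum of log p(mutant residue) minus log p(wild-type residue) over mutated positions. The function names (`build_profile`, `log_odds_score`), the pseudocount value, and the toy sequences are all hypothetical; in the paper the probabilities come from a learned model rather than raw counts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_homologs, pseudocount=1.0):
    """Per-column amino-acid probabilities from equal-length aligned sequences.

    Assumption: a uniform pseudocount smooths unseen residues; gaps and
    non-standard characters are ignored.
    """
    length = len(aligned_homologs[0])
    profile = []
    for i in range(length):
        counts = Counter(
            seq[i] for seq in aligned_homologs if seq[i] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, wild_type, mutations):
    """Sum log p(mutant) - log p(wild type) over mutations such as 'A2G'.

    Positions in the mutation strings are 1-indexed.
    """
    score = 0.0
    for mutation in mutations:
        wt, pos, mut = mutation[0], int(mutation[1:-1]) - 1, mutation[-1]
        if wild_type[pos] != wt:
            raise ValueError(f"{mutation}: wild type has {wild_type[pos]}")
        score += math.log(profile[pos][mut]) - math.log(profile[pos][wt])
    return score


# Toy example with made-up homologs of a four-residue protein.
homologs = ["MAKL", "MAKL", "MGKL", "MAKI"]
profile = build_profile(homologs)
score = log_odds_score(profile, "MAKL", ["A2G"])
print(f"A2G log-odds: {score:.3f}")
```

A negative score means the profile considers the mutant residue less likely than the wild type at that position, which under this view is read as a predicted fitness decrease. Multi-site mutants are scored additively here, which is a simplifying assumption.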

 
