Concept Bottleneck Language Models for Protein Design

Abstract

We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model's decision-making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked protein language models due to their importance in drug discovery and the ability to validate our model's capabilities through real-world experiments and expert knowledge. We scale our CB-pLM from 24 million to 3 billion parameters, making them the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
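
As a rough illustration of the two properties named above (a concept layer that can be intervened on, and a linear map from concept values to token predictions), the following is a minimal NumPy sketch and not the authors' implementation. The dimensions, the random weights, and the function names `concept_values`, `token_logits`, and `intervene` are all hypothetical, and the sketch assumes the decoder sees only the concept values, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 16    # hypothetical encoder width
NUM_CONCEPTS = 4   # hypothetical number of interpretable concepts
VOCAB_SIZE = 20    # e.g. the 20 standard amino acids

# Random weights standing in for trained parameters (assumption).
W_concept = rng.normal(size=(HIDDEN_DIM, NUM_CONCEPTS))
W_decode = rng.normal(size=(NUM_CONCEPTS, VOCAB_SIZE))


def concept_values(hidden):
    """Bottleneck layer: one output neuron per concept."""
    return hidden @ W_concept


def token_logits(concepts):
    """Linear map from concept values to token logits."""
    return concepts @ W_decode


def intervene(concepts, index, value):
    """Return a copy of the concept vector with one concept overwritten."""
    edited = np.array(concepts, copy=True)
    edited[..., index] = value
    return edited


# A stand-in for the encoder output at one masked position.
hidden = rng.normal(size=(HIDDEN_DIM,))
concepts = concept_values(hidden)

# Raise concept 2 by 1.5 and observe the effect on the token logits.
DELTA = 1.5
edited = intervene(concepts, index=2, value=concepts[2] + DELTA)
logit_shift = token_logits(edited) - token_logits(concepts)

# Because the decoder is linear, the shift is DELTA times row 2 of W_decode.
print(np.allclose(logit_shift, DELTA * W_decode[2]))
```

Under these assumptions, the weight row for a concept directly states how much each token's logit moves per unit change in that concept, which is one way to read the interpretability and debugging claims.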
