(314e) Combining Protein Sequence and Structure Pretraining
AIChE Annual Meeting
2022 Annual Meeting
Food, Pharmaceutical & Bioengineering Division
Computational, Structure, Biophysical Protein Engineering
Tuesday, November 15, 2022 - 1:42pm to 2:00pm
Large pretrained protein language models have advanced the ability of machine-learning methods to predict protein structure and function from sequence, especially when labeled training data is sparse. Most modern self-supervised protein sequence pretraining uses a neural network model trained with either an autoregressive likelihood or the masked language modeling (MLM) task introduced for natural language by BERT (bidirectional encoder representations from transformers). For example, ESM-1b is a 650M-parameter transformer model, and CARP-640M is a 640M-parameter convolutional neural network (CNN), both trained on the MLM task using sequences from UniRef. While large pretrained models have had remarkable success in predicting the effects of mutations on protein fitness, predicting protein structure from sequence, and assigning functional annotations, pretraining often provides very little benefit on the kinds of datasets typically encountered in protein engineering. Furthermore, pretraining on sequences alone ignores other sources of information about proteins, including structure.
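As a rough illustration of the pretraining task described above, the following is a minimal PyTorch sketch of BERT-style masked language modeling on protein sequences: a fraction of residues is replaced by a mask token and the model is trained to reconstruct them. The vocabulary layout, model size, and masking rate are placeholder assumptions; this is not the ESM-1b or CARP training code.

# Minimal sketch of BERT-style masked language modeling on protein sequences.
# Illustrative only; vocabulary layout and model size are assumptions.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 20, 21                      # special token ids (assumed layout)
VOCAB = len(AMINO_ACIDS) + 2

def tokenize(seq):
    return torch.tensor([AMINO_ACIDS.index(a) for a in seq])

class SmallSequenceMLM(nn.Module):
    """Tiny transformer encoder with an MLM head (stand-in for a large model)."""
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def mlm_loss(model, tokens, mask_frac=0.15):
    """Mask a random fraction of positions and score the reconstruction."""
    masked = tokens.clone()
    is_masked = torch.rand(tokens.shape) < mask_frac
    masked[is_masked] = MASK
    logits = model(masked)
    # Only masked positions contribute to the loss.
    return nn.functional.cross_entropy(logits[is_masked], tokens[is_masked])

if __name__ == "__main__":
    model = SmallSequenceMLM()
    batch = torch.stack([tokenize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
                         tokenize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
    loss = mlm_loss(model, batch)
    loss.backward()                     # gradient for one pretraining step
    print(float(loss))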
Previous work shows that a graph neural network (GNN) that encodes structural information in addition to sequence context can also be used for masked language modeling of proteins, achieving much better performance on the reconstruction task than sequence-only models with a fraction of the data and model parameters (Figure 1). In this work, we show that a GNN masked language model pretrained on 19 thousand sequences and structures outperforms sequence-only pretraining on protein engineering tasks where a structure is available, including zero-shot mutant fitness prediction and tasks in the FLIP (Fitness Landscape Inference for Proteins) benchmark. We then show that using the output logits from a fixed CARP-640M, pretrained on 42M sequences, as input to the GNN further improves performance on both the pretraining MLM task and on downstream tasks. A sketch of this combination follows.
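The sketch below illustrates one way such a combination could be wired up: the per-residue output logits of a frozen sequence model are projected and added to token embeddings as node features on a k-nearest-neighbour residue graph, and a simple message-passing network predicts the masked residues. The graph construction, feature dimensions, and mean aggregation are illustrative assumptions and do not reproduce the GNN used in this work.

# Sketch of feeding the output logits of a frozen sequence MLM into a
# structure-aware GNN for masked residue prediction. Shapes and the
# aggregation scheme are assumptions, not the authors' architecture.
import torch
import torch.nn as nn

VOCAB = 22            # 20 amino acids + pad + mask (assumed layout)
MASK = 21

def knn_graph(coords, k=10):
    """Residue-level k-nearest-neighbour graph from C-alpha coordinates."""
    dist = torch.cdist(coords, coords)               # (L, L) pairwise distances
    dist.fill_diagonal_(float("inf"))                # exclude self-edges
    return dist.topk(k, largest=False).indices       # (L, k) neighbour indices

class StructureMLM(nn.Module):
    """Mean-aggregation message passing over the residue graph."""
    def __init__(self, d=64, rounds=3):
        super().__init__()
        self.embed_tok = nn.Embedding(VOCAB, d)
        self.embed_logits = nn.Linear(VOCAB, d)       # frozen-LM logits as node features
        self.msg = nn.ModuleList([nn.Linear(2 * d, d) for _ in range(rounds)])
        self.out = nn.Linear(d, VOCAB)

    def forward(self, masked_tokens, seq_logits, neighbours):
        h = self.embed_tok(masked_tokens) + self.embed_logits(seq_logits)
        for lin in self.msg:
            agg = h[neighbours].mean(dim=1)           # average neighbour states
            h = torch.relu(lin(torch.cat([h, agg], dim=-1)))
        return self.out(h)                            # per-residue logits

if __name__ == "__main__":
    L = 50
    coords = torch.randn(L, 3)                        # placeholder structure
    true_tokens = torch.randint(0, 20, (L,))
    masked = true_tokens.clone()
    hide = torch.rand(L) < 0.15
    masked[hide] = MASK
    with torch.no_grad():                             # the sequence model stays frozen
        seq_logits = torch.randn(L, VOCAB)            # stand-in for CARP-640M output
    model = StructureMLM()
    logits = model(masked, seq_logits, knn_graph(coords))
    loss = nn.functional.cross_entropy(logits[hide], true_tokens[hide])
    print(float(loss))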
Historically, most pretraining methods for proteins have treated the amino-acid sequence as text and borrowed methods from natural language processing. However, proteins are not sentences, and protein sequence databases contain additional data that can be useful for pretraining, including structure, annotations, ligand, substrate, or cofactor information, and free text. Integrating this information into pretrained models will be essential to leveraging all available information for protein engineering.