(314e) Combining Protein Sequence and Structure Pretraining
AIChE Annual Meeting
2022 Annual Meeting
Food, Pharmaceutical & Bioengineering Division
Computational, Structure, Biophysical Protein Engineering
Tuesday, November 15, 2022 - 1:42pm to 2:00pm
Large pretrained protein language models have advanced the ability of machine-learning methods to predict protein structure and function from sequence, especially when labeled training data is sparse. Most modern self-supervised protein sequence pretraining uses a neural network model trained with either an autoregressive likelihood or the masked language modeling (MLM) task introduced for natural language by BERT (bidirectional encoder representations from transformers). For example, ESM-1b is a 650M-parameter transformer model, and CARP-640M is a 640M-parameter convolutional neural network (CNN), both trained on the MLM task using sequences from UniRef. While large pretrained models have had remarkable success in predicting the effects of mutations on protein fitness, predicting protein structure from sequence, and assigning functional annotations, pretraining often provides very little benefit on the kinds of datasets typically encountered in protein engineering. Furthermore, pretraining on sequences alone ignores other sources of information about proteins, including structure.
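As a rough illustration of the pretraining task described above, the following is a minimal PyTorch sketch of BERT-style masked language modeling on protein sequences: a fraction of residues is replaced by a mask token and the model is trained to reconstruct them. The vocabulary layout, model size, and masking rate are placeholder assumptions; this is not the ESM-1b or CARP training code.

# Minimal sketch of BERT-style masked language modeling on protein sequences.
# Illustrative only; vocabulary layout and model size are assumptions.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 20, 21                      # special token ids (assumed layout)
VOCAB = len(AMINO_ACIDS) + 2

def tokenize(seq):
    return torch.tensor([AMINO_ACIDS.index(a) for a in seq])

class SmallSequenceMLM(nn.Module):
    """Tiny transformer encoder with an MLM head (stand-in for a large model)."""
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def mlm_loss(model, tokens, mask_frac=0.15):
    """Mask a random fraction of positions and score the reconstruction."""
    masked = tokens.clone()
    is_masked = torch.rand(tokens.shape) < mask_frac
    masked[is_masked] = MASK
    logits = model(masked)
    # Only masked positions contribute to the loss.
    return nn.functional.cross_entropy(logits[is_masked], tokens[is_masked])

if __name__ == "__main__":
    model = SmallSequenceMLM()
    batch = torch.stack([tokenize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
                         tokenize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
    loss = mlm_loss(model, batch)
    loss.backward()                     # gradient for one pretraining step
    print(float(loss))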
Previous work shows that a graph neural network (GNN) that encodes structural information in addition to sequence context can also be used for masked language modeling of proteins, achieving much better performance on the reconstruction task than sequence-only models with a fraction of the data and model parameters (Figure 1). In this work, we show that a GNN masked language model pretrained on 19 thousand sequences and structures outperforms sequence-only pretraining on protein engineering tasks where a structure is available, including zero-shot mutant fitness prediction and tasks in the FLIP (Fitness Landscape Inference for Proteins) benchmark. We then show that using the output logits from a fixed CARP-640M, pretrained on 42M sequences, as input to the GNN further improves performance on both the pretraining MLM task and on downstream tasks. A sketch of this combination follows.
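The sketch below illustrates one way such a combination could be wired up: the per-residue output logits of a frozen sequence model are projected and added to token embeddings as node features on a k-nearest-neighbour residue graph, and a simple message-passing network predicts the masked residues. The graph construction, feature dimensions, and mean aggregation are illustrative assumptions and do not reproduce the GNN used in this work.

# Sketch of feeding the output logits of a frozen sequence MLM into a
# structure-aware GNN for masked residue prediction. Shapes and the
# aggregation scheme are assumptions, not the authors' architecture.
import torch
import torch.nn as nn

VOCAB = 22            # 20 amino acids + pad + mask (assumed layout)
MASK = 21

def knn_graph(coords, k=10):
    """Residue-level k-nearest-neighbour graph from C-alpha coordinates."""
    dist = torch.cdist(coords, coords)               # (L, L) pairwise distances
    dist.fill_diagonal_(float("inf"))                # exclude self-edges
    return dist.topk(k, largest=False).indices       # (L, k) neighbour indices

class StructureMLM(nn.Module):
    """Mean-aggregation message passing over the residue graph."""
    def __init__(self, d=64, rounds=3):
        super().__init__()
        self.embed_tok = nn.Embedding(VOCAB, d)
        self.embed_logits = nn.Linear(VOCAB, d)       # frozen-LM logits as node features
        self.msg = nn.ModuleList([nn.Linear(2 * d, d) for _ in range(rounds)])
        self.out = nn.Linear(d, VOCAB)

    def forward(self, masked_tokens, seq_logits, neighbours):
        h = self.embed_tok(masked_tokens) + self.embed_logits(seq_logits)
        for lin in self.msg:
            agg = h[neighbours].mean(dim=1)           # average neighbour states
            h = torch.relu(lin(torch.cat([h, agg], dim=-1)))
        return self.out(h)                            # per-residue logits

if __name__ == "__main__":
    L = 50
    coords = torch.randn(L, 3)                        # placeholder structure
    true_tokens = torch.randint(0, 20, (L,))
    masked = true_tokens.clone()
    hide = torch.rand(L) < 0.15
    masked[hide] = MASK
    with torch.no_grad():                             # the sequence model stays frozen
        seq_logits = torch.randn(L, VOCAB)            # stand-in for CARP-640M output
    model = StructureMLM()
    logits = model(masked, seq_logits, knn_graph(coords))
    loss = nn.functional.cross_entropy(logits[hide], true_tokens[hide])
    print(float(loss))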
Historically, most pretraining methods for proteins have treated the amino-acid sequence as text and borrowed methods from natural language processing. However, proteins are not sentences, and protein sequence databases contain additional data that can be useful for pretraining, including structure, annotations, ligand, substrate, or cofactor information, and free text. Integrating this information into pretrained models will be essential to leveraging all available information for protein engineering.