Predicting Amyloid Fibrillation through Transfer Learning | AIChE

Drug discovery is a complex multi-objective optimization problem that involves balancing the biological activity and developability of target compounds. Recent advances in generative machine learning models have helped streamline this process, but these models lack precise control over target biophysical properties, many of which can hinder the development of a peptide therapeutic. Amyloid fibrils are a form of aggregate characterized by ordered stacks of peptides that form a fibrous structure. This matters for drug discovery because these aggregates are difficult to metabolize and are often less potent than free molecules. In this work, we develop a generalizable model that predicts amyloid fibrillation from protein language model (pLM) embeddings. We use ESM2, a pLM, to generate latent embeddings for sequences of interest, which are then passed to our model. Experimental fibrillation data are limited, and the largest public datasets consist of sequences that are much shorter than therapeutically relevant peptides. We explore transfer learning preprocessing strategies that allow the model to generalize to new sequence lengths, including mean-pooling and a modified convolutional neural network with attention weights, referred to as light attention (LA). Processed embeddings are passed to a standard multilayer perceptron (MLP), and predictions are scored against labelled data. These architectures demonstrate high predictive power when evaluated on two publicly available datasets and can help expedite the development of peptide-based pharmaceuticals.
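
The two preprocessing strategies named above both turn a per-residue embedding matrix of variable length into a fixed-size vector, which is what lets a single MLP handle sequences of different lengths. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation. The random matrices stand in for ESM2 embeddings, the toy dimensions and the helper names (`mean_pool`, `conv1d_same`, `light_attention_pool`, `random_kernel`) are hypothetical, and concatenating the attention-weighted sum with a max-pool is an assumption about how the light attention output is formed.

```python
import math
import random


def mean_pool(emb):
    """Average per-residue embeddings (L x D) into one vector of size D."""
    length, dim = len(emb), len(emb[0])
    return [sum(row[d] for row in emb) / length for d in range(dim)]


def conv1d_same(emb, kernel):
    """Zero-padded 1D convolution along the sequence axis.

    emb is L x D_in; kernel is K x D_in x D_out; the result is L x D_out.
    """
    length, width = len(emb), len(kernel)
    d_in, d_out = len(kernel[0]), len(kernel[0][0])
    half = width // 2
    out = [[0.0] * d_out for _ in range(length)]
    for pos in range(length):
        for k in range(width):
            src = pos + k - half
            if 0 <= src < length:  # positions outside the sequence count as zero
                for i in range(d_in):
                    x = emb[src][i]
                    for o in range(d_out):
                        out[pos][o] += x * kernel[k][i][o]
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def light_attention_pool(emb, value_kernel, score_kernel):
    """Attention-weighted pooling over sequence positions.

    One convolution produces values, another produces attention scores.
    Scores are softmaxed over positions per channel, and the weighted sum
    is concatenated with a max-pool (assumed), giving 2 * D_out numbers.
    """
    values = conv1d_same(emb, value_kernel)
    scores = conv1d_same(emb, score_kernel)
    attended, peaks = [], []
    for o in range(len(values[0])):
        weights = softmax([row[o] for row in scores])
        column = [row[o] for row in values]
        attended.append(sum(w * v for w, v in zip(weights, column)))
        peaks.append(max(column))
    return attended + peaks


def random_kernel(width, d_in, d_out, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[[rng.gauss(0.0, 0.1) for _ in range(d_out)]
             for _ in range(d_in)] for _ in range(width)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8  # toy width; stand-in for the pLM embedding dimension
    value_k = random_kernel(3, dim, 4, rng)
    score_k = random_kernel(3, dim, 4, rng)
    for length in (6, 40):  # a short training-like and a longer peptide-like input
        emb = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]
        pooled = light_attention_pool(emb, value_k, score_k)
        print(length, len(mean_pool(emb)), len(pooled))
```

In both cases the output size depends only on the channel count, never on the sequence length, so the pooled vector can be fed to the downstream MLP regardless of how long the peptide is. Mean-pooling weights every residue equally, while the attention variant can learn to emphasize particular positions.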