(548f) Using Machine Learning to Guide Protein Engineering for Biocatalysis
2019 AIChE Annual Meeting
Food, Pharmaceutical & Bioengineering Division
Protein Science and Engineering: Advances on Engineering the Physiochemical Properties of Proteins
Wednesday, November 13, 2019 - 2:00pm to 2:18pm
Developing biocatalysts with high activity is an important goal for many
biocatalytic processes, because more active proteins have a lower unit
production cost. Several in vivo experimental strategies for enhancing
protein overexpression have been developed, but they are time-consuming and
expensive, and they often fail for reasons that remain unclear (Idicula-Thomas
and Balaji, 2005; Magnan, et al., 2009). A better solution is highly desirable
and may ultimately be provided by a computational model that accurately
predicts the activity of any enzyme from its amino acid sequence and other
input information.
Unfortunately, only limited data on protein activity are currently available,
which makes such models difficult to develop. Since protein activity and
solubility are correlated for some proteins, the publicly available solubility
data may be used to develop models that predict protein solubility from
sequence. Such models could serve as a tool for indirectly predicting protein
activity from sequence. Predicting protein solubility from sequence has been
explored intensively in the literature, but all previously developed models
represent predicted solubility as a binary value, which is not suitable for
guiding experimental designs to improve protein solubility.

In this study, we first implemented a novel approach that predicted
protein solubility as continuous numerical values instead of binary ones. After
combining it with various machine learning algorithms, we achieved a
coefficient of determination (R²) of 0.4115 when a Support Vector
Machine (SVM) algorithm was used. Continuous solubility values are more
meaningful in protein engineering: they enable researchers to choose proteins
with higher predicted solubility for experimental validation, whereas binary
values cannot distinguish between proteins assigned the same value, and with
only two possible values many proteins share the same one. Based on the SVM model, two random
optimization algorithms, a Genetic Algorithm and an algorithm we designed
for Optimization of Protein Solubility (OPS), were implemented to find the
maximum predicted protein solubility when a small number of amino acid residues
is allowed to be added. Ten proteins with low solubility were selected as
targets for solubility improvement. According to the optimization
algorithms, most of the amino acids added were aspartate (D) and glutamate (E). Based
on our preliminary experimental validation, the solubility and activity of some
of these proteins were indeed substantially improved after the optimized
solubility enhancers (the amino acid sequences selected by the optimization
algorithms) were used. This study demonstrates that models built with
machine learning algorithms can be very useful in guiding protein engineering
for biocatalytic applications.

References:
Idicula-Thomas, S. and Balaji, P.V. (2005) Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci., 14, 582-592.
Magnan, C.N., et al. (2009) SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics, 25, 2200-2207.
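The optimization step described in the abstract (searching for a small number of added residues that maximizes predicted solubility) can be illustrated with a minimal Python sketch. It is not the authors' Genetic Algorithm or OPS, whose details are not given here; it swaps in a simple random-mutation hill climb. The scoring function `toy_solubility_score`, the helper `optimize_tag`, the tag length, and the example sequence are all hypothetical and stand in for the trained SVM model and the real target proteins.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_solubility_score(sequence: str) -> float:
    """Hypothetical stand-in for a trained regression model.

    Returns the fraction of acidic residues (D, E) minus the fraction of
    strongly hydrophobic residues (I, L, V, F, W). This is NOT the SVM
    model from the study; it only gives the search something to maximize.
    """
    if not sequence:
        return 0.0
    acidic = sum(sequence.count(aa) for aa in "DE")
    hydrophobic = sum(sequence.count(aa) for aa in "ILVFW")
    return (acidic - hydrophobic) / len(sequence)


def optimize_tag(protein, score_fn, tag_length=5, iterations=500, seed=0):
    """Random-mutation hill climb over a short tag appended to `protein`.

    Assumption: the added residues form one contiguous tag at the end of
    the sequence. Returns the best tag found and its predicted score.
    """
    rng = random.Random(seed)
    tag = [rng.choice(AMINO_ACIDS) for _ in range(tag_length)]
    best_score = score_fn(protein + "".join(tag))
    for _ in range(iterations):
        candidate = tag.copy()
        # Mutate one randomly chosen position of the tag.
        candidate[rng.randrange(tag_length)] = rng.choice(AMINO_ACIDS)
        score = score_fn(protein + "".join(candidate))
        if score > best_score:  # keep only strict improvements
            tag, best_score = candidate, score
    return "".join(tag), best_score


if __name__ == "__main__":
    protein = "MKVLILAVFGWLLIVA"  # made-up sequence, for illustration only
    tag, score = optimize_tag(protein, toy_solubility_score)
    print(tag, round(score, 3))
```

Because the toy score rewards D and E by construction, the search fills the tag with those residues; that outcome says nothing about the real model. In practice `score_fn` would be replaced by the prediction function of a regression model trained on solubility data.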