(548f) Using Machine Learning to Guide Protein Engineering for Biocatalysis | AIChE

Authors 

Han, X. - Presenter, National University of Singapore
Wang, X., National University of Singapore
Zhou, K., National University of Singapore

Developing biocatalysts
with high activity is an important goal for advancing many
biocatalytic processes, because highly active proteins have a lower
unit production cost. Several in vivo experimental strategies for enhancing
protein overexpression have been developed, but they are time-consuming and
expensive, and they often fail for reasons that are poorly understood (Idicula‐Thomas and Balaji,
2005; Magnan, et al., 2009). A better solution is highly desirable and may
ultimately come from a computational model that accurately predicts the activity of
any enzyme from its amino acid sequence and other input information.
Unfortunately, only limited protein activity data are currently available,
which makes such models difficult to develop. Because protein activity and
solubility are correlated for some proteins, the publicly available solubility
dataset can instead be used to develop models that predict protein solubility
from sequence. Such models could then serve as an indirect tool for predicting protein
activity from sequence. Predicting protein solubility from
sequence has been explored intensively in the literature, but all of the developed
models report solubility as a binary value, which is not suitable for
guiding experimental designs to improve protein solubility.

In this study, we first implemented a novel approach that predicts
protein solubility as a continuous numerical value instead of a binary one. We
combined this approach with various machine learning algorithms and achieved a
coefficient of determination (R²) of 0.4115 with the Support Vector
Machine (SVM) algorithm. Continuous solubility values are more meaningful
in protein engineering because they let researchers choose the proteins with
the highest predicted solubility for experimental validation. Binary values
cannot rank candidates in this way: with only two possible
values, many proteins share the same one. Using the SVM model as the predictor, we implemented two stochastic
optimization algorithms, a genetic algorithm and our own algorithm
for Optimization of Protein Solubility (OPS), to
maximize the predicted protein solubility when a small number of amino acid residues
may be added to the sequence. Ten proteins with low solubility were selected as
targets for solubility improvement. Most of the amino acids added by the optimization
algorithms were aspartate (D) and glutamate (E). In
our preliminary experimental validation, the solubility and activity of some of
these proteins were indeed substantially improved when the optimized
solubility enhancers (the amino acid sequences selected by the optimization
algorithms) were used. This study demonstrates that models built with
machine learning algorithms can be very useful in guiding protein engineering
for biocatalytic applications.
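
The optimization step described above can be illustrated with a minimal sketch. It is not the authors' Genetic Algorithm or OPS implementation, whose details are not given here. It assumes a simple random-mutation hill climb, a short tag appended to the C-terminus (the abstract does not say where residues are added), and a hypothetical toy scoring function, `toy_solubility_score`, standing in for the trained SVM regressor. All function names, parameters, and the example sequence are illustrative.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_solubility_score(sequence: str) -> float:
    """Hypothetical stand-in for a trained regressor (NOT the SVM model).

    Returns the fraction of acidic residues (D, E) in the sequence.
    """
    if not sequence:
        return 0.0
    return sum(sequence.count(aa) for aa in "DE") / len(sequence)


def optimize_tag(protein, score_fn, tag_length=5, iterations=200, seed=0):
    """Random-mutation hill climb over a tag appended to the protein.

    Assumption: the tag is appended at the C-terminus and has fixed length.
    Returns the best tag found and its predicted score.
    """
    rng = random.Random(seed)
    # Start from a random tag.
    tag = [rng.choice(AMINO_ACIDS) for _ in range(tag_length)]
    best = score_fn(protein + "".join(tag))
    for _ in range(iterations):
        # Mutate one randomly chosen position of the tag.
        candidate = tag.copy()
        candidate[rng.randrange(tag_length)] = rng.choice(AMINO_ACIDS)
        score = score_fn(protein + "".join(candidate))
        # Keep the mutation only if the predicted score does not drop.
        if score >= best:
            tag, best = candidate, score
    return "".join(tag), best


if __name__ == "__main__":
    target = "MKTLLVAGGAIL"  # made-up sequence, for illustration only
    tag, score = optimize_tag(target, toy_solubility_score)
    print(tag, round(score, 3))
```

Because the score is continuous, every candidate tag can be ranked against the current best, which is what makes this kind of search possible; a binary predictor would give most candidates the same value. In practice `score_fn` would be replaced by the prediction function of a trained regression model.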

References:

Idicula‐Thomas, S. and Balaji, P.V. (2005) Understanding
the relationship between the primary structure of proteins and its propensity
to be soluble on overexpression in Escherichia coli. Protein Sci.,
14, 582–592.

Magnan, C.N., et al. (2009) SOLpro: accurate sequence-based
prediction of protein solubility. Bioinformatics, 25, 2200–2207.