(344h) ProGAN: Protein Solubility Generative Adversarial Nets for Data Augmentation in DNN Framework | AIChE

Authors 

Han, X. - Presenter, National University of Singapore
Zhou, K., National University of Singapore
Wang, X., National University of Singapore

ProGAN: Protein Solubility Generative Adversarial Nets for Data Augmentation in DNN Framework

Generative Adversarial Networks (GANs), a state-of-the-art data augmentation approach in Artificial Intelligence (AI), offer an attractive way to generate realistic synthetic data, including biological data such as genes and proteins. GANs have achieved promising performance in image recognition and computer vision [1]. Such data augmentation algorithms may alleviate the shortage of data in biological fields where producing data is time-consuming and labour-intensive. Many machine learning models in the literature predict protein solubility from protein sequence, but their parameters are underdetermined because protein solubility data are insufficient. Data augmentation algorithms, however, have not yet been investigated for protein solubility prediction or for other problems in biological research. In this study, we developed a data augmentation algorithm, ProGAN, to improve the prediction of protein solubility with a deep neural network (DNN) regression model.

All proteins used in the study were downloaded from the eSol database [2], a unique database that reports solubility as a continuous value ranging from 0 to 1. A DNN regression model was developed, and the data augmentation algorithm ProGAN was then applied to this dataset. Finally, we evaluated model performance using the coefficient of determination (R²) and other metrics calculated from the predicted and actual solubility values.
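As a minimal sketch of the evaluation step described above, the snippet below computes R² and the mean squared error from paired actual and predicted solubility values. The function names and the four sample values are hypothetical, chosen only for illustration; they are not data or code from the study.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical solubility values in [0, 1], not taken from eSol.
actual = [0.10, 0.40, 0.80, 0.95]
predicted = [0.20, 0.35, 0.70, 0.90]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE = {mse:.5f}, R^2 = {r2:.4f}")
```

An R² reported as a percentage, as in this abstract, is this value multiplied by 100.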

First, we propose a DNN as a more accurate regression model for predicting protein solubility. A more accurate model helps guide experimental validation and further optimization of protein properties. Second, to tackle the problem of insufficient data, we propose a novel data augmentation algorithm, ProGAN, for improving the prediction of protein solubility. By designing the algorithm with a latent label for protein solubility in the generator and an auxiliary regressor in the discriminator, tuning the hyperparameters of ProGAN, and organizing the dataset, we achieved an R² value of 45.04%, about 10% higher than the previous study using the same dataset. The mean squared error (MSE) was further reduced to 0.0563 on average, and analysis of variance (ANOVA) demonstrated statistical significance. Data augmentation opens the door to applying machine learning models to biological data, where such models frequently fail to train well on small datasets. More importantly, this data augmentation algorithm provides a way to generate synthetic data in silico without hands-on experiments, which brings new insights into fields where collecting large amounts of data is challenging.
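The structure described above, a generator that takes a solubility label alongside its noise input and a discriminator that carries an auxiliary regressor, can be sketched as a forward pass. This is an illustrative sketch only: the abstract does not give the layer sizes, the protein feature encoding, or the loss function, so `NOISE_DIM`, `FEATURE_DIM`, `HIDDEN_DIM`, the single hidden layer, and the combined loss in `generator_loss` are all assumptions. No training is performed.

```python
import math
import random

# Assumed sizes; the abstract does not specify them.
NOISE_DIM = 8      # length of the random noise vector
FEATURE_DIM = 20   # length of the (hypothetical) protein feature vector
HIDDEN_DIM = 16


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Dense layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Generator:
    """Maps (noise, solubility label) to a synthetic feature vector."""

    def __init__(self, rng):
        self.hidden = init_layer(rng, HIDDEN_DIM, NOISE_DIM + 1)
        self.out = init_layer(rng, FEATURE_DIM, HIDDEN_DIM)

    def forward(self, noise, solubility):
        # The solubility label is appended to the noise input.
        h = [math.tanh(v) for v in linear(*self.hidden, noise + [solubility])]
        return [sigmoid(v) for v in linear(*self.out, h)]


class Discriminator:
    """Shared body with two heads: real/fake score and solubility regressor."""

    def __init__(self, rng):
        self.hidden = init_layer(rng, HIDDEN_DIM, FEATURE_DIM)
        self.real_head = init_layer(rng, 1, HIDDEN_DIM)
        self.solubility_head = init_layer(rng, 1, HIDDEN_DIM)

    def forward(self, features):
        h = [math.tanh(v) for v in linear(*self.hidden, features)]
        p_real = sigmoid(linear(*self.real_head, h)[0])
        solubility_pred = sigmoid(linear(*self.solubility_head, h)[0])
        return p_real, solubility_pred


def generator_loss(p_real, solubility_pred, target, reg_weight=1.0):
    """Assumed objective: fool the discriminator and match the target label."""
    adversarial = -math.log(p_real)
    regression = (solubility_pred - target) ** 2
    return adversarial + reg_weight * regression


rng = random.Random(0)
generator = Generator(rng)
discriminator = Discriminator(rng)

target_solubility = 0.7  # hypothetical label in [0, 1]
noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
fake_features = generator.forward(noise, target_solubility)
p_real, solubility_pred = discriminator.forward(fake_features)
loss = generator_loss(p_real, solubility_pred, target_solubility)
print(f"p_real = {p_real:.3f}, predicted solubility = {solubility_pred:.3f}")
```

In a design of this kind, the regression head gives the generator a training signal tied to the solubility label, so that each synthetic sample can be paired with a label when it is added to the training set of the downstream DNN regressor.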

References:

[1] Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In Advances in Neural Information Processing Systems, pp. 2672-2680. 2014.

[2] Niwa, Tatsuya, Bei-Wen Ying, Katsuyo Saito, WenZhen Jin, Shoji Takada, Takuya Ueda, and Hideki Taguchi. "Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins." Proceedings of the National Academy of Sciences 106, no. 11 (2009): 4201-4206.