Artificial Intelligence-Based Parametrization of Next Generation Systems Biology Models

Authors 

Karakoltzidis, A. - Presenter, Aristotle University of Thessaloniki
Karakitsios, S., Aristotle University of Thessaloniki
Sarigiannis, D., Aristotle University of Thessaloniki
Next-generation systems biology models (NGSB models) are built on a large scale, comprising hundreds of differential equations with millions of interactions and metabolic transformations. These models aim to approximate biological reality as closely as possible by incorporating all endogenous metabolites involved in a biological process, as well as those whose concentrations change concurrently and may be indirectly affected by a disruptor or other biological events. Such applications provide the basis for developing systems biology-based adverse outcome pathways (AOPs) and quantitative AOPs, tools that are valuable in 21st century risk assessment and decision-making procedures.

The construction of such a mathematical model requires a high level of expertise; in addition, a very large number of kinetic parameters, such as turnover numbers or Michaelis-Menten constants, must be properly estimated for the models to be feasible. It is well known that most turnover numbers have not yet been quantified by wet-lab experiments. The turnover numbers on which training and test sets rely are currently taken from the BRENDA (Jeske et al., 2019) and SABIO-RK (Wittig et al., 2012) databases. Besides providing the training basis, these databases feed the NGSB models with experimentally measured turnover numbers, leading to more accurate predictions. Experimental determination of unknown turnover numbers is time-consuming, which often prevents the completion of this task.

Consequently, training and test sets are built from the information in these databases, and Deep Neural Network (DNN) models are then trained and fine-tuned to estimate turnover numbers. The preprocessed, integrated dataset provides the basis for describing each enzyme and reaction component, including enzyme sequences, molecular fingerprints, and other chemical attributes derived from mol files. The endogenous metabolites participating in each reaction are described to the model through MACCS and PubChem fingerprints combined with Tanimoto similarity indexes. Mol files are obtained from the KEGG database (Kanehisa, 2002). Construction of the dataset is followed by the training process.
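As a minimal sketch of this featurization step (not the authors' actual pipeline), the snippet below computes MACCS fingerprints and a Tanimoto similarity for two metabolites using RDKit; the mol file names are hypothetical placeholders for files downloaded from KEGG, and PubChem fingerprints would require an additional tool not shown here.

```python
# Sketch: MACCS fingerprints and Tanimoto similarity for reaction metabolites.
# File names are hypothetical placeholders for KEGG mol files.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

def maccs_fingerprint(mol_path):
    """Load a mol file and return its 167-bit MACCS fingerprint."""
    mol = Chem.MolFromMolFile(mol_path)
    if mol is None:
        raise ValueError(f"Could not parse {mol_path}")
    return MACCSkeys.GenMACCSKeys(mol)

# Hypothetical substrate/product mol files for one reaction
fps = [maccs_fingerprint(p) for p in ["C00031.mol", "C00668.mol"]]

# Tanimoto similarity between the two metabolites
sim = DataStructs.TanimotoSimilarity(fps[0], fps[1])

# Convert one fingerprint to a dense vector usable as DNN input features
features = np.zeros((fps[0].GetNumBits(),), dtype=np.float32)
DataStructs.ConvertToNumpyArray(fps[0], features)

print(f"Tanimoto similarity: {sim:.3f}, feature length: {features.shape[0]}")
```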

TensorFlow (Abadi et al., 2016) and Keras (Chollet, 2015) are currently used for DNN model development. These tools provide a user-friendly interface and integrate well with other machine learning and deep learning libraries. Hyperparameter optimization was carried out by trial and error combined with a multicore search process developed in-house, which allowed a well-optimized model to be built with a proper set of parameters. The biological information of each enzyme is captured by producing FASTA files from its sequence; by applying the algorithm of Alley et al. (2019) and Natural Language Processing (NLP) techniques, numerical vectors representing the structure of each enzyme were produced. The model we developed predicts turnover numbers with an R² of 0.56. The methodology is independent of the organism for which kinetic parameters are predicted.
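For illustration, a Keras feed-forward regressor of the kind described above might look like the sketch below. The layer sizes, dropout rate, feature dimensions, and training settings are assumptions for demonstration only and are not the hyperparameters selected by the in-house optimization.

```python
# Illustrative sketch of a Keras DNN regressor for turnover numbers.
# Architecture and hyperparameters are assumed, not the authors' reported values.
import numpy as np
from tensorflow import keras

# Assumed input size: MACCS bits plus a sequence embedding vector
n_features = 167 + 1900

# Placeholder data: rows are enzyme-substrate pairs, target is log10(kcat)
X_train = np.random.rand(1000, n_features).astype(np.float32)
y_train = np.random.rand(1000, 1).astype(np.float32)

model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(1),   # regression output: predicted log10(kcat)
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse", metrics=["mae"])

model.fit(X_train, y_train, validation_split=0.1,
          epochs=20, batch_size=64, verbose=0)
```

In practice, a hyperparameter search (layer widths, dropout, learning rate) would replace the fixed values shown here, mirroring the trial-and-error plus multicore search described above.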

The procedure described is expected to support the parametrization of systems biology and NGSB models, as well as industrial and pharmaceutical applications involving enzymatic processes that can benefit from these types of models. An online version of these models will soon be made available, allowing users to run their own applications by submitting the sequences of the enzymes they are interested in to the already trained models; a GitHub repository will also give users access to the pre-trained models so they can incorporate them into their own code.

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., & Devin, M. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., & Church, G. M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16(12), 1315-1322.

Chollet, F. (2015). Keras. GitHub. https://keras.io

Jeske, L., Placzek, S., Schomburg, I., Chang, A., & Schomburg, D. (2019). BRENDA in 2019: a European ELIXIR core data resource. Nucleic Acids Res, 47(D1), D542-D549.

Kanehisa, M. (2002). The KEGG database. 'In Silico' Simulation of Biological Processes: Novartis Foundation Symposium 247.

Wittig, U., Kania, R., Golebiewski, M., Rey, M., Shi, L., Jong, L., Algaa, E., Weidemann, A., Sauer-Danzwith, H., & Mir, S. (2012). SABIO-RK—database for biochemical reaction kinetics. Nucleic Acids Res, 40(D1), D790-D796. https://doi.org/10.1093/nar/gkr1046