(28y) Enhancing Next Generation Systems Biology Models with Deep Learning for Initial Conditions Specification | AIChE

Authors 

Karakoltzidis, A. - Presenter, Aristotle University of Thessaloniki
Karakitsios, S., Aristotle University of Thessaloniki
Sarigiannis, D., Aristotle University of Thessaloniki
Developing mathematical models in systems biology requires determining numerous parameters, including turnover numbers, Michaelis-Menten constants, and initial conditions for the differential equations. Obtaining these parameters presupposes that in vitro or in vivo experiments have been conducted and their results published in peer-reviewed databases or scientific journals. Because such experiments are difficult and costly, and because the use of animals in research raises ethical concerns, an in silico methodology based on machine learning and Quantitative Structure-Activity Relationships (QSAR) has been developed to determine metabolite concentrations when experimental data are unavailable.
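
As an illustration of where these three kinds of parameters enter, consider a single irreversible enzymatic step under Michaelis-Menten kinetics. The symbols are illustrative and not taken from the authors' models: $k_{\mathrm{cat}}$ is the turnover number, $K_M$ the Michaelis-Menten constant, $[E]_T$ the total enzyme concentration, and $[S]_0$ the initial condition that must be supplied for the substrate.

```latex
\frac{d[S]}{dt} = -\,\frac{k_{\mathrm{cat}}\,[E]_T\,[S]}{K_M + [S]},
\qquad [S](0) = [S]_0
```

A model with hundreds of such equations needs one value like $[S]_0$ per metabolite before it can be integrated.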

Next Generation Systems Biology (NGSB) models are large-scale mathematical models containing hundreds of differential equations. These models can serve as the basis for constructing systems biology-based adverse outcome pathways (AOPs) and quantitative AOPs, as well as for studying human systems with all relevant parameters included. Their wide range of potential applications makes them a valuable tool in risk assessment, decision-making processes, and pharmaceutical and industrial contexts.

Defining the initial conditions for the many differential equations in a systems biology model is challenging: it requires numerical values for the concentrations of hundreds of metabolites, and the experiments needed to quantify them have not always been performed. As a result, many differential equations may lack initial values during the modelling process. To address this challenge, we have developed a methodology that determines each concentration based on the number of available metabolic pathways and the initial concentrations of neighbouring nodes in the pathway. The methodology currently uses only pathways from the KEGG database (Kanehisa et al., 2017), though we expect to integrate other similar databases in subsequent versions of the software.

To begin, specific datasets are prepared. These contain information on the endogenous metabolites involved, molecular fingerprints, Tanimoto similarity indices, and other data extracted from MOL files, all of which are fed into the constructed pipeline. The MOL files are obtained from the KEGG database (Kanehisa et al., 2017) and are loaded into the pipeline using the rcdk package (Guha & Cherto, 2017). Information about the initial concentrations of each endogenous metabolite is obtained from the SABIO-RK database (Wittig et al., 2012).
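
The Tanimoto similarity mentioned above can be sketched as follows. This is a minimal illustration in Python rather than the R pipeline described here; the function name, the representation of a fingerprint as a set of on-bit indices, and the example bit values are all assumptions made for the example.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Defined as |A intersect B| / |A union B|.
    """
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit indices for two metabolites (not real KEGG data).
fp_query = {1, 4, 7, 9}
fp_neighbour = {1, 4, 8, 9, 12}

similarity = tanimoto(fp_query, fp_neighbour)
print(f"Tanimoto similarity: {similarity:.2f}")
```

A value of 1 means the two fingerprints set exactly the same bits, and a value of 0 means they share none.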

After pre-processing the datasets, a deep neural network model is developed using the neuralnet package (Günther & Fritsch, 2010) within the R programming environment. For each metabolic pathway entered into the pipeline, a model is trained on the known concentrations in the surroundings of the corresponding metabolite. These models are trained in an unsupervised manner, run in parallel with each other, and reject themselves if the R² value falls below 0.7. Prior to unsupervised training, an in-house algorithm is applied to parameterize the model. If a neighbouring node of the metabolite of interest does not have a known concentration, the neighbours of that node are examined in turn. This node search continues outward until concentration information becomes available.
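
The node search and the R² rejection rule can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the authors' R implementation: the pathway is assumed to be an adjacency dictionary, the search is assumed to proceed breadth-first and stop at the first distance where any concentration is known, and all names and values are hypothetical.

```python
def nearest_known_concentrations(pathway, known, start):
    """Search outward from `start`, one neighbour shell at a time.

    Returns the known concentrations found in the closest shell that
    contains any, or an empty dict if none is reachable.
    """
    visited = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        found = {}
        for node in frontier:
            for neighbour in pathway.get(node, ()):
                if neighbour in visited:
                    continue
                visited.add(neighbour)
                next_frontier.append(neighbour)
                if neighbour in known:
                    found[neighbour] = known[neighbour]
        if found:
            return found
        frontier = next_frontier
    return {}


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


def accept_model(observed, predicted, threshold=0.7):
    """Keep a model only if its R^2 reaches the threshold (0.7 in the text)."""
    return r_squared(observed, predicted) >= threshold


# Hypothetical pathway: A - B, with B also linked to C and D.
pathway = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
known = {"C": 0.12, "D": 0.50}  # hypothetical concentrations

# B has no known value, so the search moves on to B's neighbours.
neighbours = nearest_known_concentrations(pathway, known, "A")

observed = [1.0, 2.0, 3.0, 4.0]
good_fit = [1.1, 1.9, 3.2, 3.8]
poor_fit = [2.5, 2.5, 2.5, 2.5]
print(neighbours, accept_model(observed, good_fit), accept_model(observed, poor_fit))
```

Stopping at the first shell with data is one possible reading of "until concentration information becomes available"; a pipeline could equally collect values from several shells and weight them by distance.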

The presented methodology has diverse applications in biology, as it enables the determination of initial concentrations not only for endogenous metabolites but also for other compounds. Moreover, it ensures that every differential equation in the constructed mathematical model has an initial condition. It can also be used in industrial and pharmaceutical settings to determine concentrations based on reactants and products, particularly in systems with many degrees of freedom. In the near future, this application will be released as a package for the R computing environment.


Guha, R., & Cherto, M. R. (2017). rcdk: Integrating the CDK with R. In: CRAN.

Günther, F., & Fritsch, S. (2010). Neuralnet: training of neural networks. R J., 2(1), 30. https://svn.r-project.org/Rjournal/trunk/html/_site/archive/2010/RJ-2010-006/RJ-2010-006.pdf

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res, 45(D1), D353-D361. https://doi.org/10.1093/nar/gkw1092

Wittig, U., Kania, R., Golebiewski, M., Rey, M., Shi, L., Jong, L., Algaa, E., Weidemann, A., Sauer-Danzwith, H., & Mir, S. (2012). SABIO-RK—database for biochemical reaction kinetics. Nucleic Acids Res, 40(D1), D790-D796. https://doi.org/10.1093/nar/gkr1046