(42c) Using Machine Learning to Map Gene Regulatory Networks | AIChE

Authors 

Bandodkar, P. - Presenter, Texas A&M University
Reeves, G., Texas A&M University
Haque, S., NCSU
Rooney, L., NCSU
Cells respond to their environment through a variety of mechanisms. One of the most important is regulating the expression of protein-coding genes at the level of transcription. A set of proteins called transcription factors binds to regulatory regions of a gene and modulates its transcription levels. Typically, several transcription factors participate in regulating a gene, forming a network. Such Gene Regulatory Networks (GRNs) exist to buffer perturbations in the environment. This is especially important in the context of development, where cell differentiation must occur with high precision in location and timing to ensure that the body develops correctly.

Drosophila is a long germband insect, which means that its body segments are formed simultaneously during development. These segments arise from a gene expression cascade that begins with maternal proteins called morphogens, such as Bicoid (bcd), which are expressed as gradients along the length of the developing embryo and activate the zygotic genome. Maternal morphogens activate gap genes, such as Kruppel (Kr), which are expressed in large domains. Gap genes, along with maternal proteins, then activate pair-rule genes, such as even-skipped (eve), which are expressed in 7-8 stripes. The pair-rule genes in turn activate segment polarity genes, which ultimately establish the 14 segments found on the body of a fly.

The segmentation gene network, especially the gap gene network, is one of the best-studied GRNs in any organism. However, evidence in the literature suggests that maps of this network are still incomplete. Network components are typically identified by single-gene analysis, in which a gene is knocked out and the resulting phenotype is observed. While such techniques have worked well for identifying the major network components, the components responsible for compensatory regulation remain to be discovered. This is because an overtly perturbed genome, such as a knockout, loses several connections at once, which makes the discovery of novel components harder.

Our approach to discover such compensatory regulation is to use the Drosophila Genetic Reference Panel (DGRP): a panel of ~200 wild-caught fly lines with naturally varying genomes. The idea is that in such a panel, while the major network components would still be in place, compensatory regulation might be different due to flies being exposed to subtly different environments. This approach would require the acquisition of three datasets – genomic, transcriptomic, and spatial gene expression data for the entire panel. By correlating the shifts in gene expression patterns to the subtle variation in the genome and the transcriptome of the flies in the DGRP, we could discover novel regulation in the gap GRN.

In our preliminary work, we used 13 fly lines to quantify spatial gene expression data of two network components – Kr and eve – and correlated the differences in spatial position with Single Nucleotide Polymorphisms (SNPs) in the genome within a 20 kb region. SNPs that were statistically significant and not in the known regulatory region of the genes were selected to be placed in a reporter gene construct to test for relevance. This reporter gene analysis led to the identification of a putative regulatory region (PRR) for eve and Kr. We also identified a potential transcription factor, pangolin (pan), that binds to the PRR. We confirmed this by observing shifts in the expression patterns of Kr and eve in a pan RNAi knockdown line. We also obtained the transcriptome of these fly lines and found differentially expressed genes that correlated with the shifts in spatial gene expression and with the SNPs in the genome. The relevance of these genes as potential nodes in the gap GRN was not probed further in the preliminary studies.
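The SNP-to-position correlation described above can be illustrated with a minimal sketch. It is not the analysis used in the preliminary work: it assumes each line carries a 0/1 allele call per SNP, that each line is summarized by one expression boundary position (for example, as a fraction of embryo length), and that significance is judged with a simple permutation test. The function name `snp_position_association` and its parameters are hypothetical.

```python
import numpy as np

def snp_position_association(genotypes, positions, n_perm=2000, seed=0):
    """Correlate allele calls with an expression boundary position.

    genotypes: (n_lines, n_snps) array of 0/1 allele calls (assumed encoding).
    positions: (n_lines,) boundary position per line (assumed summary statistic).
    Returns (r, p): Pearson r per SNP and a two-sided permutation p-value.
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(positions, dtype=float)
    gc = g - g.mean(axis=0)                      # center each SNP column
    g_norm = np.sqrt((gc ** 2).sum(axis=0))
    g_norm[g_norm == 0] = np.nan                 # monomorphic SNPs are uninformative

    def corr(v):
        vc = v - v.mean()
        return (gc.T @ vc) / (g_norm * np.sqrt((vc ** 2).sum()))

    r = corr(y)
    rng = np.random.default_rng(seed)
    exceed = np.zeros(g.shape[1])
    for _ in range(n_perm):
        # Shuffle positions across lines to break any genotype-phenotype link.
        exceed += np.abs(corr(rng.permutation(y))) >= np.abs(r) - 1e-12
    p = (exceed + 1) / (n_perm + 1)
    p[np.isnan(r)] = np.nan
    return r, p

# Synthetic illustration only: 13 lines, 4 SNPs, SNP 0 shifts the boundary.
lines = np.arange(1, 14)
demo_genotypes = np.array([[(i >> k) & 1 for k in range(4)] for i in lines])
demo_positions = 0.40 + 0.02 * demo_genotypes[:, 0] + 0.001 * np.sin(lines)
demo_r, demo_p = snp_position_association(demo_genotypes, demo_positions)
```

With only 13 lines, a test like this has limited power and would need correction for the number of SNPs tested, which is one reason the candidate SNPs were followed up experimentally with reporter constructs.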

The genomes of all 200 fly lines have already been sequenced, and the pipeline to extract transcript levels from RNAseq experiments is in place. We have now developed techniques to reliably extract quantitative spatial gene expression data from a large number of stained embryos without manual supervision. Unsupervised learning algorithms, such as Relational Fuzzy C-Means (RFCM) clustering, would then be employed to reveal clusters of similar transcripts in the transcriptome and of SNPs in the naturally varying genome of the DGRP flies. These clusters would then be correlated with shifts in gene expression patterns to reveal novel components and targets of the gap GRN.
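As a rough illustration of the clustering step, the following is a minimal NumPy sketch of relational fuzzy c-means, which clusters objects from a pairwise dissimilarity matrix instead of from feature vectors. It makes several assumptions that are not stated in the abstract: transcripts are compared by 1 minus the Pearson correlation of their levels across fly lines, the fuzzifier is m = 2, and memberships are initialized at random. The names `correlation_dissimilarity` and `rfcm` are hypothetical.

```python
import numpy as np

def correlation_dissimilarity(X):
    """Rows of X are transcripts, columns are fly lines (assumed layout).

    1 - Pearson r is proportional to the squared Euclidean distance between
    z-scored rows, so it is a suitable relational input for RFCM.
    """
    R = 1.0 - np.corrcoef(X)
    R = np.clip(R, 0.0, None)        # remove tiny negative round-off values
    np.fill_diagonal(R, 0.0)
    return R

def rfcm(R, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Relational fuzzy c-means on a symmetric (n, n) dissimilarity matrix R.

    Returns U with shape (n_clusters, n); each column sums to 1.
    """
    R = np.asarray(R, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((n_clusters, R.shape[0]))
    U /= U.sum(axis=0)                               # random fuzzy partition
    for _ in range(n_iter):
        W = U ** m
        V = W / W.sum(axis=1, keepdims=True)         # cluster weights, rows sum to 1
        RV = V @ R                                   # (R v_i)_k for each cluster i
        # d_ik = (R v_i)_k - 0.5 * v_i^T R v_i
        D = RV - 0.5 * np.sum(RV * V, axis=1, keepdims=True)
        D = np.maximum(D, 1e-12)                     # guard against division by zero
        inv = D ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U

# Synthetic illustration only: six "transcripts" following two patterns.
t = np.arange(10.0)
alt = t % 2
demo_X = np.array([t, 2 * t + 1, t + 0.1 * alt, alt, 3 * alt - 1, alt + 0.01 * t])
demo_U = rfcm(correlation_dissimilarity(demo_X), n_clusters=2)
demo_labels = demo_U.argmax(axis=0)
```

The fuzzy memberships in `demo_U` could then be compared against shifts in expression boundaries, in place of the hard labels, so that a transcript that sits between two clusters is not forced into either one.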

Other studies on the DGRP have shown that the ~200 fly lines provide enough statistical power to correlate changes in the genotype with changes in the phenotype. Therefore, we hypothesize that using data from all 200 fly lines would reveal network components responsible for compensatory regulation in the segmentation gene network. This technique of using natural variation could then potentially be applied to other biological networks for which high-resolution data can be obtained.