(601b) An Extended Data Mining Method for Identifying Differentially Expressed Assay-Specific Signatures in Functional Genomic Studies | AIChE

Authors 

Rollins, D. - Presenter, Iowa State University
Teh, A. - Presenter, Iowa State University


Microarray data sets provide relative expression levels for thousands of genes under a comparatively small number of experimental conditions, called assays. Data mining techniques are used to extract specific information about genes as they relate to the assays. The multivariate statistical technique of principal component analysis (PCA) has proven useful as a basis for effective data mining methods. This talk extends the PCA approach of Rollins et al. (2006) to rank the genes of a microarray data set that are expressed most differently between two biologically different groupings of assays. The method is evaluated on real and simulated data and compared to a current approach on the basis of false discovery rate (FDR) and statistical power (SP), the ability to correctly identify important genes. This work developed and evaluated two new test statistics based on PCA and compared them to a popular method that is not PCA based. Both test statistics were found to be effective in three case studies: (i) exposing E. coli cells to two different ethanol levels; (ii) application of myostatin to two groups of mice; and (iii) a simulated data study derived from the properties of (ii). The proposed method (PM) effectively identified critical genes in these studies based on comparison with the current method (CM). The simulation study supports higher identification accuracy for PM over CM for both proposed test statistics when the gene variance is constant, and for one of the test statistics when the gene variance is non-constant. PM compares quite favorably to CM, with lower FDR and much higher SP. Thus, PM can be quite effective in producing accurate signatures from large microarray data sets for differential expression between assay groups identified in a preliminary step of the PCA procedure, and it is therefore recommended for these applications.

It is well known that living organisms have complicated gene structures.
However, although major advances have been made in recent years, understanding of the biological function of each individual gene is still quite limited. Active research is strongly focused on understanding the behavior of genes as well as the highly complex metabolism and regulatory networks inside living cells. This effort falls under a field of molecular biology called functional genomics (FG). Experimental techniques are widely applied in at least three areas of FG: transcriptomics, proteomics, and metabolomics. The combination of leading scientific techniques with powerful mathematical and statistical tools for data analysis makes promising the task of identifying the important parts of the transcriptome, proteome, and metabolome that correspond to a biological effect. Typical studies in these areas seek to identify the behavior and responses of species under various genetic backgrounds and environmental factors (i.e., assays). Several high-technology techniques are applied in the FG field to advance understanding of the transcriptional response of many organisms to various environmental perturbations. One technique adopted in this field is a multiplex technology called the DNA microarray. A newer technique that is becoming popular and will probably displace array-based measurement in FG is next-generation sequencing of RNA (RNA-seq). These techniques can generate data sets consisting of the expression levels of thousands of genes, providing a wealth of information that is hidden by high noise levels, low signal levels, and a small number of experimental units relative to the number of genes studied. More specifically, because the data set contains far more genes than assays, analytical techniques are needed that identify genes accurately when the number of gene candidates is much greater than the number of experimental runs.
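The "many genes, few assays" setting described above can be illustrated with a minimal sketch. It assumes genes are treated as observations (rows) and assays as variables (columns), so that PCA yields one loading per assay. The function name `assay_loadings`, the simulated data, and the sign-based grouping rule are illustrative assumptions, not the procedure of Rollins et al. (2006).

```python
import numpy as np

def assay_loadings(X, n_components=2):
    """PCA with genes as rows (observations) and assays as columns (variables).

    Returns the assay loadings on the leading components and the fraction
    of total variance that each of those components explains.
    """
    Xc = X - X.mean(axis=0, keepdims=True)  # center each assay column
    # The thin SVD is cheap because the number of assays is small.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt[:n_components].T, explained[:n_components]

# Simulated data (an assumption, not taken from the case studies):
# 200 genes, 6 assays in two groups of 3; the first 30 genes are
# low in group A (columns 0-2) and high in group B (columns 3-5).
rng = np.random.default_rng(0)
n_genes, n_de = 200, 30
X = rng.normal(size=(n_genes, 6))
X[:n_de, :3] -= 2.0
X[:n_de, 3:] += 2.0

loadings, explained = assay_loadings(X)

# Assumed grouping rule: assays whose PC1 loadings share a sign
# are candidates for the same assay group.
groups = np.sign(loadings[:, 0])
print(loadings[:, 0].round(2), explained.round(2))
```

With a strong group contrast like the simulated one, the first component tends to separate the two sets of assays. On real data the relevant component may not be the first, and the loadings would need inspection before any grouping is accepted.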
To achieve this objective, traditional statistical methods such as principal component analysis (PCA), the focus of this talk, are being retrofitted to provide effective statistical inference in the challenging context of microarray data analysis. Other methods used in this field include linear model analysis, Bayesian methods, and neural network analysis (NCA). Thus, statistics plays a critical role through the development of methodologies that give high statistical power (SP) (i.e., accurate identification) and low false discovery rate (FDR) (i.e., low misidentification). To this end, this talk introduces two new PCA-based statistics for ranking genes by differential expression between two PCA-identified assay groups. This work extends the technique introduced by Rollins et al. (2006), which determines gene rank for a single PCA-identified assay group. The proposed method (PM) is thus aimed at finding the genes with high expression levels in one group and low expression levels in the other. The PM first uses PCA to establish the existence of the assay groupings of interest. Then, using the results that established the grouping, the differential contribution of each gene is determined using a statistic based on eigenvalues. This talk discusses and evaluates two such statistics. The first, which we call Tdiff, is the group-averaged difference of eigenvalue linear combinations. The second, which we call Tscaled, divides Tdiff by its estimated pooled standard deviation. The genes are ranked by the absolute value of these statistics, largest first. The PM is evaluated against the ranking determined by the well-known Student's t-statistic, which we call Tpooled in this work. We refer to Tpooled as the current method (CM); it is actually a special case of the PM that weights each assay equally within each group. Note that for the CM the assay members of each group are established not from the data but by a priori considerations.
In contrast, for the PM the data drive both the assay weights and the assignment of assays to groups. The CM and PM are applied in the following three case studies to compare their effectiveness (i.e., power) in identifying assay-specific signatures: (i) exposure of E. coli cells to two different levels of ethanol concentration; (ii) the use of myostatin as an inhibitor of skeletal muscle growth for five 5-week-old myostatin-treated and non-treated mice; and (iii) a simulation study based on statistical properties of the second case study. This talk will give a brief review of PCA and connect it to our application in FG data analysis. Next, we will derive and present the test statistics of the CM and PM. These test statistics will then be evaluated and compared in the three case studies.
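The comparison described above can be sketched on simulated data. The abstract does not give the formulas for Tdiff or Tscaled, so they are not reproduced here. Only `t_pooled` is a standard quantity, the two-sample Student's t-statistic with pooled variance. The function `weighted_diff` is a hypothetical stand-in that shows how equal assay weights recover the plain group-mean difference, and the choice of weights (absolute PC1 loadings) is an assumption. The function `fdr_and_power` computes the empirical false discovery proportion and the fraction of truly differential genes recovered among the top-ranked genes, which is one simple way to score a ranking when the truth is known.

```python
import numpy as np

def t_pooled(X, idx_a, idx_b):
    """Student's two-sample t-statistic per gene (row), pooled variance."""
    A, B = X[:, idx_a], X[:, idx_b]
    na, nb = A.shape[1], B.shape[1]
    sp2 = ((na - 1) * A.var(axis=1, ddof=1)
           + (nb - 1) * B.var(axis=1, ddof=1)) / (na + nb - 2)
    return (A.mean(axis=1) - B.mean(axis=1)) / np.sqrt(sp2 * (1 / na + 1 / nb))

def weighted_diff(X, idx_a, idx_b, w):
    """Hypothetical weighted group-mean difference (NOT the authors' Tdiff).

    w holds one non-negative weight per assay; weights are normalized
    within each group, so equal weights give the ordinary mean difference.
    """
    wa = w[idx_a] / w[idx_a].sum()
    wb = w[idx_b] / w[idx_b].sum()
    return X[:, idx_a] @ wa - X[:, idx_b] @ wb

def fdr_and_power(stat, true_de, top_k):
    """Empirical false discovery proportion and power among the top_k genes."""
    ranked = np.argsort(-np.abs(stat))[:top_k]
    hits = true_de[ranked].sum()
    return 1 - hits / top_k, hits / true_de.sum()

# Simulated study (all settings are assumptions): 500 genes, two groups of
# 5 assays, constant gene variance, and the first 25 genes shifted in group B.
rng = np.random.default_rng(1)
n_genes, n_de = 500, 25
idx_a, idx_b = np.arange(0, 5), np.arange(5, 10)
X = rng.normal(size=(n_genes, 10))
X[:n_de, 5:] += 1.5
true_de = np.zeros(n_genes, dtype=bool)
true_de[:n_de] = True

# Assumed data-driven weights: absolute PC1 loading of each assay.
Xc = X - X.mean(axis=0, keepdims=True)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
w = np.abs(Vt[0])

t_cm = t_pooled(X, idx_a, idx_b)
t_w = weighted_diff(X, idx_a, idx_b, w)
for name, stat in [("t_pooled", t_cm), ("weighted_diff", t_w)]:
    fdr, power = fdr_and_power(stat, true_de, top_k=n_de)
    print(f"{name}: FDR={fdr:.2f}, power={power:.2f}")
```

Because the number of genes selected equals the number of truly differential genes in this setup, the two reported quantities sum to one. A fuller study would vary the cutoff, the effect size, and the variance structure, including the non-constant variance case mentioned in the abstract.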