(582e) Microarray Data Mining: A Novel Optimization-Based Iterative Clustering Approach to Uncover Biologically Coherent Structures
AIChE Annual Meeting
2006
Computational Molecular Science and Engineering Forum
Computational Biology: Systems Modeling I
Thursday, November 16, 2006 - 4:27pm to 4:45pm
The clustering of genes based on DNA microarray data often forms the cornerstone for further studies in areas such as motif alignment and regulatory network structure. Hence, it is important first to have an intuitive tool to assess and refine the biological coherence of these groupings. In our study, we recognize that placing genes into strongly coherent clusters may not be achievable with a single round of clustering, no matter how robust the clustering algorithm has been shown to be. We build upon a previously proposed clustering algorithm, EP_GOS_Clust [1], which has been shown to be more rigorous than many algorithms commonly used to cluster DNA microarray data. EP_GOS_Clust is based on a Mixed-Integer Nonlinear Programming (MINLP) formulation and employs the Global Optimum Search (GOS) algorithm [2, 3], augmented with enhanced positioning to expedite the solution of large-scale clustering problems. Starting from the original EP_GOS_Clust approach, we propose a novel methodology [4] to uncover the most biologically coherent structures obtainable from DNA microarray data. As the decision metric in our approach, we use the P-values, based on the hypergeometric distribution, returned by Gene Ontology resources. After an initial clustering run carried out up to the optimal number of clusters, a minimum cut-off on the P-values is applied to refine the pre-clusters for the next iteration. The process terminates when either the number of clusters or the P-value distribution saturates. In measuring the biological coherence of the clusters, we assert that the appropriate performance indicators for this study are the P-values of the resultant clusters and the proportion of genes that are grouped into clusters of high coherence quality.
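The P-value screening step described above can be illustrated with a minimal Python sketch. It is not the EP_GOS_Clust implementation and does not reproduce the authors' refinement rule. It assumes the standard one-sided hypergeometric enrichment test, and it assumes that clusters whose P-value falls at or below the cut-off are retained while the genes of the remaining clusters are pooled for the next clustering round. The function names (`hypergeom_pvalue`, `split_by_cutoff`, `coherent_fraction`) and the cut-off value are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, cluster_size, hits):
    """Probability of seeing at least `hits` annotated genes in a cluster.

    population   -- total number of genes in the dataset
    annotated    -- genes in the dataset carrying a given annotation term
    cluster_size -- number of genes in the cluster
    hits         -- genes in the cluster carrying the term
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


def split_by_cutoff(clusters, pvalues, cutoff):
    """Assumed refinement rule: keep clusters with P-value <= cutoff and
    pool the genes of all other clusters for the next clustering round."""
    kept, pool = [], []
    for genes, p in zip(clusters, pvalues):
        if p <= cutoff:
            kept.append(genes)
        else:
            pool.extend(genes)
    return kept, pool


def coherent_fraction(kept, total_genes):
    """Proportion of all genes that sit in the retained clusters."""
    return sum(len(genes) for genes in kept) / total_genes


if __name__ == "__main__":
    # Toy numbers only; they are not taken from the yeast datasets.
    clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
    pvalues = [
        hypergeom_pvalue(10, 3, 3, 3),  # all three cluster genes annotated
        hypergeom_pvalue(10, 3, 2, 0),  # no enrichment, P-value is 1
    ]
    kept, pool = split_by_cutoff(clusters, pvalues, cutoff=0.01)
    print(kept, pool, coherent_fraction(kept, 5))
```

In an iterative scheme of this kind, the pooled genes would be re-clustered and screened again until the number of clusters or the P-value distribution stops changing. The two indicators named above correspond to the per-cluster P-values and the value returned by `coherent_fraction`.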
We test our proposed algorithm on two large datasets of gene expression patterns from the yeast Saccharomyces cerevisiae. The first dataset is obtained from experiments designed to examine the role of Ras2 and Gpa2 in effecting transcriptional changes in the response of yeast cells to glucose, and consists of 5652 genes, each with 24 feature points. Glucose, however, has more far-ranging effects on yeast than those mediated by the Ras signaling pathway alone. Besides stimulating the Ras and Gpa2 pathways, glucose addition affects at least two additional pathways in yeast. The second dataset is derived from experiments designed to study the roles of the Ras, Snf1, and Sch9 proteins in effecting transcriptional changes in the yeast cellular response to glucose. It consists of 5657 genes, each with 75 feature points. We show that our method saturates after a certain number of iterations, resulting in a higher overall level of biological coherence and a significantly improved proportion of genes placed into quality clusters. As such, we believe our work to be valuable for cluster analysis, in particular for the detailed study of cluster members for the purpose of identifying gene regulatory modules and predicting common motif structures.
[1]-Tan, M. P.; Broach, J. R.; Floudas, C. A.; A Novel Mixed-Integer Nonlinear Optimization-Based Clustering Approach: Global Optimum Search with Enhanced Positioning (EP_GOS_Clust) and Determination of Optimal Number of Clusters; 2006; In Preparation
[2]-Floudas, C. A.; Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications; Oxford University Press; 1995
[3]-Floudas, C. A.; Aggarwal, A.; Ciric, A. R.; Global Optimum Search for Non Convex NLP and MINLP Problems; Comp. & Chem. Eng.; 13(10); 1989; pp. 1117-1132
[4]-Tan, M. P.; Broach, J. R.; Floudas, C. A.; Microarray Data Mining: A Novel Optimization-Based Iterative Approach to Uncover Biologically Coherent Structures; 2006; In Preparation