(118c) Comparison of Three High-Level Microarray Statistical Analysis Methods for Disease Mechanism Identification | AIChE

Authors 

Schultz, D. - Presenter, Aristotle University of Thessaloniki
Frydas, I., Aristotle University
Karakitsios, S., Aristotle University of Thessaloniki
Sarigiannis, D., Aristotle University
Transcriptomics investigates the transcriptome, the complete set of RNA transcripts that are produced by the genome under specific conditions. To date, integrated omics technologies such as transcriptomics have been instrumental in a host of applications, including disease discovery, and the field has advanced considerably in recent years. One of the main techniques used in transcriptomics is the microarray, which peaked in use in the early 2010s and remains a sound, cost-effective option that can be paired with targeted RNA-Seq to obtain reliable and robust results. Microarrays hybridize fluorescently labeled transcripts to an array carrying a defined set of complementary short nucleotide oligomers (“probes”). The fluorescence intensity emitted from each probe location is indicative of the transcript abundance for that probe sequence. The genes associated with each probe can then be determined, and information such as differentially expressed genes (DEGs; genes whose expression differs between two arrays) can be compared among conditions. Microarrays allow the hybridization of tens of thousands of transcripts while greatly reducing the effort and cost per gene. However, omics techniques also generate huge quantities of data that require advanced statistical methods for analysis. A major issue in transcriptomic research is therefore the choice of method to process raw data, identify validated DEGs, and generate results with limited rates of Type I and Type II error.

We compared the results of three statistical techniques: Significance Analysis of Microarrays (SAM; R software), Linear Models for Microarray Data (LIMMA; R software), and a Moderated T-Test (Agilent GeneSpring™ software). To do so, we analyzed microarray (Agilent™) data from experiments within our lab that aim to detect the molecular mechanisms involved in metabolic disorders associated with environmental contaminants. In these experiments, zebrafish (Danio rerio) larvae were exposed to either a pharmaceutical (amiodarone) or an environmental pollutant (di-2-ethylhexyl phthalate (DEHP)) that led to transcriptomic alterations in metabolic pathways. Overall, this work contributes to efforts to generate reliable computational models and systems biology for the prediction of a host of metabolic diseases.

During experiments, larvae at 3 days post-fertilization (dpf) were exposed to a carrier control, amiodarone, or DEHP for 72 hours until 5 dpf. RNA was extracted and stored at -20°C until microarray analysis. Subsequently, samples were processed following the One-Color Microarray-Based Gene Expression Analysis (Low Input Quick Amp Labeling) Protocol version 6.9.1 supplied by Agilent Technologies. Samples were hybridized using Agilent’s Gene Expression Hybridization Kit (Agilent 5188-5242) to the Agilent SurePrint Gene Expression 4 x 60k Microarray Kit, design ID: 026437 (Agilent Technologies Inc., CA). Microarrays were read using the Agilent SureScan Microarray Scanner™ (G2600, Agilent Technologies, Inc., CA) and analyzed using the GE 1200 one-color protocol in the Agilent Feature Extraction™ software. After feature extraction, the raw data were exported and analyzed with a Moderated T-Test, SAM, and LIMMA, following the literature (Chrominski and Tkacz 2015).

While there is no shortage of methodologies available to analyze these large data sets, a few garner the most use. The most frequently used is R, an open-source statistical analysis and graphics software that is very popular because it is free and allows for far greater manipulation of data than commercially licensed software. However, R requires at least basic coding skills to tailor generic pipelines to the user’s data, and troubleshooting is generally via online help forums. Alternatively, Agilent GeneSpring™, a commercially available statistical analysis software, can be used. GeneSpring™ is more user friendly and has the option of tech support. However, GeneSpring™ is rigid in its data manipulation capabilities, does not always offer non-parametric options, and can be costly.

Presently, data were analyzed using Linear Models for Microarray Data (LIMMA, R software; Ritchie et al. 2015) and Significance Analysis of Microarrays (SAM; Tusher, Tibshirani, and Chu 2001). LIMMA uses linear model fitting to account for study design and applies empirical Bayes methods to calculate the gene-wise test statistic (the moderated t-statistic). This method lends more statistical power because it borrows information across genes. LIMMA, while it aims to overcome shortcomings based on parametricity assumptions, is still theoretically a parametric test. SAM was created specifically to determine the statistical significance of differences in gene expression between groups. It is a non-parametric adjustment of the t-test that applies an ad hoc modification before analysis. For each gene, SAM produces a test statistic and determines significance from the deviation of the observed value from the expected value, which is estimated using numerous permutations. SAM relies on non-parametric statistics mainly because microarray data are not normally distributed in the vast majority of cases (Tusher, Tibshirani, and Chu 2001). In GeneSpring™, a moderated t-test (MTT) was completed, which is modelled after LIMMA’s moderated t-test. However, this analysis is not model-based and therefore does not account for study design. Across all methods, preprocessing (background correction, normalization, probe filtering, outlier analysis, and batch effect analysis) was undertaken where possible.
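To make the difference between the two R-based statistics concrete, the following is a minimal sketch in Python (chosen for illustration only; it is not the code used in this work, which relied on R and GeneSpring™). It assumes a simple two-group comparison. The function names, the fudge factor `s0`, and the prior values `prior_var` and `prior_df` are hypothetical inputs supplied by hand; LIMMA estimates the prior from all genes by empirical Bayes and SAM chooses its fudge factor from the data, neither of which is attempted here.

```python
import math
import random
from statistics import mean, variance


def pooled_variance(x, y):
    """Pooled sample variance of two groups."""
    n1, n2 = len(x), len(y)
    return ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)


def sam_d(x, y, s0):
    """SAM-style relative difference: a constant s0 is added to the
    standard error so that low-variance genes do not dominate."""
    se = math.sqrt(pooled_variance(x, y) * (1 / len(x) + 1 / len(y)))
    return (mean(y) - mean(x)) / (se + s0)


def moderated_t(x, y, prior_var, prior_df):
    """Moderated t: the gene-wise variance is shrunk toward a prior
    variance, weighted by the prior and residual degrees of freedom."""
    df = len(x) + len(y) - 2
    post_var = (prior_df * prior_var + df * pooled_variance(x, y)) / (prior_df + df)
    return (mean(y) - mean(x)) / math.sqrt(post_var * (1 / len(x) + 1 / len(y)))


def expected_sam_d(control, treated, s0, n_perm=200, seed=0):
    """Expected ordered d-values: average of the sorted d-values obtained
    after randomly permuting the sample labels (same permutation for all genes)."""
    rng = random.Random(seed)
    n1 = len(control[0])
    rows = [c + t for c, t in zip(control, treated)]  # one row per gene
    idx = list(range(len(rows[0])))
    totals = [0.0] * len(rows)
    for _ in range(n_perm):
        rng.shuffle(idx)
        d_perm = sorted(
            sam_d([r[i] for i in idx[:n1]], [r[i] for i in idx[n1:]], s0)
            for r in rows
        )
        totals = [tot + d for tot, d in zip(totals, d_perm)]
    return [tot / n_perm for tot in totals]
```

In a SAM-style analysis, the sorted observed d-values would then be compared with these expected values, and genes whose observed value departs from the expected one by more than a chosen threshold would be called significant. With `s0 = 0` or `prior_df = 0`, both functions reduce to the ordinary pooled-variance t-statistic.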

Agilent microarrays contain tens of thousands of genes, including some that are predicted or as yet unknown. Table 1 outlines the total number of DEGs found by the different analysis methods as well as those that could be mapped to an EntrezID using org.Dr.eg.db. Compared to control treatments, the MTT detected the most DEGs in amiodarone treatments, while LIMMA detected the most DEGs in DEHP treatments. Using R, the lists of DEGs were compared across the methods. This comparison found that shared DEGs made up a larger proportion of the total detected by SAM than of the totals detected by the other methods. Moreover, when the MTT was omitted from the comparison, the total number of DEGs shared between LIMMA and SAM was higher and the shared percentage was much higher. If the percentage of overlapping DEGs is taken as indicative of true positives, these results suggest that SAM outperforms both LIMMA and MTT. Agreement between SAM- and LIMMA-detected DEGs is high (71.1–90.9%), but when the MTT is included, agreement across the groups falls drastically (39.3–83.0%). These data suggest that SAM and LIMMA are better than MTT at detecting true positives, in accordance with Chrominski and Tkacz (2015), who found similar results using artificially generated datasets. However, they also found a low level of agreement in real-world samples that were cross-analyzed. Therefore, relying on real-world data may not be useful; instead, statistical analysis decisions should be based on artificially generated and tested data.
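The list comparison described above can be sketched as follows. This is an illustrative Python example rather than the R code used in this work; the function name `overlap_summary` and the gene identifiers are hypothetical placeholders, not results from Table 1.

```python
def overlap_summary(deg_lists):
    """Return the DEGs shared by all methods and, for each method,
    the percentage of its own DEGs that are shared."""
    shared = set.intersection(*(set(ids) for ids in deg_lists.values()))
    pct = {
        method: 100.0 * len(shared) / len(set(ids)) if ids else 0.0
        for method, ids in deg_lists.items()
    }
    return shared, pct


# Made-up identifiers, used only to show the calculation.
degs = {
    "LIMMA": ["g1", "g2", "g3", "g4"],
    "SAM": ["g2", "g3", "g4"],
    "MTT": ["g3", "g4", "g5", "g6", "g7"],
}

shared_all, pct_all = overlap_summary(degs)
shared_two, pct_two = overlap_summary({m: degs[m] for m in ("LIMMA", "SAM")})
```

In this toy case, dropping one method from the comparison increases both the number of shared DEGs and the shared percentage for the remaining methods, which is the kind of pattern reported above for LIMMA and SAM.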

Some important considerations when planning an analysis are that GeneSpring™ MTT is very rigid in its data manipulation capabilities, particularly when analyzing non-parametric data, but is more user friendly. In contrast, R allows for greater manipulation of data and more robust preprocessing options and therefore lends much better insight into the nuances of the data at hand. Unfortunately, because it requires at least basic knowledge of coding and a decent amount of effort to manually tailor generic pipelines, there is an obvious barrier to entry. LIMMA is a very frequently used method with numerous help forums and generic pipelines available, and it can also account for covariates and experimental design through modeling. SAM requires preprocessing of the raw data with additional packages or software such as LIMMA or GeneSpring™. Overall, combining multiple methods may improve the confidence of results.

Unfortunately, many factors can influence these types of analyses. For example, LIMMA models the data across genes, so probe filtering based on variance is not recommended: the filtered probes would still lend information to the model, and removing them could limit the extrapolation. In contrast, in GeneSpring™ MTT, probe filtering is important because the more comparisons are made, the more information is lost to multiple testing corrections, as the algorithm is unable to use the intergene information as efficiently as LIMMA. The methodology of the batch effect corrections is another important distinction. While the batch effect application in the limma package is able to correct the data, the ComBat function in GeneSpring™ actually modifies the data; therefore, although the ComBat function is also available in R, it is not recommended for use. With regard to SAM, because it is based on permutation testing, it is difficult to obtain a completely reproducible outcome, i.e., each run of the analysis, even when set to a high number of permutations, provides a slightly different outcome, although in this data set the difference was generally less than 10 DEGs. Therefore, it is difficult to fully compare these procedures, and selection of an appropriate method is researcher- and lab-specific.
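The cost of additional comparisons under a multiple testing correction can be illustrated with a short Python sketch. It assumes the Benjamini-Hochberg procedure, which is an assumption made for illustration; the correction actually applied in each software package is not specified here. The function name `bh_adjust` and all p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Three hypothetical probes with small p-values.
signal = [0.001, 0.004, 0.01]

# Tested on their own (as after aggressive probe filtering).
filtered = bh_adjust(signal)

# Tested alongside 97 hypothetical uninformative probes (p = 1.0).
unfiltered = bh_adjust(signal + [1.0] * 97)[:3]
```

With only the three probes tested, all adjusted p-values stay at or below 0.01. With 97 uninformative probes added, the same three probes are adjusted to 0.1 or higher, so none would pass a 0.05 threshold.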
