(118c) Comparison of Three High-Level Microarray Statistical Analysis Methods for Disease Mechanism Identification
AIChE Annual Meeting
2022 Annual Meeting
Food, Pharmaceutical & Bioengineering Division
Biomolecular Engineering III
Monday, November 14, 2022 - 1:06pm to 1:24pm
Type I and Type II error. We compared three statistical techniques: Significance Analysis of Microarrays (SAM; R software), Linear Models for Microarray Data (LIMMA; R software), and a Moderated T-Test (Agilent GeneSpring™ software). To do so, we analyzed Agilent™ microarray data from experiments within our lab that aim to detect the molecular mechanisms involved in metabolic disorders associated with environmental contaminants. In these experiments, zebrafish (Danio rerio) larvae were exposed to either a pharmaceutical (amiodarone) or an environmental pollutant (di-2-ethylhexyl phthalate (DEHP)) that led to transcriptomic alterations in metabolic pathways. Overall, this work contributes to efforts to generate reliable computational models and systems biology for the prediction of a host of metabolic diseases.
During experiments, 3-day post-fertilization (dpf) larvae were exposed to a carrier control, amiodarone, or DEHP for 72 hours until 5 dpf. RNA was extracted and stored at -20°C until microarray analysis. Subsequently, samples were processed following the One-Color Microarray-Based Gene Expression Analysis (Low Input Quick Amp Labeling) Protocol version 6.9.1 supplied by Agilent Technologies. Samples were hybridized using Agilent's Gene Expression Hybridization Kit (Agilent 5188-5242) to the Agilent SurePrint Gene Expression 4 x 60k Microarray Kit, design ID: 026437 (Agilent Technologies Inc., CA). Microarrays were read using the Agilent SureScan Microarray Scanner™ (G2600, Agilent Technologies, Inc., CA) and analyzed using the GE 1200 one-color protocol in the Agilent Feature Extraction™ software. After feature extraction, the raw data were exported and analyzed with a Moderated T-Test, SAM, and LIMMA based on the literature (Chrominski and Tkacz 2015).
While there is no shortage of methodologies available to analyze these large data sets, a few garner the most use. Most frequently, R is used. R is open-source statistical analysis and graphics software that is very popular because it is free and allows far greater manipulation of data than commercially licensed software. However, R requires at least basic coding skills to tailor generic pipelines to the user's data, and troubleshooting is generally done via online help forums. Alternatively, Agilent GeneSpring™, a commercially available statistical analysis package, is used. GeneSpring™ is more user friendly and has the option of tech support. However, GeneSpring™ is rigid in its data manipulation capabilities, does not always offer non-parametric options, and can be costly.
In the present work, data were analyzed using Linear Models for Microarray Data (LIMMA, R software; Ritchie et al. 2015) and Significance Analysis of Microarrays (SAM; Tusher, Tibshirani, and Chu 2001). LIMMA uses linear model fitting to control for study design and applies empirical Bayes methods to calculate the gene-wise test statistic (the moderated t-statistic). This method lends more statistical power because it borrows information across genes. LIMMA, while it aims to overcome shortcomings based on parametric assumptions, is still theoretically a parametric test. SAM is a non-parametric adjustment of the t-test that applies an ad hoc modification before analysis. For each gene, SAM produces a test statistic based on the deviation of the observed value from the expected value, and it determines significance from that deviation using numerous permutations. SAM was created specifically to determine the statistical significance of differences in gene expression between groups. SAM is similar to a t-test, but it uses non-parametric statistics, mainly because microarray data are not normally distributed in the vast majority of cases (Tusher, Tibshirani, and Chu 2001). In GeneSpring™, a moderated t-test (MTT) was completed, which is modelled after LIMMA's moderated t-test. However, this analysis is not model-based and therefore does not account for study design. Across all methods, preprocessing (background correction, normalization, probe filtering, outlier analysis, and batch effect analysis) was undertaken where possible.
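The two gene-wise statistics described above can be illustrated with a minimal Python sketch. It is not the limma or samr implementation. The function names, the fixed SAM fudge factor `fudge_s0`, and the fixed prior values `prior_var` and `prior_df` are assumptions made for illustration; in practice both packages estimate these quantities from the full set of genes.

```python
import math


def sam_relative_difference(group_a, group_b, fudge_s0=0.0):
    """SAM-style statistic for one gene: (mean_a - mean_b) / (s + s0).

    s is the pooled standard error of the mean difference; fudge_s0 is the
    small positive constant SAM adds so that genes with tiny variance do not
    dominate. Here it is supplied by the caller (an assumption of this sketch).
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    sum_sq = (sum((x - mean_a) ** 2 for x in group_a)
              + sum((x - mean_b) ** 2 for x in group_b))
    s = math.sqrt((1 / n_a + 1 / n_b) * sum_sq / (n_a + n_b - 2))
    return (mean_a - mean_b) / (s + fudge_s0)


def moderated_variance(sample_var, resid_df, prior_var, prior_df):
    """Empirical-Bayes style shrinkage of one gene's variance toward a prior.

    The result is a degrees-of-freedom weighted average of the gene's own
    variance and a prior variance shared by all genes, which is how a
    moderated t-statistic borrows information across genes.
    """
    return (prior_df * prior_var + resid_df * sample_var) / (prior_df + resid_df)


if __name__ == "__main__":
    # Hypothetical log-intensities for one gene in two groups.
    treated = [2.0, 4.0]
    control = [1.0, 3.0]
    print(sam_relative_difference(treated, control))
    print(sam_relative_difference(treated, control, fudge_s0=1.0))
    print(moderated_variance(sample_var=4.0, resid_df=4, prior_var=1.0, prior_df=4))
```

A larger `fudge_s0` pulls the SAM statistic toward zero, and a larger `prior_df` pulls each gene's variance more strongly toward the shared prior.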
Agilent microarrays contain tens of thousands of genes, including some that are predicted or currently unknown. Table 1 outlines the total number of differentially expressed genes (DEGs) found by the different analysis methods, as well as those that could be mapped to an EntrezID using org.Dr.eg.db. Compared to control treatments, the MTT detected the most DEGs in amiodarone treatments, while LIMMA detected the most DEGs in DEHP treatments. Using R, the lists of DEGs were compared across the methods. This comparison found that the DEGs shared across methods made up a larger proportion of the total detected by SAM than of the totals detected by the other methods. When the MTT was omitted from the comparison, both the total number of DEGs shared between LIMMA and SAM and their percentage were much higher. Results suggest that, assuming the percentage of overlapping DEGs is indicative of true positives, SAM outperforms both LIMMA and MTT. There is high concordance between SAM- and LIMMA-detected DEGs (71.1–90.9%). However, when the MTT is included, the level of concordance across the groups falls drastically (39.3–83.0%). These data suggest that SAM and LIMMA are better than MTT at detecting true positives, which agrees with Chrominski and Tkacz (2015), who found similar results using artificially generated datasets. However, they also found a low level of agreement in real-world samples that were cross-analyzed. Therefore, relying on real-world data may not be useful; instead, statistical analysis decisions should be based on artificially generated and tested data.
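The overlap comparison described above amounts to intersecting gene lists and expressing the shared set as a percentage of each method's total. The following Python sketch shows that arithmetic under stated assumptions: the gene identifiers and the function name `shared_deg_summary` are hypothetical, and the original comparison was carried out in R on the real DEG lists.

```python
def shared_deg_summary(deg_lists):
    """Return {method: (shared_count, percent_of_method_total)}.

    deg_lists maps a method name to an iterable of gene identifiers.
    The shared set is the intersection across all methods supplied.
    """
    gene_sets = {name: set(genes) for name, genes in deg_lists.items()}
    shared = set.intersection(*gene_sets.values())
    summary = {}
    for name, genes in gene_sets.items():
        percent = 100.0 * len(shared) / len(genes) if genes else 0.0
        summary[name] = (len(shared), percent)
    return summary


if __name__ == "__main__":
    # Hypothetical DEG lists, for illustration only.
    degs = {
        "LIMMA": ["g1", "g2", "g3", "g4"],
        "SAM": ["g2", "g3"],
        "MTT": ["g3", "g5"],
    }
    print(shared_deg_summary(degs))
    # Dropping one method can only keep or enlarge the shared set.
    without_mtt = {name: genes for name, genes in degs.items() if name != "MTT"}
    print(shared_deg_summary(without_mtt))
```

Because the same shared set is divided by each method's own total, a method that reports fewer DEGs will show a higher percentage of shared genes, which is worth keeping in mind when reading overlap percentages as a proxy for true positives.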
Some important considerations when planning an analysis are that the GeneSpring™ MTT is very rigid in its data manipulation capabilities, particularly when analyzing non-parametric data, but is more user friendly. In contrast, R allows greater manipulation of data and more robust preprocessing options, and therefore lends much better insight into the nuances of the data at hand. Unfortunately, because it requires at least basic knowledge of coding and a fair amount of effort to manually tailor generic pipelines, there is an obvious barrier to entry. LIMMA is a very frequently used method with numerous help forums and generic pipelines available, and it can also account for covariates and experimental design through modeling. SAM requires preprocessing of the raw data with additional packages or software such as LIMMA or GeneSpring™. Overall, combining multiple methods may improve the confidence of results.
Unfortunately, many factors can influence these types of analyses. For example, LIMMA models the data across genes, so probe filtering based on variance is not recommended: the filtered probes would still lend information to the model, and their removal could limit the extrapolation. In contrast, in the GeneSpring™ MTT, probe filtering is important because the more comparisons there are, the more information is lost to multiple testing corrections, as the algorithm is unable to use the intergene information as efficiently as LIMMA. The methodology of the batch effect corrections is also an important distinction: while the batch effect application in the limma package is able to correct the data, the ComBat function in GeneSpring™ actually modifies the data and therefore, while the ComBat function is available in R, it is not recommended for use. With regard to SAM, because it is based on permutation testing, it is difficult to obtain a completely reproducible outcome, i.e., each analysis, even when set to a high number of permutations, will give a slightly different outcome each time it is run, although in this data set the difference was generally fewer than 10 DEGs. Therefore, it is difficult to fully compare these procedures, and selection of an appropriate method is researcher- and lab-specific.
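The run-to-run variation of permutation-based results noted above comes from the random draws, and fixing the random seed is one common way to make a given run repeatable. The Python sketch below shows this with a simplified two-group permutation test on the difference in means. It is not SAM's procedure, and the function name, data values, and parameters are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=1000, seed=None):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled n_perm times. Passing the same integer seed
    reproduces the same sequence of shuffles and hence the same p-value;
    seed=None draws a fresh sequence on each call.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical expression values for one gene.
    treated = [5.1, 4.8, 5.5, 5.0]
    control = [3.9, 4.2, 4.0, 4.4]
    print(permutation_p_value(treated, control, n_perm=500, seed=42))
    print(permutation_p_value(treated, control, n_perm=500, seed=42))  # same seed, same value
    print(permutation_p_value(treated, control, n_perm=500, seed=7))   # may differ slightly
```

In R, calling `set.seed()` before a permutation-based analysis plays a comparable role. A fixed seed makes a single run repeatable but does not remove the underlying sampling variability between seeds.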