(264g) A within Slide Normalization Strategy for cDNA Microarray Data Based on Ratio Statistics | AIChE


Authors 

Sengupta, T. - Presenter, Indian Institute of Technology Bombay
Bhushan, M., Indian Institute of Technology Bombay
Wangikar, P. P., Indian Institute of Technology Bombay

A microarray experiment is a high-throughput approach for simultaneously investigating the expression of a large number of genes. The number of genes probed in a single experiment can be on the order of thousands. Microarray technology is widely used for a variety of objectives in both fundamental and applied research. For example, microarrays are used in medical research to gain better insight into disease diagnosis [1].

Microarray experiments can be performed in different ways for different ends. In this work, we restrict our attention to microarray experiments performed using two samples, with the goal of studying gene expression in one sample relative to the other. Such experiments are typically performed as follows. First, the mRNA from each of the two samples is extracted, and each pool of mRNA is labeled with a fluorescent dye, usually one red and one green. The two pools are then allowed to hybridize with probes present at specific locations on a microarray slide. These locations are referred to as spots or features. Each spot contains numerous copies of a particular molecule. We further restrict our attention to cDNA microarrays, in which the molecules present at the spots are cDNA molecules and different spots contain different cDNA molecules. At a spot, copies of the same mRNA molecule from the two samples hybridize with copies of the cDNA molecule present at the spot; this mRNA molecule has a sequence of bases complementary to that of the cDNA molecule. Since each mRNA molecule is coded by a gene, a spot can be associated with a gene.

Owing to the fluorescent dyes, radiation at the red and green wavelengths emanates from each spot. The intensity of the radiation emitted at a wavelength is commensurate with the quantity of the mRNA molecule in the corresponding sample. The expression ratio for a spot is the ratio of the representative red and green intensity values emanating from the spot. It thus gives a measure of the relative abundance of the mRNA molecule associated with the spot and is typically used to ascertain relative gene expression. In the absence of errors, an expression ratio of unity would imply a constitutively expressed gene. However, as in any experiment, microarray experimental data is corrupted by noise, i.e., random errors. Sources of such errors in microarray experiments include cross-hybridization and dye bias [2]. Owing to such factors, a constitutively expressed gene can have an expression ratio different from unity.

Normalization refers to the quantitative assessment of the noise in microarray experimental data. Once normalization has been performed, it is possible to inspect expression ratios and draw meaningful conclusions, such as whether a gene is constitutively expressed or not. The focus of this work is within-slide normalization of two-sample microarray experimental data. The term “within-slide” implies that only data from a single microarray slide is analyzed for normalization purposes. Two-sample within-slide normalization strategies have wide applicability, since the usual goal in microarray experiments is to study relative gene expression rather than to assess absolute mRNA levels. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG) expression database (http://www.genome.jp/kegg/expression) has a large repertoire of two-sample microarray experimental data under various conditions for several organisms, and all such data can be normalized by within-slide normalization.
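As an illustration of the expression ratio defined above, the following minimal Python sketch computes a red-to-green ratio for each spot. The function name `expression_ratios` and the intensity values are hypothetical and chosen only for illustration; they are not taken from the work described here.

```python
from typing import List


def expression_ratios(red: List[float], green: List[float]) -> List[float]:
    """Return the red/green intensity ratio for each spot.

    `red` and `green` hold one representative intensity per spot.
    """
    if len(red) != len(green):
        raise ValueError("red and green must have one value per spot")
    ratios = []
    for r, g in zip(red, green):
        if g <= 0:
            # A ratio is only meaningful for a positive denominator.
            raise ValueError("green intensity must be positive")
        ratios.append(r / g)
    return ratios


# Hypothetical intensities for three spots (illustrative numbers only).
red = [1200.0, 800.0, 450.0]
green = [1000.0, 800.0, 900.0]
ratios = expression_ratios(red, green)
```

In this toy input, the second spot has equal red and green intensities, so its ratio is unity, which in the absence of errors would correspond to a constitutively expressed gene.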

Several within-slide normalization techniques exist in the literature. One class of techniques assumes that the experimental data is normally distributed. Examples are normalization techniques that use different types of two-sample t-test, such as the Welch t-test [3] and Student's t-test [4]. Another class, which includes Significance Analysis of Microarrays (SAM) [5] and Ranking Analysis of Microarrays (RAM) [6], does not assume that the data conforms to any particular distribution.

In this work, we focus on a particular parametric within-slide normalization technique, proposed by Chen et al. [7], which uses statistical hypothesis testing to identify constitutively expressed genes. The most important assumption in the model of Chen et al. [7] is that the red and green intensities emanating from a spot are normally distributed such that the tails of the density functions to the left of the origin are negligible. This assumption holds when the coefficient of variation, a model parameter, is lower than 0.3 for both the red and green wavelengths [7]. The coefficient of variation at a spot for a particular wavelength is defined as the ratio of the standard deviation to the mean of the fluorescent intensities emanating from the spot at that wavelength.

We propose a modification of the model of Chen et al. [7] that removes the requirement that the coefficient of variation be lower than 0.3. This is achieved by using truncated normal probability density functions instead of the normal density functions used by Chen et al. [7]. In the original strategy, the requirement that the coefficient of variation be less than 0.3 restricts the applicability of the strategy to data with a low coefficient of variation. Our modification removes this restriction and can thus be applied to expression data with a high coefficient of variation. A high coefficient of variation is expected when expression ratios assume a wide range of values owing to high variability in mRNA levels among genes. The only requirement of the proposed modification, which is also a requirement of the original strategy, is that the number of genes probed be large. When this is the case, the normality assumption on the data follows from the central limit theorem.
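To make the role of the coefficient of variation concrete, the sketch below computes how much probability mass a normal density with positive mean places to the left of the origin, and evaluates a normal density truncated at zero. It is a minimal illustration under stated assumptions, not the authors' formulation: the truncation point of zero and the function names are assumptions, and `cv` here refers to the parameters of the underlying (untruncated) normal. For a normal variable with mean mu > 0 and standard deviation cv * mu, the mass below zero is Phi(-1/cv), which is well under one percent at cv = 0.3 and grows quickly for larger cv.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def negative_tail_mass(cv: float) -> float:
    """P(X < 0) for X ~ Normal(mu, (cv * mu)^2) with mu > 0.

    The result depends only on cv and equals Phi(-1 / cv).
    """
    if cv <= 0:
        raise ValueError("cv must be positive")
    return normal_cdf(-1.0 / cv)


def truncated_normal_pdf(x: float, mu: float, cv: float) -> float:
    """Density of Normal(mu, (cv * mu)^2) truncated to x >= 0.

    Assumption: truncation at zero. mu and cv describe the parent
    normal, not the moments of the truncated distribution.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    if x < 0:
        return 0.0
    sigma = cv * mu
    z = (x - mu) / sigma
    parent = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    # Renormalize by the mass the parent normal keeps on x >= 0.
    return parent / (1.0 - negative_tail_mass(cv))


for cv in (0.1, 0.3, 1.0):
    print(cv, negative_tail_mass(cv))
```

The renormalization step is what keeps the truncated density a proper probability density even when the parent normal would place substantial mass on negative intensities.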

The original normalization strategy and the proposed modification were both applied to artificially generated data for which the coefficient of variation used to generate the data was known. The proposed approach successfully normalized the data irrespective of the value of the coefficient of variation used. In contrast, the normalization strategy of Chen et al. [7] failed to normalize data generated using a high coefficient of variation. This result thus validated the correctness of our normalization strategy. We also applied both normalization strategies to actual microarray experimental data from human lymphoblastoid cells [8] and human liver tissue [9]. For these data, our normalization strategy predicted high values of the coefficient of variation in certain cases. In such cases, the normalization approach of Chen et al. [7] was not able to normalize the data adequately.
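The following sketch shows one way artificial intensities with a known coefficient of variation might be generated, and why a plain normal model becomes problematic as that coefficient grows. It is not the data generator used in this work; the function names, the mean of 1000, the sample size, and the seed are all assumptions made for illustration.

```python
import random
from typing import List


def simulate_intensities(n_spots: int, mean: float, cv: float,
                         seed: int = 0) -> List[float]:
    """Draw n_spots intensities from Normal(mean, (cv * mean)^2)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sigma = cv * mean
    return [rng.gauss(mean, sigma) for _ in range(n_spots)]


def fraction_nonpositive(values: List[float]) -> float:
    """Fraction of simulated intensities that are zero or negative."""
    return sum(1 for v in values if v <= 0) / len(values)


# Low versus high coefficient of variation (illustrative settings).
low = simulate_intensities(10_000, mean=1000.0, cv=0.1)
high = simulate_intensities(10_000, mean=1000.0, cv=1.0)

print("cv = 0.1:", fraction_nonpositive(low))
print("cv = 1.0:", fraction_nonpositive(high))
```

With a low coefficient of variation essentially no simulated intensity falls at or below zero, whereas with a high one a noticeable fraction does, which is physically meaningless for fluorescence intensities and is the situation a truncated normal model is meant to handle.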

In this work, we have thus modified an existing normalization strategy to obtain a more general formulation that, unlike the original strategy, can be applied to expression data with a high coefficient of variation. The proposed strategy has been validated by applying it to artificial data. The strategy was also able to successfully normalize expression data from human lymphoblastoid cells and liver tissue. The proposed approach can thus be applied to analyze microarray data in which expression ratios assume a wide range of values owing to high variability in mRNA levels among genes.

References

1. G. S. Cojocaru, G. Rechavi, and N. Kaminski, “The use of microarrays in medicine,” Isr. Med. Assoc. J. 3(4), 292–296 (2001).

2. S. Russell, L. A. Meadows, and R. R. Russell, Microarray Technology in Practice, Academic Press, Amsterdam (2009).

3. Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed, “Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation,” Nucleic Acids Res. 30(4), e15 (2002).

4. P. Pavlidis, Q. Li, and W. S. Noble, “The effect of replication on gene expression microarray experiments,” Bioinformatics 19(13), 1620–1627 (2003).

5. V. G. Tusher, R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proc. Natl. Acad. Sci. U.S.A. 98(9), 5116–5121 (2001).

6. Y.-D. Tan, M. Fornage, and Y.-X. Fu, “Ranking analysis of microarray data: a powerful method for identifying differentially expressed genes,” Genomics 88(6), 846–854 (2006).

7. Y. Chen, E. R. Dougherty, and M. L. Bittner, “Ratio-based decisions and the quantitative analysis of cDNA microarray images,” Journal of Biomedical Optics 2(4), 364–374 (1997).

8. V. G. Cheung, L. K. Conlin, T. M. Weber, M. Arcaro, K.-Y. Jen, M. Morley, and R. S. Spielman, “Natural variation in human gene expression assessed in lymphoblastoid cells,” Nat. Genet. 33(3), 422–425 (2003).

9. X. Chen, S. T. Cheung, S. So, S. T. Fan, C. Barry, J. Higgins, K.-M. Lai, J. Ji, S. Dudoit, I. O. L. Ng, M. Van De Rijn, D. Botstein, and P. O. Brown, “Gene expression patterns in human liver cancers,” Mol. Biol. Cell 13(6), 1929–1939 (2002).