(693f) On the Incorporation of Direction and Magnitude in Statistical Correlation: Application to Gene-Microarray Data | AIChE

Authors 

Subramaniam, S. - Presenter, University of California, San Diego


In the statistical analysis of high-throughput data such as gene-microarray data, correlation is a useful measure for finding similarities or differences between experiments or components. Pearson correlation is one of the most popular correlation measures used for gene-microarray and other biological data. It represents the similarity between the shapes of two time-courses (or, more generally, two data sequences). Because it is invariant to shifts and positive rescaling of either sequence, shape similarity alone is not sufficient to decide whether two time-courses differ significantly from each other. For example, if the values in one time series remain mostly positive while those in the other remain mostly negative, one would classify the two as significantly different despite the similarity of their shapes. Similarly, two profiles with the same sign (both mean values positive or both negative) but relatively large differences in individual values may be considered more similar to each other than two profiles with opposite signs but smaller differences in the data values. Hence, a new correlation measure is needed that includes the effect of the sign (direction) of the time-courses and the effect of the magnitude of the data values.
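The limitation described above can be illustrated with a short numerical sketch. The three time-courses below are hypothetical values chosen for illustration, not LIPID MAPS data, and the helper name `pearson` is our own. Shifting a profile so that it becomes entirely negative, or rescaling it, leaves the Pearson correlation at 1.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))

# Hypothetical time-courses (illustrative values only).
a = np.array([1.0, 2.0, 3.0, 2.5, 1.5])  # all values positive
b = a - 4.0                               # same shape, all values negative
c = 5.0 * a                               # same shape and sign, larger magnitude

r_ab = pearson(a, b)  # shape-only similarity ignores the sign change
r_ac = pearson(a, c)  # and ignores the change in magnitude
print(r_ab, r_ac)
```

Both correlations equal 1 up to floating-point rounding, even though `a` and `b` lie on opposite sides of zero.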

Here, we propose a complex correlation that includes the effects of shape, magnitude and direction between two data sequences. Pearson correlation captures the shape similarity. The effect of magnitude is captured through the Euclidean distance. To include the effect of sign in the comparison (the sign-factor), the hyperbolic tangent (tanh) function is used as a smooth approximation of the sign function (sgn). The tanh function is often used as a transfer function to model non-linear input/output relationships in artificial neural networks (1, 2). With a scaling parameter, the tanh(.) function can be made to approximate the sgn(.) function as closely as desired. Independently of the Pearson correlation, we compute a distance-amplified correlation based on the Euclidean distance and the sign-factor. Such an approach to computing magnitude-based similarity has been used earlier in signal processing and in similarity-search-based fault identification in chemical processes (3). Next, a weighted average of the Pearson correlation and the distance-amplified correlation is computed.

To perform statistical hypothesis testing, one needs to compute a probability distribution for the proposed complex correlation. There are two choices for generating the datasets used to compute the distribution. The first is to use completely random data, possibly using the mean and variance information from the actual experimental datasets for which the statistical test is to be performed. The second is to use random permutations of the experimental data.
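The construction described above can be sketched in a few lines of Python. The abstract does not give the formulas, so everything beyond "tanh as a smooth sgn" and "weighted average" is an assumption made for illustration: the sign-factor as a product of tanh-smoothed mean signs, the distance term as 1/(1 + RMS difference), the way the two are combined, and the parameter names `k` (tanh scaling) and `w` (weight). It should not be read as the authors' definition.

```python
import numpy as np

def smooth_sign(v, k=10.0):
    """tanh(k*v): smooth stand-in for sgn(v); larger k gives a sharper step."""
    return np.tanh(k * v)

def sign_factor(x, y, k=10.0):
    """ASSUMED form: product of the smoothed signs of the two means, in [-1, 1]."""
    return float(smooth_sign(np.mean(x), k) * smooth_sign(np.mean(y), k))

def distance_amplified(x, y, k=10.0):
    """ASSUMED form: Euclidean closeness in (0, 1], gated by the sign-factor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.linalg.norm(x - y) / np.sqrt(x.size)  # RMS difference
    closeness = 1.0 / (1.0 + d)
    gate = 0.5 * (1.0 + sign_factor(x, y, k))    # ~1 same sign, ~0 opposite sign
    return gate * closeness

def complex_correlation(x, y, w=0.5, k=10.0):
    """Weighted average of Pearson correlation and the distance-based term."""
    r = float(np.corrcoef(x, y)[0, 1])
    return w * r + (1.0 - w) * distance_amplified(x, y, k)

# Hypothetical profiles (illustrative values only).
a = np.array([1.0, 2.0, 3.0, 2.5, 1.5])
b = a - 4.0  # opposite sign, RMS difference 4
c = a + 6.0  # same sign, RMS difference 6

score_ab = complex_correlation(a, b)
score_ac = complex_correlation(a, c)
print(score_ab, score_ac)
```

Under these assumed forms, the same-sign pair (`a`, `c`) scores higher than the opposite-sign pair (`a`, `b`) despite its larger Euclidean distance, which is the behaviour the text asks for. Other gating or distance choices would satisfy the same description.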

We have applied this approach to mRNA data from mouse macrophage RAW 264.7 cells and thioglycollate-elicited macrophage (TGEM) cells treated with KDO2 lipid A (a lipopolysaccharide analogue); the data were generated by the LIPID MAPS genomics core (www.lipidmaps.org). We have found that the random-permutation approach is better for generating the probability distributions. As expected, the resulting distribution does not resemble any commonly used standard distribution. The effect of the parameters and weights used in the correlation has also been studied and is being investigated further.
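A permutation-based null distribution of the kind mentioned above can be sketched as follows. The abstract does not say what is permuted, so shuffling the order of values within one sequence is an assumption here, as are the function names, the number of permutations, and the one-sided empirical p-value. Plain Pearson correlation stands in for the statistic; any similarity function of two sequences could be passed instead.

```python
import numpy as np

def permutation_null(x, y, statistic, n_perm=1000, seed=0):
    """Null distribution of statistic(x, y) under random reordering of y (assumed scheme)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    return np.array([statistic(x, rng.permutation(y)) for _ in range(n_perm)])

def empirical_p_value(observed, null):
    """One-sided p-value: share of null values >= observed, with +1 correction."""
    return float((1 + np.sum(null >= observed)) / (1 + null.size))

def pearson(x, y):
    """Stand-in statistic for this sketch."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical sequences (illustrative values only).
x = np.arange(10, dtype=float)
y = x + 0.3 * np.cos(x)  # nearly the same trend as x

observed = pearson(x, y)
null = permutation_null(x, y, pearson)
p = empirical_p_value(observed, null)
print(observed, p)
```

Because the null distribution is built empirically, no assumption about its shape is needed, which matters when the distribution does not match a standard form.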

References

1. Mathworks. 1994. The Mathworks, Inc. © 1994-2004. Natick, MA. http://www.mathworks.com/.

2. Venkatasubramanian, V., and K. Chan. 1989. A Neural Network Methodology for Process Fault-Diagnosis. AIChE Journal. 35:1993-2002.

3. Maurya, M. R., R. Rengaswamy, and V. Venkatasubramanian. 2007. Fault diagnosis using dynamic trend analysis: A review and recent developments. Engineering Applications of Artificial Intelligence. 20:133-146.

This paper has an Extended Abstract file available; you must purchase the conference proceedings to access it.

2010 Annual Meeting