(131c) A Non-Independent Model For Transcription Factor Binding Prediction | AIChE

Authors 

Yang, E. - Presenter, Rutgers - The State University of New Jersey
Androulakis, I. - Presenter, Rutgers - The State University of New Jersey


Transcription factor binding prediction is, at its core, a classification problem[1]. Within this supervised classification setting, however, very little work has gone into improving the basic classifier. Most work has instead focused on expanding the scope of the classifier by incorporating external information such as phylogenetic conservation or the co-expression of genes[2, 3]. We contend that significant improvements can be made to the initial classifier itself.

The primary motivation for this method is the observation that the naïve Bayesian classifier performs very poorly in terms of false positives (FP) versus true positives (TP)[4]. Given the performance shown by the generalized ROC curve, rejecting the majority of the false positives would also mean rejecting the majority of the true positives, and vice versa. We therefore feel that even predictions which draw on additional evidence such as phylogenetic footprinting do not address the generally poor quality of the underlying classifier. Rather, the success of phylogenetic footprinting masks the inability of the standard PWM algorithms to distinguish between binding and non-binding sequences, because phylogenetic footprinting is itself the better classifier.

The initial classifier used for most supervised transcription factor binding predictions is based on the naïve Bayesian classifier, in which each base is treated independently. However, given the way in which proteins dock with the DNA strand, the base at one position (position 3, say) can affect the binding of a base at another position. Taking this into account, we have formulated a modified n-gram model which orders the positions by their specificities to account for long-range interactions, and we use it as a classifier. This modified n-gram model can be shown to perform significantly better than the standard naïve Bayesian model, cutting down the ratio of FPs to TPs.
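The idea of ordering positions by specificity and then applying a bigram (n = 2) model over the reordered positions can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: specificity is measured as per-column information content, the background is uniform, counts are smoothed with a pseudocount, and the function names (`position_specificity`, `train_sorted_bigram`, `score`) and the toy sites are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_specificity(sites):
    """Information content (bits) of each column of equal-length aligned sites.
    Assumption: this stands in for the 'specificity' of a position."""
    n = len(sites)
    info = []
    for j in range(len(sites[0])):
        counts = Counter(s[j] for s in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)
    return info

def train_sorted_bigram(sites, pseudocount=1.0):
    """Order positions from most to least specific, then fit a first-order
    (bigram) model along that ordering with add-pseudocount smoothing."""
    info = position_specificity(sites)
    order = sorted(range(len(info)), key=lambda j: -info[j])
    n = len(sites)
    first = order[0]
    start = {b: (sum(s[first] == b for s in sites) + pseudocount)
                / (n + 4 * pseudocount) for b in BASES}
    trans = []
    for prev, cur in zip(order, order[1:]):
        table = {}
        for a in BASES:
            rows = [s for s in sites if s[prev] == a]
            table[a] = {b: (sum(s[cur] == b for s in rows) + pseudocount)
                           / (len(rows) + 4 * pseudocount) for b in BASES}
        trans.append(table)
    return order, start, trans

def score(seq, model, background=0.25):
    """Log-likelihood ratio of seq under the sorted bigram model versus a
    uniform background (the uniform background is an assumption)."""
    order, start, trans = model
    total = math.log(start[seq[order[0]]] / background)
    for k in range(1, len(order)):
        a, b = seq[order[k - 1]], seq[order[k]]
        total += math.log(trans[k - 1][a][b] / background)
    return total

# Toy, made-up binding sites purely for demonstration.
toy_sites = ["ACGT", "ACGA", "ACGT", "ACTT"]
toy_model = train_sorted_bigram(toy_sites)
print(score("ACGT", toy_model), score("TGCA", toy_model))
```

Because each base is conditioned on the base at the preceding position in the sorted order rather than on its sequence neighbour, dependencies between positions that are far apart in the sequence can still be captured by a model with n = 2.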

Secondly, using this model, we have constructed a secondary feature that describes a given predicted transcription factor binding site, specifically the position dependence of the bases in relation to each other. For the majority of known binding sites, randomly permuting the sequence was found to generate a lower score than the original sequence, whereas for a randomly generated sequence the permutations are equally likely to score higher or lower than the base sequence. This allows for the construction of an all-against-one classifier system. A simplification of the model used a modified n-gram model in which the long-range interactions were modeled by first sorting the bases by their specificity, i.e. preference for a specific base, and then applying a standard n-gram model with n = 2. This model showed a ~2-fold improvement in classification accuracy as measured by the area under the ROC curve. Additionally assessing whether a base permutation would lower the score gave a further slight improvement in the overall quality of the classifier, though it remained within a few percent of the non-specific model.
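The permutation-based feature can be sketched as the fraction of random shuffles of a candidate sequence that score below the unshuffled sequence. This is a minimal illustration only: the function name `permutation_fraction`, the number of shuffles, the fixed seed, and the toy scoring function `toy_score` (a simple match count against a made-up consensus) are all assumptions, and any position-dependent scorer could be passed in instead.

```python
import random

def permutation_fraction(seq, score_fn, n_perm=200, seed=0):
    """Fraction of random shuffles of seq that score strictly lower than seq.
    Values near 1 suggest the base order matters for the score."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    base_score = score_fn(seq)
    bases = list(seq)
    lower = 0
    for _ in range(n_perm):
        rng.shuffle(bases)
        if score_fn("".join(bases)) < base_score:
            lower += 1
    return lower / n_perm

def toy_score(seq, consensus="ACGT"):
    """Hypothetical stand-in scorer: number of positions matching a consensus."""
    return sum(a == b for a, b in zip(seq, consensus))

site_like = permutation_fraction("ACGT", toy_score)   # matches the consensus
unrelated = permutation_fraction("TGCA", toy_score)   # matches at no position
print(site_like, unrelated)
```

In this toy setting the consensus-like sequence loses score under almost every shuffle, while the unrelated sequence cannot score lower than it already does, so the fraction separates the two cases.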

1. Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004, 5:276-287.

2. Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res 2004, 32:W249-252.

3. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 2003, 34:166-176.

4. Beyer A, Workman C, Hollunder J, Radke D, Moller U, Wilhelm T, Ideker T: Integrated assessment and prediction of transcription factor binding. PLoS Comput Biol 2006, 2:e70.