(597e) Selecting High Quality Protein Structures From Diverse Conformational Ensembles | AIChE

Authors 

Subramani, A. - Presenter, Princeton University
DiMaggio, P. A. Jr. - Presenter, Princeton University
Floudas, C. A. - Presenter, Princeton University


Protein structure prediction encompasses two major challenges: (1) the generation of a large ensemble of high-resolution structures for a given amino acid sequence and (2) the identification of the structure closest to the native structure in a blind prediction. In this paper, we address the second challenge by proposing a novel iterative clustering method, based on the Traveling Salesman Problem (TSP), that identifies the structures in a given ensemble that are closest to the native structure.

The method, denoted ICON (Iterative Clustering approach for Optimal selection of near-Native structures), identifies near-native structures through dihedral angle based clustering, implementing the rigorous global rearrangement clustering presented in DiMaggio et al. (2008) [1] in the dihedral angle space. The main thesis behind ICON is that if two conformers of a protein are very similar to the native structure, they are likely to be similar to each other as well. However, if two protein structures are very dissimilar to the native structure, they are not necessarily similar to each other. We implement this thesis by eliminating clusters of protein structures which are very dissimilar to each other.

Once the optimal ordering of rows (i.e. protein conformers) is obtained from the TSP implementation of OREO [1], cluster boundaries in the path are determined. This can be carried out by two methods. The first method involves finding the "knee" in the curve relating the hierarchical binding of elements to clusters to the number of clusters [2]. The knee is located by fitting a straight line to the curve on either side of each candidate point and choosing the point at which the deviation of the data from the two fitted lines is minimized. The second method involves finding the optimal number of clusters by starting with an initial set of cluster seeds and assigning each row (i.e. conformer) to either the cluster preceding it or the cluster following it in the optimal TSP output path.

Once the number of clusters is estimated, the cluster medoid and average cluster radius of each cluster are evaluated using an integer linear programming formulation. Once the cluster medoids are estimated, the low density clusters are eliminated for the next iteration. The density of a cluster is estimated by its Cluster Concentration, defined as the ratio of the number of elements in the cluster to the average cluster radius.
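The first boundary-finding method, the two-line knee fit, can be illustrated with a minimal Python sketch. It is written in the spirit of the approach in [2] and is not the authors' implementation. The function names `find_knee` and `_fit_rmse`, the use of NumPy, and the weighting of each side's error by its share of the points are assumptions made for illustration.

```python
import numpy as np


def _fit_rmse(x, y):
    """Root-mean-square error of a least-squares straight line through (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return float(np.sqrt(np.mean(residuals ** 2)))


def find_knee(x, y):
    """Return the index of the last point that belongs to the left-hand line.

    Every split that leaves at least two points on each side is tried, and
    the split with the lowest size-weighted RMSE of the two fits is returned.
    (Hypothetical helper; the weighting scheme is an assumption.)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if n < 4:
        raise ValueError("need at least four points to fit two lines")

    best_c, best_err = None, np.inf
    for c in range(1, n - 2):  # left = points 0..c, right = points c+1..n-1
        left_err = _fit_rmse(x[:c + 1], y[:c + 1])
        right_err = _fit_rmse(x[c + 1:], y[c + 1:])
        total = ((c + 1) * left_err + (n - c - 1) * right_err) / n
        if total < best_err:
            best_c, best_err = c, total
    return best_c
```

Here `x` would hold the candidate numbers of clusters and `y` the corresponding merge or binding measure. The value `x[find_knee(x, y)]` is then a candidate for the number of clusters.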
A distribution of the Cluster Concentrations is then constructed, and an iterative procedure is implemented that retains the tightly bound, dense clusters and eliminates the loosely bound ones, noting that these lie at opposite ends of this distribution. After a set number of iterations, we collect the medoids of all clusters in the final set whose concentration is greater than the median. These are ranked using the Cα-Cα [3] or the Centroid-Centroid force field [4], and the five lowest-energy structures are reported.
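The final selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ICON implementation: the medoid is found by brute force rather than by the integer linear programming formulation, the distance is an RMS difference of dihedral angles in degrees with wrap-around, the average radius includes the medoid itself, and the energies are assumed to be precomputed by one of the force fields. The TSP re-ordering and the re-clustering between iterations are not shown. All function and parameter names are hypothetical.

```python
import math
from statistics import median


def dihedral_distance(a, b):
    """RMS angular difference (degrees) between two equal-length dihedral vectors."""
    total = 0.0
    for p, q in zip(a, b):
        d = abs(p - q) % 360.0
        d = min(d, 360.0 - d)  # wrap around, so -170 and 175 are 15 apart
        total += d * d
    return math.sqrt(total / len(a))


def medoid_and_radius(members, conformers):
    """Brute-force medoid (assumption: stands in for the ILP formulation).

    Returns the member with the smallest summed distance to all members,
    and the average distance from that medoid to the members.
    """
    best_id, best_sum = None, math.inf
    for i in members:
        s = sum(dihedral_distance(conformers[i], conformers[j]) for j in members)
        if s < best_sum:
            best_id, best_sum = i, s
    return best_id, best_sum / len(members)


def cluster_concentration(n_members, avg_radius):
    """Number of elements divided by the average cluster radius.

    Zero-radius clusters (e.g. singletons) get 0.0 here, an arbitrary choice.
    """
    return n_members / avg_radius if avg_radius > 0 else 0.0


def select_medoids(clusters, conformers, energy, top_k=5):
    """Keep medoids of clusters denser than the median; return the top_k lowest in energy."""
    scored = []
    for members in clusters:
        medoid, radius = medoid_and_radius(members, conformers)
        scored.append((cluster_concentration(len(members), radius), medoid))
    cutoff = median(c for c, _ in scored)
    kept = [m for c, m in scored if c > cutoff]
    return sorted(kept, key=lambda m: energy[m])[:top_k]
```

In this sketch `conformers` maps a conformer id to its list of dihedral angles, `clusters` is a list of id lists, and `energy` maps a conformer id to its force-field energy.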

ICON was tested on four data sets: (a) 1400 proteins with high resolution decoys, (b) medium to low resolution decoys from Decoys 'R' Us, (c) medium to low resolution decoys from the first principles approach, ASTROFOLD, and (d) selected targets from CASP8. The extensive tests demonstrate that ICON can identify high quality structures in each ensemble, regardless of the resolution of the conformers. Across a total of 1454 proteins, with an average of 1051 conformers per protein, the conformers selected by ICON are, on average, in the top 3.5% of the conformers in the ensemble. The method has also been compared with a state-of-the-art clustering method, SPICKER [5]. In 81.5% of the cases in the high resolution data set, ICON performs better than SPICKER in selecting near-native structures from the ensembles, and in 86.2% of the cases it performs at least as well as SPICKER.
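The abstract does not state exactly how "top 3.5%" is computed. One plausible reading, assumed here for illustration only, is the rank percentile of the selected conformer by its RMSD to the native structure within the ensemble. The function name `top_percent` is hypothetical.

```python
def top_percent(selected_rmsd, ensemble_rmsds):
    """Percentage of the ensemble at least as close to native as the selection.

    Assumption: quality is measured by RMSD to the native structure, where
    lower is better. The abstract does not specify the measure used.
    """
    at_least_as_good = sum(1 for r in ensemble_rmsds if r <= selected_rmsd)
    return 100.0 * at_least_as_good / len(ensemble_rmsds)
```

Under this reading, a value of 3.5 for an ensemble of about 1051 conformers would mean that roughly 37 conformers are at least as close to the native structure as the one selected.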

Bibliography

[1] PA DiMaggio Jr et al. (2008) Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics 9:458

[2] S Salvador and P Chan (2004) Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. Proc. IEEE International Conference on Tools with Artificial Intelligence, 576-584

[3] R Rajgaria et al. (2006) A novel high resolution C-alpha C-alpha distance dependent force field based on a high quality decoy set. Proteins 65:726-741

[4] R Rajgaria et al. (2008) Distance dependent centroid to centroid force fields using high resolution decoys. Proteins 70:950-970

[5] Y Zhang and J Skolnick (2004) SPICKER: A Clustering Approach to Identify Near-Native Protein Folds. J Comput Chem 25:865-871