(198c) A Novel Approach for Modified Protein Identification | AIChE

(198c) A Novel Approach for Modified Protein Identification

Authors 

Baliban, R. - Presenter, Princeton University
DiMaggio, P. A. Jr. - Presenter, Princeton University
Garcia, B. A. - Presenter, Princeton University
Floudas, C. A. - Presenter, Princeton University


Over the past several years, various tools have been developed for proteomics analysis at the peptide level using tandem mass spectrometry. However, the analysis of a sample data set is not complete without identifying the comprehensive list of proteins present in the sample. Further, each of these proteins must be labeled with all corresponding post-translational modifications (PTMs) to facilitate analysis of the dynamic proteome. Though a variety of instruments are available for this analysis, the most accurate measurements come from high-resolution data from QTOF, LTQ-FT, or Orbitrap instruments. The high accuracy of these instruments can yield better peptide identification, but it is still imperative to minimize false positives at the protein level. A common methodology is to group proteins based on homology, so we look to recent developments in bi-clustering techniques [1] to reduce false positives by grouping together homologous proteins.

To fully analyze an LC-MS/MS data set, we have developed PILOT_PROTEIN, a novel method designed to extract a comprehensive protein list and subsequently analyze the uncharacterized MS/MS spectra using an untargeted post-translational modification search. PILOT_PROTEIN [2] first uses the results of the hybrid peptide identification algorithm PILOT_SEQUEL [3] to generate a list of proteins present in a data sample. The PILOT_SEQUEL algorithm outputs a rank-ordered list of the top database peptide sequences and their corresponding proteins [3]. PILOT_PROTEIN then constructs the complete list of possible proteins and assigns each protein all of the appropriate peptides from the data set [2]. A protein score is calculated based on the individual scores of the peptides. Due to the large number of proteins on the comprehensive list, a thorough false positive elimination algorithm is employed. In the first stage, each protein score is checked against a minimum threshold, and any protein scoring below the threshold is eliminated from the list. The second stage of the false positive filtering uses the clustering algorithm OREO [1] to group proteins of similar sequence based on their Smith-Waterman alignment score. Only the highest-scoring protein in each cluster is considered a valid protein, while the other proteins in the cluster are listed as homologues. Once the filtered protein list has been constructed, PILOT_PROTEIN analyzes all MS/MS spectra that were not annotated, using the PILOT_PTM algorithm. Using a universal list of modifications, PILOT_PTM assigns to a template sequence the set of modifications that best explains the experimental data. The template sequences are derived by sequence tag generation followed by extraction from the protein list.
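The two-stage false positive filter described above can be illustrated with a minimal Python sketch. It is not the PILOT_PROTEIN implementation: the function names, the additive protein score, and the threshold value are assumptions made for illustration, and the homology clusters are taken as a precomputed input rather than derived with OREO from Smith-Waterman alignment scores.

```python
from collections import defaultdict


def score_proteins(peptide_hits):
    """Aggregate peptide scores per protein.

    peptide_hits: iterable of (peptide, protein_id, peptide_score).
    Assumption: the protein score is the sum of its peptide scores; the
    abstract only states that it is based on the individual peptide scores.
    """
    scores = defaultdict(float)
    for _peptide, protein_id, peptide_score in peptide_hits:
        scores[protein_id] += peptide_score
    return dict(scores)


def filter_proteins(scores, clusters, min_score):
    """Apply the threshold stage, then keep the top protein per cluster.

    clusters: list of lists of protein ids (hypothetical precomputed
    homology groups). Proteins not listed in any cluster are treated as
    singleton clusters.
    Returns (valid_proteins, homologues), where homologues maps each valid
    protein to the lower-scoring members of its cluster.
    """
    # Stage 1: drop proteins whose score is below the minimum threshold.
    passing = {p: s for p, s in scores.items() if s >= min_score}

    # Add singleton clusters for passing proteins that were never clustered.
    clustered = {p for cluster in clusters for p in cluster}
    all_clusters = list(clusters) + [[p] for p in passing if p not in clustered]

    # Stage 2: keep the highest-scoring protein in each cluster.
    valid, homologues = [], {}
    for cluster in all_clusters:
        members = [p for p in cluster if p in passing]
        if not members:
            continue
        best = max(members, key=lambda p: passing[p])
        valid.append(best)
        homologues[best] = sorted(m for m in members if m != best)
    return valid, homologues


# Toy example with made-up peptides, proteins, and scores.
hits = [
    ("PEPA", "P1", 0.9),
    ("PEPB", "P1", 0.8),
    ("PEPA", "P2", 0.9),
    ("PEPC", "P3", 0.2),
    ("PEPD", "P4", 1.5),
]
scores = score_proteins(hits)
valid, homologues = filter_proteins(scores, [["P1", "P2"], ["P3"]], min_score=0.5)
print(valid, homologues)
```

In this toy input, P3 falls below the threshold, P2 shares a cluster with the higher-scoring P1 and is reported as its homologue, and P4 survives as an unclustered protein.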

To verify the protein prediction capability of the method, PILOT_PROTEIN was initially tested on several high-resolution data sets, including (a) 37 spectra fragmented via QTOF and (b) 401 spectra fragmented via Orbitrap [3]. Large-scale LC-MS/MS data sets were then taken from the Standard Protein Mix Database, which includes (c) 10 complete LC-MS/MS runs from Orbitrap, (d) 55 complete LC-MS/MS runs from LTQ-FT, and (e) 30 LC-MS/MS runs from QTOF [4]. Each data set contains a known set of proteins, and sets (c), (d), and (e) contain several low-abundance contaminants. The results were quantified by counting how many of the known proteins were found, how many contaminants were found, and how many false positives (any additional proteins) were reported. PILOT_PROTEIN was able to extract a large number of the possible proteins while significantly reducing the number of false positives reported. PILOT_PROTEIN was then benchmarked on all five data sets against several established algorithms: InsPecT [5], X!Tandem [6], ProteinProspector [7], and VEMS [8]. PILOT_PROTEIN reported a higher number of proteins on average than any of the four competing algorithms and maintained a smaller list of false positives for each data set. Finally, the sequence tag generation, candidate peptide extraction, and untargeted modification search were tested on 50 LC-MS/MS data samples from total chromatin extractions [2].
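The three counts used to quantify the results can be computed with simple set operations. The sketch below is illustrative only: the function name and the protein identifiers are hypothetical, and it assumes that the known proteins and the contaminants of a data set are available as separate lists.

```python
def summarize_identifications(reported, known, contaminants):
    """Count known proteins found, contaminants found, and false positives.

    A false positive is any reported protein that is neither a known
    protein nor a listed contaminant.
    """
    reported, known, contaminants = set(reported), set(known), set(contaminants)
    return {
        "known_found": len(reported & known),
        "contaminants_found": len(reported & contaminants),
        "false_positives": len(reported - known - contaminants),
    }


# Toy example with made-up protein identifiers.
summary = summarize_identifications(
    reported=["P1", "P2", "C1", "X9"],
    known=["P1", "P2", "P3"],
    contaminants=["C1", "C2"],
)
print(summary)
```

Here two of the three known proteins and one of the two contaminants are recovered, and X9 counts as the single false positive.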

[1] P. A. DiMaggio Jr., S. R. McAllister, X. L. Feng, J. D. Rabinowitz, H. A. Rabitz, and C. A. Floudas. Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics, 9(1):458, 2008.

[2] R. C. Baliban, P. A. DiMaggio Jr., M. D. Plazas-Mayorca, N. L. Young, B. A. Garcia, and C. A. Floudas. A Novel Method for Untargeted Post-Translational Modification Identification Using Integer Linear Optimization and Tandem Mass Spectrometry. Mol. Cell. Proteomics, 9:764-779, 2010.

[3] P. A. DiMaggio Jr., B. Lu, J. R. Yates III, and C. A. Floudas. A Hybrid Method for Peptide Identification Using Integer Linear Optimization, Local Database Search, and Quadrupole Time-of-Flight or OrbiTrap Tandem Mass Spectrometry. J. Proteome Res., 7(4):1584-1593, 2008.

[4] J. Klimek, J. S. Eddes, L. Hohmann, J. Jackson, A. Peterson, S. Letarte, P. R. Gafken, J. E. Katz, P. Mallick, H. Lee, A. Schmidt, R. Ossola, J. K. Eng, R. Aebersold, and D. B. Martin. The Standard Protein Mix Database: A Diverse Data Set To Assist in the Production of Improved Peptide and Protein Identification Software Tools. J. Proteome Res., 7(1):96-103, 2008.

[5] S. Tanner, H. Shu, A. Frank, L. C. Wang, E. Zandi, M. Mumby, P. A. Pevzner, and V. Bafna. InsPecT: Identification of Posttranslationally Modified Peptides from Tandem Mass Spectrometry. Anal. Chem., 77(14):4626-4639, 2005.

[6] R. Craig and R. C. Beavis. TANDEM: matching proteins with tandem mass spectra. Bioinformatics, 20(9):1466-1467, 2004.

[7] K. R. Clauser, P. R. Baker, and A. L. Burlingame. Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem., 71(14):2871-2882, 1999.

[8] R. Matthiesen, M. Lundsgaard, K. G. Welinder, and G. Bauw. Interpreting peptide mass spectra by VEMS. Bioinformatics, 19(6):792-793, 2003.