(458h) An Optimization-Based Feature Selection Methodology for the Discovery of Biomarkers from High-Dimensional Data in Clinical Applications | AIChE

Authors 

Guzman, Y. A. - Presenter, Princeton University
Jayachandran, D., Purdue University
Ramkrishna, D., Purdue University
Floudas, C. A., Princeton University

        Biomarkers are measurable indicators of biological processes that can be applied in clinical settings for disease diagnosis and prognosis, risk-factor assessment, disease staging, and assessment of treatment efficacy.  High-throughput -omics platforms enable large-scale studies that generate expansive datasets with thousands of candidate biomarkers (i.e., data features).  These platforms provide a great opportunity for untargeted biomarker discovery in clinical applications, but the sheer volume of data creates a difficult data-analysis problem.  For a biomarker to be accepted into clinical practice, it must pass large-scale, often expensive clinical validation stages [1]; the ultimate success of a discovery-phase biomarker study therefore lies in its ability to produce a small subset of biomarkers with the greatest probability of success in large-scale targeted studies [1,2].  The high dimensionality of the data per sample is almost always coupled with a comparatively low number of samples in the discovery phase, yielding a statistically difficult feature selection problem with a high probability of overfitting or of selecting data artifacts as meaningful candidates [3].  In response, a number of feature-selection algorithms have been proposed for biomarker selection, with varying levels of efficacy [4-7].

        Building on a previous study in which mixed-integer linear optimization models were proposed to classify healthy and diseased samples [8], we developed a novel optimization-based methodology for candidate biomarker selection.  We evaluated the methodology in terms of selection stability and accuracy, using experimental datasets from the literature whose discriminating features are known a priori [7,9,10].  The method has been applied to a proteomics dataset of plasma samples from breast cancer patients to select diagnostic biomarkers [11].  This application yielded a multiple-reaction monitoring assay for further clinical validation.  We also present the application of our method to a metabolomics study of patients undergoing chemotherapy.  A subset of metabolites that can predict which patients will develop chemotherapy-induced toxicity can guide the treatment decisions of practitioners.
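
        To illustrate what evaluating a selector for selection stability can look like in practice, the following is a minimal sketch and not the methodology presented in this work. It assumes stability is measured as the mean pairwise Jaccard similarity of the feature subsets chosen on repeated stratified subsamples, which is one common choice and not necessarily the measure used in the study. The function names and the simple mean-difference ranking are hypothetical placeholders; any selector with the same call signature, including an optimization-based one, could be substituted.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature-index collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Mean pairwise Jaccard similarity over a list of selected feature subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def top_k_by_mean_difference(X, y, k):
    """Hypothetical placeholder selector (assumption, not the authors' method):
    rank features by the absolute difference between class means, keep the top k."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def subsample_stability(X, y, k, selector=top_k_by_mean_difference,
                        n_rounds=20, fraction=0.8, seed=0):
    """Run the selector on stratified subsamples and report selection stability.

    X is a list of samples (each a list of feature values); y holds 0/1 labels.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    subsets = []
    for _ in range(n_rounds):
        # Sample each class separately so every subsample contains both classes.
        idx = (rng.sample(pos_idx, max(1, int(fraction * len(pos_idx))))
               + rng.sample(neg_idx, max(1, int(fraction * len(neg_idx)))))
        subsets.append(selector([X[i] for i in idx], [y[i] for i in idx], k))
    return selection_stability(subsets)
```

        A score near 1 means that nearly the same features are selected regardless of which samples are drawn, while a score near 0 suggests that the selection is driven by sample-specific noise, a particular risk when features far outnumber samples.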

References:

1. Rifai N, Gillette MA, Carr SA. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol. 24(8):971-83 (2006).

2. Srinivas PR, Verma M, Zhao Y, Srivastava S. Proteomics for cancer biomarker discovery. Clin Chem. 48(8):1160-9 (2002).

3. Rubingh CM, Bijlsma S, Derks EP, Bobeldijk I, Verheij ER, Kochhar S, Smilde AK. Assessing the performance of statistical validation tools for megavariate metabolomics data. Metabolomics. 2:53-61 (2006).

4. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 23(19):2507-17 (2007).

5. Hilario M, Kalousis A. Approaches to dimensionality reduction in proteomic biomarker studies. Brief Bioinform. 9(2):102-8 (2008).

6. Baek S, Tsai CA, Chen JJ. Development of biomarker classifiers from high-dimensional data. Brief Bioinform. 10(5):537-46 (2009).

7. Christin C, Hoefsloot HC, Smilde AK, Hoekman B, Suits F, Bischoff R, Horvatovich P. A critical assessment of feature selection methods for biomarker discovery in clinical proteomics. Mol Cell Proteomics. 12(1):263-76 (2013).

8. Baliban RC, Sakellari D, Li Z, Guzman YA, Garcia BA, Floudas CA. Discovery of biomarker combinations that predict periodontal health or disease with high accuracy from GCF samples based on high-throughput proteomic analysis and mixed-integer linear optimization. J Clin Periodontol. 40(2):131-9 (2013).

9. He Z, Yu W. Stable feature selection for biomarker discovery. Comput Biol Chem. 34(4):215-25 (2010).

10. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 26(3):392-8 (2010).

11. Riley CP, Zhang X, Nakshatri H, Schneider B, Regnier FE, Adamec J, Buck C. A large, consistent plasma proteomics data set from prospectively collected breast cancer patient and healthy volunteer samples. J Transl Med. 9:80 (2011).