(618a) Extracting the Biomarker Potential of N-Glycans with a Machine Learning Framework Applied to Colorectal Cancer | AIChE

Authors 

Davies, J., Imperial College London
Nakai, S., Imperial College London
Kontoravdi, C., Imperial College London
Problem

  1. Background

Colorectal cancer (CRC) ranks as the third most frequently diagnosed cancer worldwide, accounting for 10.2% of all cases in 2018, and is the second leading cause of cancer-related deaths, constituting 9.2% of total cancer deaths in the same year. Existing CRC screening methods present challenges, including invasiveness, reduced sensitivity, and elevated costs. The effectiveness of traditional treatments, such as surgery, chemotherapy, and radiation therapy, is hampered by the inability to detect CRC in its early stages, emphasizing the need to pinpoint specific molecular targets to devise more efficient diagnostic procedures and therapies. The molecular basis of CRC has become better understood due to advances in genomic and proteomic research; however, well-validated and clinically useful biomarkers for CRC remain scarce. The emerging field of glycomics offers a promising avenue for study and is gaining momentum in cancer research. Although glycans appear to play a role in tumor proliferation, understanding of the underlying mechanisms significantly trails that of other critical cell components, such as genes and proteins. This is in large part due to the complexity of N-glycan biosynthesis, which is not template-driven and involves the multi-step, concerted action of competing enzymes and metabolic co-substrates. Aberrant glycosylation has been observed in CRC patients, and human serum N-glycans have been proposed as potential biomarkers for diagnosis and advanced therapeutic intervention. N-glycans found on immunoglobulin G (IgG), a glycoprotein abundant in human serum, are of particular interest. In healthy adults, IgG makes up about 75% of total serum immunoglobulins. The convenience of sample collection, combined with indications that CRC is associated with aberrant glycosylation, positions IgG N-glycans as strong candidates for diagnostic and biomarker discovery efforts, which can be facilitated by machine learning methods.

  2. Scope

This work proposes a comprehensive machine learning framework for CRC diagnosis and patient stratification using a dataset from the Study of Colorectal Cancer in Scotland (SOCCS, 1999-2006), containing IgG N-glycomic profiles of cancer patients and controls. The framework is developed with a focus on interpretability to facilitate the identification of potential N-glycan-based biomarkers, and is intended to complement existing CRC screening methods for increased specificity and sensitivity.

Methods

The dataset used to develop the proposed machine learning framework comprises the relative abundances of 24 N-glycan structures normalized by the total area, the body mass index (BMI), and known glycan covariates such as age and gender for 1,413 patients with pathologically confirmed colorectal cancer and 538 matching healthy controls. Limitations of the dataset include the absence of BMI data for 238 samples and a notable imbalance in the number of controls and cancer patients over the age of 60 years. To avoid missing values, a pre-processing procedure is carried out in which missing BMI data are iteratively imputed using the information present in the remaining dataset features. Additionally, the class imbalance with respect to healthy controls is addressed with a modified version of the Synthetic Minority Over-Sampling Technique (SMOTE), which incorporates knowledge of the distributions of the age, gender, and BMI features across healthy and patient samples. This ensures that model classification decisions are attributable to differences in the N-glycomic profiles and not to artificial differences in these covariates introduced by data augmentation. Furthermore, three data scaling methods are used, namely min-max scaling, standard scaling, and robust scaling, in order to assess their influence on the performance of the machine learning algorithms. Five machine learning methods are trained and compared: Random Forest (RF), Support Vector Machines (SVMs), Logistic Regression (LR), Extreme Gradient Boosting (XGBoost), and Soft Voting Ensembles (SVEs). These methods include both non-ensemble and ensemble techniques with different inductive biases, thus constituting a diverse spectrum of classification algorithms that can be used to achieve satisfactory specificity and sensitivity.
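
As an illustration of how the three scaling methods named above differ, the following is a minimal sketch using only the Python standard library. The function names and the sample abundance values are hypothetical and are not taken from the study; libraries such as scikit-learn provide equivalent scalers.

```python
from statistics import mean, median, pstdev, quantiles


def min_max_scale(values):
    """Map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(values)
    return [(v - med) / iqr if iqr else 0.0 for v in values]


# Hypothetical relative abundances (%) of one glycan peak; the last is an outlier.
abundances = [2.0, 4.0, 6.0, 8.0, 30.0]

scaled_min_max = min_max_scale(abundances)
scaled_standard = standard_scale(abundances)
scaled_robust = robust_scale(abundances)
```

In this toy example the single outlier compresses the min-max-scaled values of the other samples towards zero, whereas robust scaling leaves them spread between -1 and 0.5. In a cross-validated setting, the scaling parameters would normally be estimated on the training folds only and then applied to the held-out fold.
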
To avoid data leakage, nested cross-validation (NCV) is employed for hyperparameter tuning and model evaluation, and generalization performance is assessed using learning curves. Finally, biomarker potential is evaluated using permutation feature importance, a global, model-agnostic interpretability method that identifies the N-glycan features most influential in classifying disease outcome. All computational analysis is carried out in the Python programming language.
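
The idea behind permutation feature importance can be sketched as follows: each feature column is shuffled in turn, and the resulting drop in AUC relative to the unshuffled baseline is recorded. This is a minimal standard-library sketch, not the implementation used in the study; `toy_model`, the data, and the parameter names are hypothetical stand-ins for a fitted classifier and real glycan features.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(score_fn, X, y, n_repeats=20, seed=0):
    """Mean drop in AUC when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = auc(y, [score_fn(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - auc(y, [score_fn(r) for r in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted classifier: it scores on feature 0 only.
def toy_model(row):
    return row[0]


X = [[0.1, 0.9], [0.2, 0.1], [0.3, 0.5], [0.7, 0.4], [0.8, 0.8], [0.9, 0.2]]
y = [0, 0, 0, 1, 1, 1]

baseline_auc = auc(y, [toy_model(row) for row in X])
importances = permutation_importance(toy_model, X, y)
```

Because the toy model ignores feature 1, shuffling that column leaves the AUC unchanged and its importance is zero, while shuffling feature 0 degrades the score. The method is model-agnostic in that it needs only a scoring function, not access to the model's internals.
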

Results & Implications

The results of this study support the potential of machine learning-based diagnostic tools for colorectal cancer that use blood-derived IgG N-glycomic data. Without data augmentation, and after discarding all samples from healthy individuals over the age of 60 years, an equally weighted SVE classifier comprising a base XGBoost classifier with min-max scaling and a base LR classifier with robust scaling achieved a mean Area Under the Curve (AUC) score of 72.3% in diagnosing colorectal cancer patients. After applying the proposed data augmentation technique, the XGBoost classifier achieved a mean AUC score of 92.0%, thus demonstrating a proof of concept for a supplementary diagnostic tool that could be developed with large, clinically valid datasets. Consequently, future research on the diagnostic potential of glycomic data should focus on acquiring more data points for controls and colorectal cancer patients, along with clinically relevant features such as age, gender, and BMI. Regarding patient stratification, an RF multiclass classifier trained on the non-augmented dataset was able to distinguish early-stage cancer patients from healthy controls with a mean AUC score of 70.1%, thus showing promise for timely CRC prognosis. Finally, permutation feature importance analysis suggested that alterations in the core-fucosylation of IgG glycans could serve as a potential biomarker, supporting existing research findings on the glycosylation characteristics linked to CRC. The proposed framework is designed to be adaptable to other diseases where blood-derived IgG N-glycomic data may be pertinent. Furthermore, because aberrant glycosylation has been observed in multiple cancer types, this framework has the potential to serve as a general cancer screening tool.
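
For readers unfamiliar with soft voting, the equally weighted ensemble described above can be thought of as averaging the class probabilities predicted by its base classifiers. The following is a minimal sketch of that averaging step only. The probability values are hypothetical and not results from the study, and the 0.5 decision threshold is an assumption.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model positive-class probabilities per sample."""
    n_models = len(probabilities)
    weights = weights or [1.0] * n_models  # equal weighting by default
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, sample_probs)) / total
        for sample_probs in zip(*probabilities)
    ]


# Hypothetical predicted probabilities of CRC from two base models on four
# samples (standing in for an XGBoost model and a logistic regression model).
xgb_probs = [0.90, 0.40, 0.20, 0.60]
lr_probs = [0.70, 0.30, 0.10, 0.70]

ensemble_probs = soft_vote([xgb_probs, lr_probs])

# Assumed decision threshold of 0.5 for the positive (CRC) class.
predictions = [int(p >= 0.5) for p in ensemble_probs]
```

Each base model can be given its own pre-processing (for example a different scaler) before its probabilities are averaged, since the vote only combines the outputs.
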