
(101a) The Language of Mass Spectra: An Application to Glycomics

Authors 

Gunawan, R. - Presenter, SUNY Buffalo
Neelamegham, S., University at Buffalo, State University of New York
Abtheen, E. A., University at Buffalo, State University of New York
Chen, C., University at Buffalo-SUNY
In the last year, there has been considerable excitement around generative artificial intelligence, especially with the release of large language models (LLMs) such as Generative Pre-trained Transformers (GPT). The transformer architecture behind these LLMs uses the self-attention mechanism to learn context-dependent representations (embeddings) of words, where the meaning of a word depends on its context, defined by its position and by the other words and their positions in the text (Vaswani et al., 2017). Of note, this architecture has been successfully adapted to numerous applications outside natural language processing (NLP), including bioinformatics and medicine. In this work, we developed transformer-based language models of mass spectra (MS) for glycomics, called GlycoBERT and GlycoBART.

Glycosylation is the enzymatic process that attaches glycans to proteins and lipids and further modifies them. This post-translational modification influences protein structure, function, and stability, and therefore has many important implications in health and disease. In medicine, glycosylation is connected to applications ranging from diagnostics to therapeutics and vaccine design. For example, aberrant glycosylation patterns are key markers and drivers of cancer, making them targets for innovative diagnostics and treatments. Similarly, understanding the glycosylation of viral proteins can lead to more effective vaccines. Thus, enabling accurate glycomics profiling has a broad impact on medical science.

Mass spectrometry is an indispensable tool for glycomics. A mass spectrum encodes information about the structure and composition of the precursor molecules (glycans) in its mass-over-charge (m/z) peak positions and intensities. In MS data analysis, the significance of a single peak is linked to the presence and intensities of other peaks. Accounting for the contextual relationships across m/z peaks is crucial for deducing the composition of the precursor molecule. This interdependency among m/z peaks mirrors how context shapes the meaning of a word in a sentence or paragraph. The transformer architecture, through its self-attention mechanism, has proven extremely effective at handling context-dependent information in various inference tasks in NLP (Vaswani et al., 2017). For this reason, we leveraged transformers so that each m/z peak can be analyzed in the full context of all the other m/z peaks in a spectrum, enabling a nuanced and accurate inference of the precursor molecule.

The first step in applying the transformer architecture to MS data is tokenization of the mass spectra into a suitable format. We formulated a tokenization procedure that transforms a given mass spectrum into a sequence of tokens, as illustrated in Figure 1. The m/z values are converted into indices by binning, which produces a sequence of m/z bin indices that mirrors a sequence of words in a sentence. Besides the m/z peaks, information related to the experimental settings, relative retention time, and precursor mass is included in the “MS sentence”. The experimental metadata include liquid chromatography type, derivatization, ion mode, ionization type, modification, ion trap, fragmentation mode, glycan type, (relative) retention time, and precursor mass. The relative retention time and precursor mass are also binned and are represented in the sentence by their bin indices. We compiled the vocabulary for MS—the set of all possible words in MS sentences—using a recent MS dataset for glycomics called CandyCrunch (Urban et al., 2023). Each word is assigned a unique token ID, allowing MS sentences to be encoded as inputs for transformers; a minimal sketch of this procedure is given below.
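To make the procedure concrete, the following is a minimal sketch of such a tokenization in Python. The bin widths, token names, and metadata fields shown here are illustrative assumptions rather than the exact GlycoBERT/GlycoBART settings.

```python
import numpy as np

# Illustrative bin widths (assumptions, not the settings used in the actual models).
MZ_BIN_WIDTH = 0.5   # m/z bin width
RT_BIN_WIDTH = 0.01  # relative-retention-time bin width

def spectrum_to_sentence(mz, intensity, metadata, rel_rt, precursor_mz, top_k=100):
    """Convert one MS2 spectrum into a sequence of string tokens (an "MS sentence")."""
    # Keep the top-k most intense peaks, ordered by decreasing intensity.
    order = np.argsort(intensity)[::-1][:top_k]
    peak_tokens = [f"mz_{int(mz[i] // MZ_BIN_WIDTH)}" for i in order]

    # Experimental metadata (e.g., ion mode, fragmentation) become categorical tokens.
    meta_tokens = [f"{key}_{value}" for key, value in metadata.items()]

    # Relative retention time and precursor mass are binned like the m/z peaks.
    context_tokens = [
        f"rt_{int(rel_rt // RT_BIN_WIDTH)}",
        f"prec_{int(precursor_mz // MZ_BIN_WIDTH)}",
    ]
    return meta_tokens + context_tokens + peak_tokens

# Example with made-up values; each string token would then be mapped to an
# integer ID using the MS vocabulary.
sentence = spectrum_to_sentence(
    mz=np.array([204.09, 366.14, 657.23]),
    intensity=np.array([1.0, 0.6, 0.3]),
    metadata={"ion_mode": "negative", "fragmentation": "CID", "lc_type": "PGC"},
    rel_rt=0.42,
    precursor_mz=1079.38,
)
print(sentence)
```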

Among the state-of-the-art transformer-based NLP models, BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) is particularly attractive for the interpretation of MS data. The bidirectional capability of BERT allows for a nuanced contextual analysis of MS peak patterns by considering both the preceding (higher-intensity) and subsequent (lower-intensity) peak information. Moreover, BERT's encoder-based architecture provides flexibility in the types of inference tasks that can be built on the pre-trained model. We developed GlycoBERT, a BERT model for MS-to-glycan inference. GlycoBERT assigns an input mass spectrum to one of the glycan structures in the training dataset (n = 3307). As a classifier, GlycoBERT is unable to perform de novo glycan inference, i.e., it cannot predict glycan structures that are not in the training dataset. To address this limitation, we further developed GlycoBART, which employs BART (Bidirectional and AutoRegressive Transformers), an extension of the BERT model for sequence-to-sequence inference (Lewis et al., 2019). In GlycoBART, MS-to-glycan inference is formulated as a conditional sequence-generation problem, sketched below.
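As an illustration of this formulation, the sketch below sets up a BART-style encoder-decoder with the Hugging Face transformers library; the vocabulary size, model width, and layer counts are placeholders, not the trained GlycoBART configuration.

```python
import torch
from transformers import BartConfig, BartForConditionalGeneration

# Placeholder configuration: vocabulary size, model width, and layer counts are
# illustrative assumptions, not the trained GlycoBART settings.
config = BartConfig(vocab_size=30000, d_model=768, encoder_layers=6, decoder_layers=6)
model = BartForConditionalGeneration(config)

# A dummy "MS sentence" of token IDs stands in for a real tokenized spectrum.
ms_sentence_ids = torch.randint(0, config.vocab_size, (1, 128))

# At inference time, glycan tokens are generated conditioned on the input spectrum,
# so structures outside the training set can in principle be produced.
glycan_token_ids = model.generate(input_ids=ms_sentence_ids, max_length=64)
print(glycan_token_ids.shape)
```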

For model training, we leveraged a recent tandem MS dataset for glycomics, CandyCrunch (Urban et al., 2023), which comprises ~505K mass spectra curated from GlycoPost and the literature. After data filtering, which involved removing spectra with unreliable glycan labels and those with missing experimental metadata, we generated training and test datasets with 410K and 72K MS2 spectra (an 85:15 split), respectively. GlycoBERT was trained using a maximum sequence length of 512 with 12 hidden layers, each having 12 attention heads, and a hidden size of 768. The training was performed on a GPU server with 4 NVIDIA RTX A5000 GPUs. The structural accuracy on the test dataset reached 95.3%. In contrast, the CandyCrunch model trained on the same dataset using a temporal convolutional network (TCN) achieved 87.7% accuracy (Urban et al., 2023). This finding demonstrates the promise of transformer-based language models for MS analysis.
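For reference, the following sketch instantiates a BERT classifier of the size described above using the Hugging Face transformers library; the vocabulary size and the dummy inputs are placeholders, and this is not the trained GlycoBERT model.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# GlycoBERT-sized configuration based on the description above; vocab_size is a
# placeholder for the size of the MS vocabulary.
config = BertConfig(
    vocab_size=30000,
    max_position_embeddings=512,   # maximum MS-sentence length
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    num_labels=3307,               # one class per glycan structure in the training set
)
model = BertForSequenceClassification(config)

# Sanity check on a dummy batch of two MS sentences of length 512.
input_ids = torch.randint(0, config.vocab_size, (2, 512))
logits = model(input_ids=input_ids).logits   # shape: (2, 3307), one score per glycan label
print(logits.shape)
```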

For GlycoBART, we further developed a tokenizer for glycan structures. In this case, each glycan structure is represented by its antennae. For example, the structure “Neu5Ac(a2-3)Gal(b1-3)[Neu5Ac(a2-6)]GalNAc” is decomposed into “Neu5Ac(a2-3)Gal(b1-3)GalNAc Neu5Ac(a2-6)GalNAc”. Subsequently, the monosaccharides and the linkages are tokenized into individual words using a bespoke glycan vocabulary; here, we curated 67 tokens for monosaccharides and linkages. A sketch of this decomposition and tokenization is shown below. Our assessment showed that GlycoBART generates highly accurate predictions, reaching 93.5% test accuracy. More importantly, GlycoBART is able to perform de novo glycan inference for spectra and glycans that do not exist in the training dataset. This generative capability is a significant advance over methods like CandyCrunch and GlycoBERT.
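To illustrate, here is a minimal Python sketch of the antenna decomposition and glycan tokenization. It assumes a single level of bracketed branching attached at the reducing-end residue, as in the example above, and the regular expressions stand in for the actual 67-token glycan vocabulary.

```python
import re

def decompose_antennae(structure: str) -> str:
    """Split a branched glycan string into antennae, assuming one level of [...] branching
    whose attachment point is the reducing-end residue (as in the example above)."""
    branches = re.findall(r"\[([^\]]+)\]", structure)   # bracketed branches
    backbone = re.sub(r"\[[^\]]+\]", "", structure)     # main chain without brackets
    root = re.findall(r"[A-Za-z0-9]+$", backbone)[0]    # reducing-end residue
    antennae = [backbone] + [branch + root for branch in branches]
    return " ".join(antennae)

def tokenize_glycan(antennae: str):
    """Break antennae into monosaccharide and linkage tokens (whitespace separates antennae)."""
    return re.findall(r"\([^)]*\)|[A-Za-z0-9]+|\s", antennae)

s = "Neu5Ac(a2-3)Gal(b1-3)[Neu5Ac(a2-6)]GalNAc"
print(decompose_antennae(s))
# -> "Neu5Ac(a2-3)Gal(b1-3)GalNAc Neu5Ac(a2-6)GalNAc"
print(tokenize_glycan(decompose_antennae(s)))
# -> ['Neu5Ac', '(a2-3)', 'Gal', '(b1-3)', 'GalNAc', ' ', 'Neu5Ac', '(a2-6)', 'GalNAc']
```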

References:

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, 1810.04805.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv e-prints, 1910.13461.

Urban, J., Jin, C., Thomsson, K. A., Karlsson, N. G., Ives, C. M., Fadda, E., & Bojar, D. (2023). Predicting glycan structure from tandem mass spectrometry via deep learning. bioRxiv, 544793.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv e-prints, 1706.03762.

Figure Legend:

Figure 1. Language models of mass spectra. Mass spectra from tandem mass spectrometry are parsed into MS sentences, which are then tokenized to provide inputs for the GlycoBERT and GlycoBART models. GlycoBERT performs sequence classification, in which an input MS2 spectrum is assigned a glycan label (one out of a total of 3307 labels). In contrast, GlycoBART performs conditional sequence generation, in which a glycan sequence (structure) is generated given an input MS2 spectrum.