(169cy) Unsupervised Computational Analysis of Major Histocompatibility Complex II-Binding Epitope Sequences | AIChE

Authors 

Zhang, X. F., University of Massachusetts, Amherst

Abstract

The major histocompatibility complex class II (MHC-II) receptor class is critical to immune reaction in the human body [1]. A lack of MHC-II activation has been linked to increased cancer immune evasion [2], and many recent studies have focused on the relationship between MHC-II receptor functionality and decreased functionality of many varieties of tumors [3] [4] [5] [6]. Many therapeutic drugs designed to influence immune reaction target MHC-II receptors. Analyzing the epitope sequences of proteins known to bind MHC-II receptors can yield a more thorough understanding of which sequence fragments are key to binding. With that understanding, more effective MHC-II-targeting drug candidates can be designed in silico, faster and at lower cost than with traditional in vitro methods.

  1. Introduction

Adaptive immunity refers to the host immune system’s ability to adapt to novel threats over time, and major histocompatibility complex (MHC) receptors are critical to this process [7]. There are two broad subclasses of MHC: MHC class I (MHC-I) and MHC class II (MHC-II). Both subclasses are glycoproteins that present antigenic peptides to T cells. MHC-I presents antigenic peptides to CD8+ T cells, and MHC-II presents them to CD4+ T cells. The major difference between the two is that MHC-I glycoproteins are expressed on the surface of nearly all nucleated cells, whereas MHC-II glycoproteins are expressed primarily on the surface of antigen-presenting cells (APCs) [8]. Because of the slight differences in functionality between MHC-I and MHC-II, this study focuses only on MHC-II, to avoid excessive generalization across MHC subclasses.

MHC-II molecules are produced in MHC class II compartments (MIICs), which form many different subclasses of MHC-II and dictate each molecule’s function [9]. Overall, MHC-II molecules are membrane-anchored, with MHC-encoded, glycosylated alpha and beta subunits. They usually bind peptides 10 to 15 amino acids in length, but can bind much longer peptides, with hundreds of amino acids [7]. The MHC class of molecules is defined by its polymorphic nature: each individual MHC molecule could present hundreds if not thousands of different peptide antigens. The extent of this polymorphism varies across vertebrate species but is present in all of them, and the total number of molecularly defined allelic variants for human MHC-I and MHC-II is in the thousands [7].

The generalized structure of MHC-II is well understood, including the features of its peptide-binding pockets. In general, MHC-II has alpha and beta subunits which come together to present antigenic peptides for targeting by T cells. Figure 1 depicts a ribbon model of the MHC-II molecule supporting a model antigenic peptide, which is shown in ball-and-stick form. The two subunits join to form a cup-like structure, which allows the antigenic peptide to be presented outside the cell for T cell action [7].

Although MHC-II molecules can be grouped broadly as a single class, there are important subclass distinctions to consider when analyzing MHC-II-antigen binding, and each subclass performs a slightly different MHC function [9]. This is due to the highly polymorphic and varied character of MHC molecules as a whole. To account for this, the study limits which molecules are considered. First, it is restricted to MHC-II, which allows the behavior of MHC-II molecules to be analyzed as a broad category; although these molecules are incredibly diverse, some generic patterns may still be observed. After this general assessment of MHC-II as a class, several MHC-II subtypes are analyzed separately to determine how antigen binding differs between them.

When splitting the MHC-II class into subcategories, it is important to consider the limitations of the database in use. This study relies upon the Immune Epitope Database (IEDB) [10], which can split assay results by several different experimental factors. Only linear epitopes were considered, as opposed to conformational epitopes, as this eliminates the need for structural data. To further narrow the study, data was limited to human leukocyte antigen (HLA) class II molecules, the human form of MHC-II. Finally, the data was split into positive and negative sets: positive indicating that MHC-antigen binding took place, and negative indicating that no significant binding took place. This was done so that unsupervised machine learning (UML) could be used to look for biases and patterns in the data.

After data was obtained from the IEDB, further preprocessing was conducted. Any epitopes which had additives branching off the main epitope sequence were removed. Initial analysis of the dataset revealed an overabundance of polyamino acids, so any epitopes with more than four identical amino acids in a row were removed to avoid skewing the model. After these steps, the data was ready for unsupervised assessment.
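The two preprocessing filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the function names are hypothetical, and it assumes that a modified epitope can be recognized by any character outside the 20 standard one-letter amino acid codes, and that "more than four in a row" means a run of five or more identical residues.

```python
import re

# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def has_long_run(seq, max_run=4):
    """True if any residue repeats more than `max_run` times in a row."""
    # "(.)\1{4,}" matches one character followed by at least four copies of it,
    # i.e. a run of five or more identical residues when max_run is 4.
    return re.search(r"(.)\1{%d,}" % max_run, seq) is not None


def is_plain_linear(seq):
    """True if the string contains only standard residues (assumed to mean
    no modification or branch annotation is attached)."""
    return len(seq) > 0 and set(seq) <= STANDARD_AA


def preprocess(epitopes, max_run=4):
    """Keep plain linear epitopes that have no long homopolymer run."""
    kept = []
    for raw in epitopes:
        seq = raw.strip()
        if is_plain_linear(seq) and not has_long_run(seq, max_run):
            kept.append(seq)
    return kept
```

A run of exactly four identical residues is kept under this reading; only runs of five or more are dropped.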

This study uses N-gram analysis, a form of UML. The analysis relies on tokenization, which breaks a long string of data into manageable chunks (tokens) that the unsupervised learning model can then analyze. By tokenizing the antigen epitope sequences, the model can conduct statistical analysis of the dataset as a whole and identify emergent patterns and trends over thousands of instances. In the context of UML, a single data point in a dataset is referred to as an instance, and the variables which distinguish instances from each other are called attributes.
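As an illustration of the tokenization step, the sketch below splits a sequence into N-grams and counts them across a dataset. It assumes overlapping (sliding-window) tokens, which the text does not specify; the function names are hypothetical.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all overlapping substrings of length n, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def count_ngrams(sequences, n):
    """Count every N-gram across a collection of epitope sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ngrams(seq, n))
    return counts
```

Calling `count_ngrams(epitopes, 2).most_common(10)` would then list the ten most frequent 2-grams, where `epitopes` is any list of one-letter amino acid strings.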

This study primarily focused on unsupervised analysis of positive binding epitopes. Future work will include analysis of the negative (non-binding) set, and the use of both sets to construct a supervised model capable of predicting HLA binding of a novel sequence.

  2. Results

The first stage of analysis examined the generic HLA MHC-II binding epitope dataset. This dataset underwent unsupervised N-gram analysis to determine the most common N-grams for generic HLA epitopes, for values of N between 1 and 8. From this analysis, a vocabulary graph can be constructed, as depicted in Figure 2. The graph shows that lower values of N result in a vocabulary plateau. This is expected, because tokenization limits the number of distinct tokens that can exist: with 20 possible amino acids, there are only 20 possible 1-grams, 400 possible 2-grams, and so on. As N increases, the plateau becomes less prominent, which indicates the dataset is not diverse enough to cover the full space of possible tokens (20^8 possible 8-grams, for example).
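A vocabulary curve of this kind can be computed by tracking how many distinct N-grams have been seen as sequences are read in. The sketch below is one possible construction, not necessarily how Figure 2 was produced; it assumes overlapping tokens and records one point per sequence. The function names are hypothetical.

```python
def vocabulary_growth(sequences, n):
    """Return (tokens_read, distinct_tokens_seen) after each sequence."""
    seen = set()
    tokens_read = 0
    curve = []
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            seen.add(seq[i:i + n])
            tokens_read += 1
        curve.append((tokens_read, len(seen)))
    return curve


def max_vocabulary(n, alphabet_size=20):
    """Upper bound on the number of distinct N-grams over the alphabet."""
    return alphabet_size ** n
```

Plotting the second element of each pair against the first gives the vocabulary graph, and `max_vocabulary(n)` gives the ceiling the curve can approach: 20 for N = 1, 400 for N = 2, and 25,600,000,000 for N = 8.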

Based on Figure 2, N-grams with N from 1 to 4 result in a saturated vocabulary, so the sample size can be assumed adequate for analysis over that range of N. For larger values of N, the vocabulary does not reach a saturation plateau before the analysis terminates, and novel tokens are therefore more likely to appear in new data. In light of this, the most significant N-gram predictors of epitope binding will likely come from N between 1 and 4.
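Saturation is judged visually from the vocabulary graph in the text. One way to make the judgment reproducible is to measure how much of the final vocabulary was first seen in the last portion of the data. The heuristic below is an assumption introduced for illustration, including its 20% tail and 1% tolerance defaults; it takes a curve of (tokens read, distinct tokens seen) pairs.

```python
def tail_growth(curve, tail_fraction=0.2):
    """Fraction of the final vocabulary first seen in the last
    `tail_fraction` of tokens read."""
    if not curve:
        raise ValueError("curve is empty")
    total_tokens, final_vocab = curve[-1]
    if final_vocab == 0:
        return 0.0
    cutoff = total_tokens * (1.0 - tail_fraction)
    # Vocabulary size at the last recorded point at or before the cutoff.
    vocab_at_cutoff = 0
    for tokens, vocab in curve:
        if tokens <= cutoff:
            vocab_at_cutoff = vocab
        else:
            break
    return (final_vocab - vocab_at_cutoff) / final_vocab


def is_saturated(curve, tail_fraction=0.2, tolerance=0.01):
    """Heuristic plateau test: almost no new tokens appear in the tail."""
    return tail_growth(curve, tail_fraction) <= tolerance
```

Under this heuristic, a curve that is flat over its final fifth counts as saturated, while one still gaining a noticeable share of its vocabulary there does not.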

After analysis of the generic HLA-binding epitope dataset, the set was split by HLA subtype into HLA-DQ, HLA-DR, and HLA-DP. Each of these sets was then analyzed to determine the most common N-grams in each set, for N from 1 to 8. As seen in Figure 3, both HLA-DR and HLA-DP exhibit populations of unigrams (1-grams) similar to the generalized HLA dataset, with the four amino acids L, A, K and E appearing among the six most common for each subtype. However, HLA-DQ exhibits a drastically different breakdown, with its six most common amino acids being A, E, G, S, P and V. This would seem to imply that different properties are necessary to bind HLA-DQ than to bind other HLA types.
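The per-subtype unigram comparison can be sketched as follows. The function names are hypothetical, and the `datasets` argument is assumed to be a dictionary mapping a subtype label to a list of epitope strings; ties in frequency are broken by first appearance, which is how `Counter.most_common` behaves.

```python
from collections import Counter


def top_residues(sequences, k=6):
    """Return the k most frequent amino acids, most common first."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)  # a string updates the counter one residue at a time
    return [aa for aa, _ in counts.most_common(k)]


def shared_top_residues(datasets, k=6):
    """Residues that appear in the top k of every subtype."""
    tops = [set(top_residues(seqs, k)) for seqs in datasets.values()]
    return set.intersection(*tops) if tops else set()
```

Applied to the HLA-DR and HLA-DP sets with `k=6`, a result containing L, A, K and E would match the overlap reported above.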

Additional analysis of larger N-gram tokens is ongoing for each of these datasets, but preliminary results suggest that certain tokens occur significantly more often than random chance would dictate.
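The text does not say how "random chance" is defined. One common baseline, assumed here purely for illustration, is that residues occur independently at their observed single-residue frequencies. The sketch below reports an observed-to-expected ratio for each N-gram under that assumption. It is an enrichment score, not a significance test, and the function name is hypothetical.

```python
from collections import Counter
from math import prod


def enrichment(sequences, n):
    """Observed / expected frequency of each N-gram, where the expected
    frequency is the product of its residues' unigram frequencies."""
    unigram = Counter()
    ngram = Counter()
    for seq in sequences:
        unigram.update(seq)
        ngram.update(seq[i:i + n] for i in range(len(seq) - n + 1))
    total_uni = sum(unigram.values())
    total_ngram = sum(ngram.values())
    scores = {}
    for token, count in ngram.items():
        observed = count / total_ngram
        expected = prod(unigram[aa] / total_uni for aa in token)
        scores[token] = observed / expected
    return scores
```

Ratios well above 1 flag tokens that occur more often than the independence baseline predicts; a formal test (for example a binomial or permutation test) would still be needed before calling any of them statistically significant.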

  3. Conclusions and Future Work

The most common amino acids in molecules which bind to HLA receptors in general are L, A, K, and E, and it is evident that certain types of amino acid are favored for HLA binding over others. This is an important first step towards allowing the initial stage of MHC-II-targeting drug design to be performed in silico rather than in vitro, which could save time and money.

Determining which single amino acids are most common in HLA-binding proteins is an important first step towards developing an in silico prediction model for HLA binding. Future work will include constructing a natural language processing model which takes in a user-provided amino acid and uses the HLA datasets to predict the configuration of amino acids most likely to bind to an HLA receptor. In vitro experimentation will be conducted to verify these results and to further improve the model’s predictive capabilities.