(360s) Positive Unlabeled Learning of Peptide Properties | AIChE

Authors 

White, A., University of Rochester
Deep learning can build accurate predictive models from existing large-scale experimental data and guide the design of molecules. A major barrier, however, is that classical supervised learning frameworks require both positive and negative examples. Most peptide databases have missing information and few observations of negative examples, because such sequences are hard to obtain with high-throughput screening methods [1, 2]. To address this challenge, we exploit only the limited set of known positive examples in a semi-supervised setting and discover unknown peptide sequences that are likely to have certain antimicrobial properties. In particular, positive unlabeled (PU) learning methods [3] remove the dependence of deep learning classifiers on a negative training set, as they require only a set of positive examples (sequences with confirmed properties) along with an unlabeled set (the unknown candidate peptide sequences). We propose a two-step technique. Step 1 handles the lack of negative training examples by extracting a subset of the unlabeled data points that can be confidently labeled as negatives (i.e., reliable negatives). Step 2 trains a deep neural network classifier on the positives and the extracted reliable negatives, then applies it to the remaining pool of unlabeled examples. Our deep learning models share a common bidirectional recurrent neural network architecture, which helps capture the bidirectional dependence between N-terminal and C-terminal amino acid residues [4]. We evaluate the predictive performance of our PU learning method and show that it can achieve classification accuracy competitive with state-of-the-art positive-negative (PN) classification approaches.
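
The two-step procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes peptides have already been encoded as fixed-length feature vectors, swaps the bidirectional recurrent network for a plain logistic regression so the example stays self-contained, and uses one common heuristic for step 1 (score the unlabeled set with a positive-versus-unlabeled classifier and keep the lowest-scoring fraction as reliable negatives). The names `fit_logistic`, `two_step_pu`, and `neg_fraction`, and the synthetic data, are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Gradient-descent logistic regression (stand-in for the deep classifier)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                      # derivative of log-loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_proba(X, w, b):
    """Estimated probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def two_step_pu(X_pos, X_unl, neg_fraction=0.3):
    """Two-step PU learning sketch; neg_fraction is an assumed hyperparameter."""
    # Step 1: provisionally treat every unlabeled point as negative, then keep
    # the lowest-scoring fraction of the unlabeled set as reliable negatives.
    X = np.vstack([X_pos, X_unl])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unl))])
    w, b = fit_logistic(X, y)
    order = np.argsort(predict_proba(X_unl, w, b))
    n_neg = max(1, int(round(neg_fraction * len(X_unl))))
    rn_idx, rest_idx = order[:n_neg], order[n_neg:]

    # Step 2: train on positives vs. reliable negatives, then score the
    # remaining unlabeled pool.
    X2 = np.vstack([X_pos, X_unl[rn_idx]])
    y2 = np.concatenate([np.ones(len(X_pos)), np.zeros(n_neg)])
    w2, b2 = fit_logistic(X2, y2)
    return rn_idx, rest_idx, predict_proba(X_unl[rest_idx], w2, b2)

# Synthetic stand-in data: 50 labeled positives, and an unlabeled pool made of
# 50 hidden positives (rows 0-49) followed by 150 hidden negatives (rows 50-199).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, size=(50, 2))
X_unl = np.vstack([rng.normal(loc=2.0, size=(50, 2)),
                   rng.normal(loc=-2.0, size=(150, 2))])

rn_idx, rest_idx, rest_scores = two_step_pu(X_pos, X_unl)
```

Here `rn_idx` indexes the unlabeled points selected as reliable negatives, and `rest_scores` holds the step 2 classifier's scores for the remaining unlabeled points. In the setting of the abstract, the feature vectors would be replaced by sequence inputs and the logistic regression by the bidirectional recurrent classifier.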

References:
[1] Song, Hyebin and Bremer, Bennett J and Hinds, Emily C and Raskutti, Garvesh and Romero, Philip A, “Inferring protein sequence-function relationships with large-scale positive-unlabeled learning”, Cell Systems, 2021, 12, 92–101.

[2] Jowkar, Gholam-Hossein and Mansoori, Eghbal G, “Perceptron ensemble of graph based positive-unlabeled learning for disease gene identification”, Computational Biology and Chemistry, 2016, 64, 263–270.

[3] Bekker, Jessa and Davis, Jesse, “Learning from positive and unlabeled data: A survey”, Machine Learning, 2020, 109, 719–760.

[4] Ye, Yilin and Wang, Jian and Xu, Yunwan and Wang, Yi and Pan, Youdong and Song, Qi and Liu, Xing and Wan, Ji, “MATHLA: a robust framework for HLA-peptide binding prediction integrating bidirectional LSTM and multiple head attention mechanism”, BMC Bioinformatics, 2021, 1–12.
