(108h) High-Affinity Non-Immunoglobin Scaffolds Promoted By Sequence Mining Impactful in High-Value Therapeutics
AIChE Annual Meeting
2021 Annual Meeting
Topical Conference: Applications of Data Science to Molecules and Materials
Applications of Data Science to High Throughput Experimentation
Monday, November 8, 2021 - 2:15pm to 2:30pm
Proteins are crucial elements in a myriad of biological processes, and they have high value in therapeutics and diagnostics. The global protein engineering market has been predicted to reach $3.9 billion by 2024, highlighting the need to adopt new technologies and opportunities in the field. Contributing to this rapidly evolving and valuable effort requires a sound understanding of the general rules and limitations that govern proteins, so that their functions can ultimately be enhanced or modified by appropriate mutation(s). The 20 natural amino acids are the building blocks of proteins, joined to one another to form a relatively long chain. This represents an astronomical number of possible sequences even for a small protein of 50 amino acids (20^50 possibilities). The scarcity of functional proteins and the ease with which deleterious mutations are introduced highlight the need for auxiliary tools alongside current techniques such as rational design and directed evolution. Recently, machine learning (ML) has been strikingly impactful in protein engineering, offering unique advantages such as multitasking and parallelism, high adaptability to complex systems, and feature extraction. The intricate sequence-structure-function paradigm in proteins opens up various directions in protein engineering, from protein design to protein structure prediction. ML can therefore be applied to predict both structural and functional attributes of proteins (e.g., secondary structure, stability, and catalytic activity). The number of solved protein structures, albeit rapidly increasing, is not comparable with the growing number of sequences in protein databases. Hence, there is a growing tendency to map sequence directly to function via machine learning techniques, circumventing structural analysis and prediction.
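The size of the sequence space mentioned above can be checked with a few lines of arithmetic. The sketch below assumes every position can independently hold any of the 20 natural amino acids; the function names are illustrative and do not come from the study.

```python
# Illustrative arithmetic only: assumes each position can hold any of the
# 20 natural amino acids, independently of the other positions.
NUM_AMINO_ACIDS = 20


def sequence_space_size(length: int) -> int:
    """Number of distinct sequences of the given length."""
    return NUM_AMINO_ACIDS ** length


def single_mutant_count(length: int) -> int:
    """Number of sequences that differ from a fixed parent at exactly one position."""
    return length * (NUM_AMINO_ACIDS - 1)


size = sequence_space_size(50)
print(f"{size:.3e}")            # 1.126e+65
print(single_mutant_count(50))  # 950
```

Even the 950 single-site variants of a 50-residue protein are a vanishing fraction of the full space, which is why exhaustive screening is not an option.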
Importantly, mapping sequence to function for a protein of interest requires a supervised learning algorithm, with each protein sequence labeled with its fitness toward the desired functionality. This goal might be feasible after carrying out high-throughput wet-lab experiments that sample the protein fitness landscape by generating as much labeled data as possible. However, this route ignores a lot of useful information in unlabeled data. Despite progress in biomedical and industrial applications using carefully engineered proteins, a deep understanding of the effect of individual amino acids within a protein, and of the interconnected relationships among amino acids throughout a protein sequence, is still missing. Considering each amino acid as a single word in the language of proteins, natural language processing (NLP) is a promising route to capture these nuances (e.g., epistatic relationships and statistical patterns of amino acids). NLP models can be built on the publicly available and rapidly growing databases of annotated protein sequences in order to learn the intricacies of functional amino acid sequences in the context of a protein of interest. This provides the opportunity to then build supervised models for fitness prediction on top of the previously trained NLP model. In this study, we implement an NLP-based method for predicting the binding affinity of small non-immunoglobulin scaffolds against biomarkers relevant to treating and diagnosing human disorders. This promising approach will give rise to improved prediction and discovery of high-fitness small scaffolds to aid existing therapeutic and diagnostic technologies.
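As a rough illustration of the "amino acids as words" idea, the sketch below tokenizes sequences one residue at a time and fits a smoothed unigram model on unlabeled sequences. This is a deliberately simple stand-in chosen for brevity, not the model used in the study. The example sequences are toy placeholders, and all function names are hypothetical.

```python
import math
from collections import Counter

# One-letter codes of the 20 natural amino acids; each is treated as one "word".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence):
    """Map each residue to an integer id, one token per amino acid."""
    return [TOKEN_ID[aa] for aa in sequence.upper()]


def fit_unigram_log_probs(unlabeled_sequences):
    """Estimate per-token log-probabilities from unlabeled sequences (add-one smoothing)."""
    counts = Counter()
    for seq in unlabeled_sequences:
        counts.update(tokenize(seq))
    total = sum(counts.values()) + len(AMINO_ACIDS)
    return [math.log((counts[i] + 1) / total) for i in range(len(AMINO_ACIDS))]


def mean_log_likelihood(sequence, log_probs):
    """Average per-residue log-likelihood of a sequence under the unigram model."""
    tokens = tokenize(sequence)
    return sum(log_probs[t] for t in tokens) / len(tokens)


# Toy placeholder sequences, not real scaffold data.
unlabeled = ["ACDKLLA", "ACDKLVA", "AGDKLLA"]
log_probs = fit_unigram_log_probs(unlabeled)

print(tokenize("ACD"))  # [0, 1, 2]
# A sequence resembling the unlabeled set scores higher than an unseen-residue one.
print(mean_log_likelihood("ACDKLLA", log_probs) > mean_log_likelihood("WWWWWWW", log_probs))  # True
```

A score like this, learned without labels, could serve as one input feature to a supervised fitness model. A context-aware language model would replace the unigram counts in practice, since a unigram model cannot capture epistatic relationships between positions.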