(218f) Using Machine Learning Approaches to Estimate Enzyme Kinetic Parameters
AIChE Annual Meeting
2022 Annual Meeting
Topical Conference: Applications of Data Science to Molecules and Materials
Applications of Data Science in Molecular Sciences II
Monday, November 14, 2022 - 4:45pm to 5:00pm
The catalytic turnover number (kcat) is a key kinetic property of an enzyme, defined as the maximal number of substrate molecules converted to product per active site per unit time. Accurate and generalizable methods for estimating enzyme kinetic parameters, such as the one presented here, can be invaluable for applications ranging from metabolic modeling to enzyme re-engineering. Enzyme databases such as BRENDA (Chang et al., 2021) contain a repository of turnover numbers measured in vitro. However, the available data are noisy because experimental conditions vary and several entries lack proper annotations. We curated a dataset of ~6,000 turnover numbers from BRENDA by applying several quality filters. Using this dataset, we trained a convolutional neural network (CNN) model to learn amino acid embeddings that accurately estimate kcat values when used as enzyme features, with Morgan fingerprints as substrate features. The trained model achieved an average Pearson correlation coefficient of 0.78 (standard deviation 0.008) and an average root mean squared error (RMSE) of 0.96 (standard deviation 0.061) in 5-fold cross-validation. An RMSE of 0.96 on the log scale corresponds to less than an order of magnitude of error on the linear scale, which is low compared to the overall range of kcat values (1E-06 to 1E+07). The low standard deviations across cross-validation folds suggest that training is robust and generalizes across the entire dataset. The success of our model can be attributed to the ability of CNNs to extract complex local patterns of amino acid residues that may be responsible for actual enzyme-substrate interactions at the molecular level. A comparison of our model with existing methods, along with its current limitations and avenues for improvement, will be discussed.
In particular, we will discuss the use of state-of-the-art protein language model embeddings as features and the use of graph-based architectures that overcome the limitations of CNNs and provide more meaningful insights. ML models trained to accurately capture the mapping between amino acid mutations and changes in turnover numbers can be used to guide directed evolution and/or targeted enzyme engineering approaches.
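The evaluation figures quoted in the abstract (Pearson correlation and RMSE on log-scale kcat, averaged over 5 folds) can be illustrated with a minimal sketch. This is not the authors' code: the helper names (`pearson`, `rmse`, `kfold_indices`, `summarize`, `fold_error`) are hypothetical, the data are synthetic, and a base-10 logarithm is assumed because the abstract speaks of orders of magnitude. In a real cross-validation run, the predictions for each fold would come from a model trained on the remaining folds; here a noisy copy of the targets stands in for them.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(statistics.fmean([(a - b) ** 2 for a, b in zip(x, y)]))


def kfold_indices(n, k=5, seed=0):
    """Shuffle range(n) and split it into k nearly equal, disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def summarize(y_true, y_pred, k=5, seed=0):
    """Score each fold, then return (mean, stdev) for Pearson r and for RMSE."""
    rs, errs = [], []
    for fold in kfold_indices(len(y_true), k, seed):
        t = [y_true[i] for i in fold]
        p = [y_pred[i] for i in fold]
        rs.append(pearson(t, p))
        errs.append(rmse(t, p))
    return ((statistics.fmean(rs), statistics.stdev(rs)),
            (statistics.fmean(errs), statistics.stdev(errs)))


def fold_error(rmse_log10):
    """Multiplicative error on the linear scale implied by an RMSE in log10 units."""
    return 10 ** rmse_log10


# Synthetic stand-in data (assumption): log10(kcat) targets spanning the range
# quoted in the abstract, and noisy copies playing the role of held-out predictions.
rng = random.Random(1)
y_true = [rng.uniform(-6, 7) for _ in range(500)]
y_pred = [v + rng.gauss(0, 1.0) for v in y_true]

(r_mean, r_sd), (e_mean, e_sd) = summarize(y_true, y_pred)
print(f"Pearson r: {r_mean:.2f} (sd {r_sd:.3f})")
print(f"RMSE (log10 units): {e_mean:.2f} (sd {e_sd:.3f})")
print(f"Fold error implied by RMSE 0.96: {fold_error(0.96):.1f}x")
```

Under the log10 assumption, an RMSE of 0.96 translates to a typical multiplicative error of roughly 9-fold, which is just under one order of magnitude and small relative to the 13 orders of magnitude spanned by the reported kcat range.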
References:
- Chang, Antje et al. “BRENDA, the ELIXIR core data resource in 2021: new developments and updates.” Nucleic Acids Research vol. 49, D1 (2021): D498-D508.