(346bl) Understanding the Chemical Machine Learning Design Space Using a Property Graph Database
AIChE Annual Meeting
2020
2020 Virtual AIChE Annual Meeting
Computational Molecular Science and Engineering Forum
Poster Session: Computational Molecular Science and Engineering Forum (CoMSEF)
Wednesday, November 18, 2020 - 8:00am to 9:00am
Machine learning has been used extensively to predict molecular properties and design molecules [1-6]. When setting up a machine learning model, the modeler must make a series of decisions about how the model will operate: the learning algorithm (e.g., random forest vs. neural network), the featurization method (how the molecules in the dataset are described to the learning algorithm), the training set size, the hyperparameter values, the validation method, and so on. These decisions are not independent, and together they affect the cost and efficacy of the model. In this work, we trained a series of machine learning models spanning a wide range of these choices. For example, one instance could be a random forest model with 10-fold cross-validation and Morgan molecular fingerprint featurization to predict logP. Each model was used to predict molecular properties and was evaluated based on error and prediction uncertainty. The model parameters, performance, and molecular datasets were stored in a property graph database (PGDB). Graph topology algorithms were used to identify the model features, including molecular fragments, that most impact a model's performance. The PGDB enhances the explainability of machine learning models by enabling visualization and efficient queries of relationships between modeling choices, data, and model performance.
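To make the idea of storing modeling choices and performance in a property graph more concrete, here is a minimal sketch in plain Python. It is not the authors' schema or database: the `PropertyGraph` class, the node labels (`ModelRun`, `Algorithm`, `Featurizer`), the relationship names (`USES_ALGORITHM`, `USES_FEATURIZER`), and the helper `mean_rmse_by` are all hypothetical, and the error values are placeholders invented for illustration, not results from this work. A real PGDB would answer the same kind of question with a graph query rather than a Python loop.

```python
from collections import defaultdict
from statistics import mean


class PropertyGraph:
    """Tiny in-memory stand-in for a property graph (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = {}  # node_id -> {"label": str, "props": dict}
        self.edges = []  # (source_id, relationship_type, target_id)

    def add_node(self, node_id, label, **props):
        # Create the node on first use; later calls only merge in new properties.
        node = self.nodes.setdefault(node_id, {"label": label, "props": {}})
        node["props"].update(props)
        return node_id

    def add_edge(self, source, rel_type, target):
        self.edges.append((source, rel_type, target))

    def neighbors(self, source, rel_type):
        return [t for s, r, t in self.edges if s == source and r == rel_type]


# Placeholder runs: (algorithm, featurizer, training set size, RMSE).
# These numbers are made up for the example and carry no scientific meaning.
PLACEHOLDER_RUNS = [
    ("random_forest", "morgan_fingerprint", 100, 0.5),
    ("random_forest", "descriptor_set", 100, 0.75),
    ("neural_network", "morgan_fingerprint", 100, 1.0),
    ("neural_network", "descriptor_set", 100, 1.25),
]


def build_graph(runs):
    """Store each run as a node linked to shared nodes for its modeling choices."""
    graph = PropertyGraph()
    for i, (algo, feat, size, rmse) in enumerate(runs):
        run = graph.add_node(f"run:{i}", "ModelRun", train_size=size, rmse=rmse)
        algo_node = graph.add_node(f"algo:{algo}", "Algorithm", name=algo)
        feat_node = graph.add_node(f"feat:{feat}", "Featurizer", name=feat)
        graph.add_edge(run, "USES_ALGORITHM", algo_node)
        graph.add_edge(run, "USES_FEATURIZER", feat_node)
    return graph


def mean_rmse_by(graph, rel_type):
    """Average the RMSE of all runs that share the same choice along one relationship."""
    grouped = defaultdict(list)
    for node_id, node in graph.nodes.items():
        if node["label"] != "ModelRun":
            continue
        for target in graph.neighbors(node_id, rel_type):
            choice = graph.nodes[target]["props"]["name"]
            grouped[choice].append(node["props"]["rmse"])
    return {choice: mean(values) for choice, values in grouped.items()}


graph = build_graph(PLACEHOLDER_RUNS)
print(mean_rmse_by(graph, "USES_ALGORITHM"))
print(mean_rmse_by(graph, "USES_FEATURIZER"))
```

Because each modeling choice is a shared node rather than a column value, the same traversal works for any relationship type, which is one plausible reason a graph layout suits interdependent design decisions.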