(346bl) Understanding the Chemical Machine Learning Design Space Using a Property Graph Database | AIChE

Authors 

Luxon, A. - Presenter, Virginia Commonwealth University
Le, Q., Virginia Commonwealth University
Ferri, J. K., Virginia Commonwealth University
McQuade, T., Virginia Commonwealth University
Machine learning has been used extensively to predict molecular properties and design molecules [1-6]. When initializing a machine learning model, the modeler must make a series of decisions about how the model will operate: the learning algorithm (e.g., random forest vs. neural network), the featurization method (how the molecules in the data set are described to the learning algorithm), the training set size, the hyperparameter values, the validation method, and so on. These decisions are not independent, and they affect both the cost and the efficacy of the machine learning model. In this work, we trained a series of machine learning models spanning a wide range of these parameters. For example, one instance could be a random forest model with 10-fold cross-validation and Morgan molecular fingerprint featurization to predict logP. Each model was used to predict molecular properties and was evaluated based on error and prediction uncertainty. The model parameters, performance, and molecular datasets were stored in a property graph database (PGDB). Graph topology algorithms were used to identify model features, including molecular fragments, that most impact a model’s performance. The PGDB enhances the explainability of machine learning models by enabling visualization and efficient queries of relationships between modeling choices, data, and model performance.
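
As a rough illustration of the idea described above, the sketch below enumerates a small design space and stores each model run in an in-memory property graph, with one node per modeling choice and typed edges from each run to the choices it uses. This is not the authors' implementation: the axis values other than the random forest, neural network, Morgan fingerprint, 10-fold cross-validation, and logP named in the abstract, as well as the node labels, relationship names, and the helper functions `build_property_graph` and `runs_using`, are all hypothetical. A plain dictionary and edge list stand in for a real graph database, and no performance numbers are included.

```python
from itertools import product

# Hypothetical design-space axes. "descriptor_vector" and "holdout" are
# illustrative assumptions, not choices confirmed by the abstract.
DESIGN_SPACE = {
    "algorithm": ["random_forest", "neural_network"],
    "featurization": ["morgan_fingerprint", "descriptor_vector"],
    "validation": ["10_fold_cv", "holdout"],
    "target": ["logP"],
}


def build_property_graph(design_space):
    """Return (nodes, edges) for a small in-memory property graph."""
    nodes = {}  # node_id -> {"label": str, "props": dict}
    edges = []  # (source_id, relationship_type, target_id)

    # One node per modeling choice, e.g. "algorithm:random_forest".
    for axis, values in design_space.items():
        for value in values:
            nodes[f"{axis}:{value}"] = {"label": axis, "props": {"name": value}}

    # One node per model run, linked to every choice that run uses.
    axes = list(design_space)
    for i, combo in enumerate(product(*design_space.values())):
        run_id = f"run:{i}"
        # Performance is left unset; it would be filled in after training.
        nodes[run_id] = {"label": "model_run", "props": {"error": None}}
        for axis, value in zip(axes, combo):
            edges.append((run_id, f"USES_{axis.upper()}", f"{axis}:{value}"))
    return nodes, edges


def runs_using(edges, choice_id):
    """Query: all model runs connected to a given modeling choice."""
    return sorted(src for src, _, dst in edges if dst == choice_id)


nodes, edges = build_property_graph(DESIGN_SPACE)
morgan_runs = runs_using(edges, "featurization:morgan_fingerprint")
```

Because every run points at shared choice nodes, a question such as "which runs used Morgan fingerprints?" becomes a one-hop traversal rather than a scan over separate result tables. In a real property graph database the same pattern would be expressed in the database's query language, and the `error` property would hold measured values.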