(169cx) Developing an Open-Source Tool for Generating Rich and Consistent Sigma Profiles | AIChE

Authors 

Salih, F. - Presenter, Texas A&M University at Qatar
The field of computational materials discovery using machine learning has advanced considerably through novel neural network architectures, advanced training protocols, and novel molecular descriptors. The last of these, however, still has an unsolved issue: all the commonly used molecular descriptors (SMILES strings, graphs, positional coordinates, etc.) depend on molecule size. Most deep learning models require a fixed input size, which must therefore be set to the size of the largest molecule in the dataset. As a result, the network ends up with many unnecessary parameters and requires a larger training dataset. Additionally, a neural network that may have taken months to train will have an upper limit on the size of molecule it can be used with in the future.

As an alternative, sigma profiles have been proposed as a universal molecular descriptor. A sigma profile is an un-normalized histogram of the surface screening charge distribution of a molecule when embedded in a continuum solvent with a very high dielectric constant. Sigma profiles therefore contain valuable information about the possible inter- and intra-molecular interactions the molecule can experience. They have been used successfully as inputs to convolutional neural networks and Gaussian processes to predict bulk material properties (e.g. boiling point, density, aqueous solubility, etc.) with small datasets of fewer than 1500 molecules. Furthermore, because they are histograms, their size is independent of molecule size: the sigma profiles for methane and C30 both have the same number of bins. The main reasons sigma profiles have not been used as molecular descriptors in more machine learning applications so far are the lack of an open-source tool for generating them and the publication restrictions imposed by commercially available tools.
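The size-independence property can be illustrated with a minimal sketch. It is not the tool described in this abstract: the function name `sigma_profile`, the 51-bin grid, the ±0.025 e/Å² range, and the synthetic segment data are all assumptions made for illustration. The sketch assumes each surface segment carries a screening charge density and an area, and builds the profile as an area-weighted histogram.

```python
import numpy as np

def sigma_profile(sigma, area, n_bins=51, sigma_max=0.025):
    """Area-weighted, un-normalized histogram of segment charge densities.

    sigma: screening charge density of each surface segment (assumed e/A^2)
    area:  surface area of each segment (assumed A^2)
    n_bins and sigma_max are illustrative choices, not values from the abstract.
    """
    sigma = np.asarray(sigma, dtype=float)
    area = np.asarray(area, dtype=float)
    # Clip outliers into the end bins so the total surface area is conserved.
    sigma = np.clip(sigma, -sigma_max, sigma_max)
    edges = np.linspace(-sigma_max, sigma_max, n_bins + 1)
    profile, _ = np.histogram(sigma, bins=edges, weights=area)
    return profile

# Synthetic stand-ins for a small and a large molecule (not real COSMO output).
rng = np.random.default_rng(0)
small_sigma = rng.normal(0.0, 0.003, size=80)
small_area = np.full(80, 0.5)
large_sigma = rng.normal(0.0, 0.003, size=900)
large_area = np.full(900, 0.5)

p_small = sigma_profile(small_sigma, small_area)
p_large = sigma_profile(large_sigma, large_area)

# Both descriptors have the same length regardless of segment count.
print(p_small.shape, p_large.shape)
```

Because the profile is un-normalized, its sum equals the total surface area of the molecule, so size information is retained in the bin heights rather than in the length of the vector.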

This work summarizes the development of an open-source Python tool for generating sigma profiles to enable their use in large-scale materials discovery. The work also addresses the effect of conformers on sigma profiles to ensure the generated profiles are consistent and rich in information. The quality, or "richness", of the sigma profiles produced was benchmarked against datasets from the literature by comparing the performance of Gaussian process models at predicting material properties using the different sigma profile sources.