(656e) Developing an ML Algorithm to Predict the Aqueous Solubility of Polymers and Organic Compounds
AIChE Annual Meeting
2022 Annual Meeting
Computational Molecular Science and Engineering Forum
Machine Learning for Soft Materials II
Thursday, November 17, 2022 - 4:45pm to 5:00pm
The molecular fingerprints used in this study fall into two categories: path-based and circular fingerprints. Topological, or path-based, fingerprints encode combinations of atom types and the paths between them. In this type of fingerprint, fragments of the molecule are generated by following a path up to a certain number of bonds within the molecule. Path-based fingerprints hash all branched and linear molecular subgraphs up to a particular size by combining atom types, the atomic number, and aromaticity state with bond types. The Daylight fingerprint is the best-known example of a path-based fingerprint, and the RDKit fingerprint is a close relative of it. In this study, a maximum path length of five (RDK5) was used.
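The path-following idea can be illustrated with a minimal sketch. This is a toy illustration, not the RDKit or Daylight implementation: the function names `linear_paths` and `fold_to_bits`, the ethanol-like toy graph, the SHA-256 hash, and the 2048-bit length are all assumptions made for the example, and only linear paths are enumerated (branched subgraphs are left out for brevity).

```python
import hashlib

def linear_paths(atoms, bonds, max_bonds):
    """Enumerate canonical linear paths containing 1..max_bonds bonds."""
    # Build an adjacency list: atom index -> [(neighbor index, bond symbol)].
    adj = {i: [] for i in range(len(atoms))}
    for (a, b), bond in bonds.items():
        adj[a].append((b, bond))
        adj[b].append((a, bond))

    found = set()

    def walk(visited, tokens):
        if len(visited) > 1:
            # A path and its reverse describe the same fragment; keep one form.
            found.add(min(tuple(tokens), tuple(reversed(tokens))))
        if len(visited) - 1 == max_bonds:
            return  # maximum path length reached
        for nxt, bond in adj[visited[-1]]:
            if nxt not in visited:  # never revisit an atom within one path
                walk(visited + [nxt], tokens + [bond, atoms[nxt]])

    for start in range(len(atoms)):
        walk([start], [atoms[start]])
    return found

def fold_to_bits(paths, n_bits=2048):
    """Hash each path to one position of a fixed-length bit vector."""
    bits = [0] * n_bits
    for path in paths:
        digest = hashlib.sha256("".join(path).encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits

# Hypothetical toy molecule: the heavy atoms of ethanol, C-C-O.
atoms = ["C", "C", "O"]
bonds = {(0, 1): "-", (1, 2): "-"}

paths = linear_paths(atoms, bonds, max_bonds=5)
fingerprint = fold_to_bits(paths)
print(sorted(paths), sum(fingerprint))
```

Lowering `max_bonds` shrinks the set of fragments, which is the role the maximum path length plays in the description above.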
Circular fingerprints are generated by considering the "circular" environment of each atom up to a given "radius" or "diameter" from the central atom. The Morgan fingerprint, also known as the extended-connectivity fingerprint (ECFP), is the most popular circular fingerprint; it records the presence of specific circular substructures around each atom in a molecule. ECFPs build on the Morgan algorithm, a method for identifying identical molecules that have different atom numberings, and encode each atom by properties such as the number of heavy-atom neighbors, the number of hydrogens, the isotope, and ring information. ECFP variants differ in the maximum diameter, counted in bonds, of the circular atom neighborhood, and the digit at the end of the name gives the maximum diameter used to generate the fingerprint. In this study, circular fingerprints with diameters of 4 and 6, i.e., ECFP4 and ECFP6, were used.
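The iterative neighborhood update behind circular fingerprints can be sketched as follows. This is a simplified, Morgan-style toy and not the actual ECFP algorithm: the function name `circular_identifiers`, the SHA-256 hash, and the initial atom invariant (element plus heavy-atom degree only) are assumptions for illustration, and the duplicate-substructure removal step of real ECFPs is omitted. A diameter of 4 corresponds to a radius of 2, and a diameter of 6 to a radius of 3.

```python
import hashlib

def _hash(obj):
    """Deterministic 32-bit integer hash of a small Python object."""
    return int.from_bytes(hashlib.sha256(repr(obj).encode()).digest()[:4], "big")

def circular_identifiers(atoms, bonds, radius):
    """Collect atom-environment identifiers for layers 0..radius."""
    adj = {i: [] for i in range(len(atoms))}
    for (a, b), bond in bonds.items():
        adj[a].append((b, bond))
        adj[b].append((a, bond))

    # Layer 0: each atom is described by its element and heavy-atom degree.
    ids = [_hash((atoms[i], len(adj[i]))) for i in range(len(atoms))]
    collected = set(ids)

    for layer in range(1, radius + 1):
        new_ids = []
        for i in range(len(atoms)):
            # Sorting the neighbor list makes the result independent of
            # the order in which atoms happen to be numbered.
            env = sorted((bond, ids[j]) for j, bond in adj[i])
            new_ids.append(_hash((layer, ids[i], env)))
        ids = new_ids
        collected.update(ids)
    return collected

# Hypothetical toy molecule (heavy atoms of ethanol) in two atom numberings.
ecfp4_like = circular_identifiers(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"}, radius=2)
renumbered = circular_identifiers(["O", "C", "C"], {(0, 1): "-", (1, 2): "-"}, radius=2)
ecfp6_like = circular_identifiers(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"}, radius=3)

print(ecfp4_like == renumbered)
```

Because each identifier depends only on an atom's own properties and its sorted neighbor environment, the two numberings of the same molecule produce the same identifier set, and the larger radius yields a superset of the smaller one.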
We trained Random Forest (RF) regressors on ~2000 organic and polymeric compounds and obtained average test R2 values of about 0.91, 0.81, and 0.82 for molecular descriptors, path-based fingerprints, and circular fingerprints, respectively. The most important features of each representation and their impact on aqueous solubility were also investigated.
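For readers unfamiliar with the metric, the R2 (coefficient of determination) reported above can be computed as in the following minimal sketch. The function name `r2_score` and the solubility values are hypothetical and chosen only for illustration; they are not data from this study.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out solubility values and model predictions.
y_test = [-1.0, -2.0, -3.0, -4.0]
y_hat = [-1.2, -1.9, -3.3, -3.8]

print(round(r2_score(y_test, y_hat), 3))
```

A value of 1 means the predictions match the held-out values exactly, while a model that always predicts the mean of the test values scores 0.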