(291a) Naming, Classifying, and Comparing Polymers in the Era of Data Science | AIChE

Authors 

Olsen, B. - Presenter, Massachusetts Institute of Technology
Describing the chemistry of polymer materials is difficult because a polymer is an ensemble of molecules assembled by stochastic reactions, so it does not fit neatly into frameworks that developed largely around molecules with deterministic structure. The advent of data science has made this challenge more pressing: taking full advantage of new algorithms, data models, and analysis tools requires schemes for naming and processing polymer structures that are interoperable between humans and machines. In the small molecule literature, many of these challenges have been addressed by line notations and associated chemoinformatics tools, and extensions to polymers promise a similarly large impact on our capabilities.

Recently, we developed BigSMILES, a stochastic line notation that captures polymer structures in a way directly analogous to chemical structure drawings while retaining all the advantages of, and full compatibility with, the SMILES small molecule line notation. However, BigSMILES, like chemical structure drawings, only defines the set of possible molecules. To define their probabilities, characterization data is necessary. To address this, we have put forward the PolyDAT schema, which links characterization data to line notation and thereby provides a complete chemical definition of a polymer. Together, these tools make it possible to address several open challenges. First, we demonstrate how polymer structures can be canonicalized, both using empirical rules and through analogy to automata in computer science. Second, we show how BigSMILES can be used to drive polymer vectorization. Third, we show how BigSMILES can form the basis of polymer similarity comparisons.
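As a rough illustration of how a line notation can feed a similarity comparison, the sketch below pulls the repeat units out of simple BigSMILES strings and compares two polymers by the Jaccard index of their repeat-unit sets. This is not the method presented in the talk: the function names `repeat_units` and `unit_jaccard` are hypothetical, the parser assumes non-nested stochastic objects with simple bonding descriptors such as `[$]`, `[<]`, and `[>]`, and fragments are compared as raw strings. Because the strings are not canonicalized, `CC` and `C(C)` would count as different units, which is one reason the canonicalization step described above matters.

```python
import re


def repeat_units(bigsmiles: str) -> list[set[str]]:
    """Return one set of repeat-unit fragments per stochastic object {...}.

    Assumption: stochastic objects are not nested, and end groups (if any)
    follow a semicolon inside the braces.
    """
    result = []
    for body in re.findall(r"\{([^{}]*)\}", bigsmiles):
        # Keep only the repeat-unit list, dropping any end-group list.
        units_part = body.split(";", 1)[0]
        units = set()
        for unit in units_part.split(","):
            # Strip bonding descriptors: [], [$], [<], [>], [$1], [<2], ...
            core = re.sub(r"\[[$<>]?\d*\]", "", unit).strip()
            if core:
                units.add(core)
        result.append(units)
    return result


def unit_jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard index over the union of repeat-unit strings."""
    ua = set().union(*repeat_units(a))
    ub = set().union(*repeat_units(b))
    if not ua and not ub:
        return 1.0  # two strings with no repeat units are treated as identical
    return len(ua & ub) / len(ua | ub)


# Illustrative inputs: polystyrene and a styrene/ethylene random copolymer.
polystyrene = "{[][$]CC(c1ccccc1)[$][]}"
copolymer = "{[][$]CC(c1ccccc1)[$],[$]CC[$][]}"

print(repeat_units(polystyrene))
print(unit_jaccard(polystyrene, copolymer))
```

A set-based score like this ignores composition, sequence statistics, and topology. Those are the quantities that characterization data would have to supply, so the sketch should be read only as a starting point for thinking about structure-based comparison.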

Extending the initial BigSMILES grammar, we have also developed BigSMARTS, an extension of SMARTS that enables searching of polymer structures. We have further demonstrated that BigSMILES is compatible with the concepts put forth in SELFIES, enabling polymers to be written in a way that makes them more amenable to use in genetic algorithms. Finally, the stochastic nature of BigSMILES makes it inherently compatible with non-covalent bonds, an advantage over deterministic line notations. We use this feature to extend BigSMILES to a wide variety of molecular constructs useful in colloidal and supramolecular materials.
