(169bz) Accelerating Polymer Informatics Via Polymer Similarity | AIChE

Authors 

Shi, J. - Presenter, University of Notre Dame
Audus, D. J., University of California, Santa Barbara
Olsen, B., Massachusetts Institute of Technology
Developing accurate algorithms for quantifying similarity between chemical entities is an essential task in polymer informatics. Such algorithms underpin functions such as regression to predict polymer properties, classification to identify promising candidates, and ranking of polymer database search results. Unlike small molecules with well-defined molecular structures and biomacromolecules with specific sequences, synthetic polymers are more complex, typically characterized by ensembles of polymer chains that exhibit variations in length, sequence, composition, topology, and quantity. Since the existing similarity methods used for small molecules and sequence-defined biomacromolecules require well-defined deterministic structures, they cannot be directly applied to polymer similarity calculations. Therefore, despite its importance, pairwise chemical similarity for polymers remains an open problem.

Even when the exact distributions are not known, the similarity between two synthetic polymers can still be computed. First, each polymer is represented by BigSMILES, a structurally based line notation for describing macromolecules. Then, the BigSMILES string is canonicalized and converted to a stochastic graph representation. Next, this graph is separated into three parts: repeat units, end groups, and polymer topology. The earth mover's distance is used to calculate the similarity of the repeat units and end groups, while the graph edit distance is used to calculate the similarity of the topology. These three values can be linearly or nonlinearly combined to yield an overall pairwise chemical similarity score for polymers that is largely consistent with the chemical intuition of expert users and is adjustable based on the relative importance of different chemical features for a given similarity problem. [Shi et al. Macromolecules 2023, 56, 18, 7344–7357]
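The final combination step can be illustrated with a minimal Python sketch. It assumes the three component similarities (repeat units, end groups, topology) have already been computed and lie in [0, 1]. The function names, the default weights, and the choice of a weighted geometric mean as the nonlinear option are illustrative assumptions, not the formulas used in the cited paper.

```python
import math


def combine_linear(s_repeat, s_end, s_topo, weights=(1.0, 1.0, 1.0)):
    """Weighted arithmetic mean of the three component similarities.

    The weights are hypothetical tuning knobs expressing how much each
    chemical feature matters for a given similarity problem.
    """
    scores = (s_repeat, s_end, s_topo)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


def combine_geometric(s_repeat, s_end, s_topo, weights=(1.0, 1.0, 1.0)):
    """Weighted geometric mean, one possible nonlinear combination.

    A single low component lowers the overall score more strongly
    than it does in the arithmetic mean.
    """
    scores = (s_repeat, s_end, s_topo)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # A zero similarity with nonzero weight forces the product to zero.
    if any(s == 0 and w > 0 for w, s in zip(weights, scores)):
        return 0.0
    log_sum = sum(w * math.log(s) for w, s in zip(weights, scores) if w > 0)
    return math.exp(log_sum / total)


# Made-up component scores for illustration only.
linear_score = combine_linear(0.9, 0.5, 1.0)
geometric_score = combine_geometric(0.9, 0.5, 1.0)
repeat_only = combine_linear(0.9, 0.5, 1.0, weights=(1.0, 0.0, 0.0))
```

Setting a weight to zero removes that feature from the comparison, which is one simple way to make the score adjustable to the problem at hand.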

When a polymer is represented as an ensemble whose distributions are characterized, the earth mover's distance is proposed as the metric for calculating the pairwise similarity score between two polymer ensembles. Its power is illustrated in four examples: two-chain copolymer ensembles, first-order Markov linear copolymer ensembles, nonlinear star polymer ensembles with varying arm length, topology, and composition, and polymer ensembles represented by molecular mass distributions. These examples demonstrate that the earth mover's distance captures differences neglected by the average method and offers greater resolution of chemical distinctions between polymer ensembles. Without supervision, the earth mover's distance gives a quantitative and reliable calculation of pairwise similarity between two polymer ensembles. [Shi et al. ACS Polymers Au 2024, 4, 1, 66–76]
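The molecular mass case lends itself to a small worked example. The sketch below computes the earth mover's distance between two discrete one-dimensional distributions as the area between their cumulative distribution functions, which is the standard closed form in one dimension. The function names and the two toy ensembles are hypothetical, and the conversion from a distance to a similarity score used in the cited paper is not reproduced here.

```python
def mean_mass(masses, weights):
    """Weighted mean of a discrete molecular mass distribution."""
    return sum(m * w for m, w in zip(masses, weights)) / sum(weights)


def emd_1d(masses_a, weights_a, masses_b, weights_b):
    """Earth mover's distance between two discrete 1D distributions.

    Weights are normalized to sum to 1, so the result has the same
    units as the masses.
    """
    def to_pmf(masses, weights):
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        pmf = {}
        for m, w in zip(masses, weights):
            pmf[m] = pmf.get(m, 0.0) + w / total
        return pmf

    pmf_a = to_pmf(masses_a, weights_a)
    pmf_b = to_pmf(masses_b, weights_b)
    points = sorted(set(pmf_a) | set(pmf_b))

    cdf_a = cdf_b = 0.0
    distance = 0.0
    # Integrate |CDF_a - CDF_b| over each interval between support points.
    for left, right in zip(points, points[1:]):
        cdf_a += pmf_a.get(left, 0.0)
        cdf_b += pmf_b.get(left, 0.0)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


# Toy ensembles (masses in kg/mol): one monodisperse, one bimodal.
masses_a, weights_a = [50.0], [1.0]
masses_b, weights_b = [20.0, 80.0], [1.0, 1.0]

same_mean = mean_mass(masses_a, weights_a) == mean_mass(masses_b, weights_b)
distance_ab = emd_1d(masses_a, weights_a, masses_b, weights_b)
distance_aa = emd_1d(masses_a, weights_a, masses_a, weights_a)
```

In this toy case both ensembles have the same mean mass, so a comparison based on averages alone would treat them as identical, while the earth mover's distance between them is nonzero.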

Our similarity methods can handle sequential, compositional, molecular mass, and topological differences, which traditional methods typically ignore entirely or in part. This enables distinctions between ensembles that would otherwise be erroneously judged identical. In conclusion, our similarity methodology represents a critical advancement in the quantitative calculation of polymer similarity and accelerates the progress of polymer cheminformatics, including applications such as property prediction and classification.