(362e) Calculating Pairwise Similarity of Polymer Ensembles Via Earth Mover's Distance | AIChE

Authors 

Shi, J. - Presenter, University of Notre Dame
Audus, D. J., University of California, Santa Barbara
Olsen, B., Massachusetts Institute of Technology
Synthetic polymers, in contrast to small molecules with well-defined molecular structures and biomacromolecules with defined sequences, are typically ensembles comprising polymer chains with varying lengths, sequences, compositions, topologies, and amounts. While numerous approaches exist for measuring pairwise similarity among small molecules and sequence-defined biomacromolecules, accurately determining the pairwise similarity between two polymer ensembles remains a challenge. A common method involves generating embedding vectors for each polymer chain in the ensemble, then building a single embedding vector for each ensemble by averaging every polymer chain's embedding vector and subsequently computing the similarity between these ensemble vectors, yielding a similarity score. However, this commonly used average-based approach prematurely reduces the dimensionality of the ensemble and fails to distinguish significant differences between polymer ensembles in some scenarios.
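
The limitation of averaging can be sketched with a toy case. The example below is illustrative only and is not taken from the authors' work: the composition-based embedding (fraction of each monomer in a chain), the monomer labels A and B, and the helper names are all assumptions chosen for brevity. It compares a 50/50 blend of two homopolymers with a single alternating copolymer, two ensembles that are chemically distinct yet share the same average composition.

```python
import numpy as np

def composition_embedding(chain, monomers="AB"):
    # Hypothetical toy embedding: fraction of each monomer type in the chain.
    return np.array([chain.count(m) / len(chain) for m in monomers])

def ensemble_average(chains, weights):
    # Weighted mean of the per-chain embeddings (the "average" approach).
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    E = np.stack([composition_embedding(c) for c in chains])
    return w @ E

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Ensemble P: equal blend of two homopolymers. Ensemble Q: one alternating copolymer.
P_avg = ensemble_average(["AAAA", "BBBB"], [0.5, 0.5])
Q_avg = ensemble_average(["ABAB"], [1.0])

# Both averages are [0.5, 0.5], so the ensembles look identical after averaging.
average_similarity = cosine_similarity(P_avg, Q_avg)
```

Under this assumed embedding the two ensemble vectors coincide, so any similarity computed from them cannot separate the blend from the copolymer.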

In this work, we propose the earth mover's distance as a metric for calculating the pairwise similarity score between two polymer ensembles. First, the pairwise distance d_ij between each polymer chain p_i in ensemble P and each polymer chain q_j in ensemble Q is calculated with a commonly used single-chain similarity method, such as the percentage of sequence mismatches or the graph edit distance between the two chains, generating a distance matrix D = [d_ij]. Second, the earth mover's distance uses the distance matrix D and the relative weight of each polymer chain to quantify the similarity or dissimilarity between the two ensembles by linear optimization. We illustrate the power of the earth mover's distance for characterizing pairwise similarity between polymer ensembles with four examples: two-chain copolymer ensembles, first-order Markov linear copolymer ensembles, nonlinear star polymer ensembles with varying arm length, topology, and composition, and polymer ensembles represented by molecular mass distributions. These examples demonstrate that the earth mover's distance captures differences neglected by the averaging method and offers greater resolution of chemical distinctions between polymer ensembles. Without supervision, the earth mover's distance metric gives a quantitative and reliable calculation of the pairwise similarity between two polymer ensembles. Our methodology represents a critical advancement in the quantitative calculation of polymer ensemble similarity and accelerates the progress of cheminformatics for polymers. It is promising for applications such as search queries in polymer databases and polymer inverse design, fostering a more comprehensive understanding and utilization of polymer data.
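
The two steps described above can be sketched as a small linear program. This is a minimal illustration under stated assumptions, not the authors' implementation: chain weights are assumed to be normalized so that each ensemble sums to 1, chains are assumed to have equal length so that a mismatch fraction is well defined, and the function names are hypothetical. The flow f_ij is the amount of weight moved from chain p_i to chain q_j; the program minimizes the sum of f_ij * d_ij subject to each row of the flow summing to the weight of p_i and each column summing to the weight of q_j.

```python
import numpy as np
from scipy.optimize import linprog

def mismatch_fraction(a, b):
    # Fraction of positions where two equal-length sequences differ (assumed chain distance).
    return sum(x != y for x, y in zip(a, b)) / len(a)

def earth_movers_distance(D, wp, wq):
    # D: (m, n) distance matrix; wp, wq: chain weights, normalized here to sum to 1.
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    wp = np.asarray(wp, dtype=float) / np.sum(wp)
    wq = np.asarray(wq, dtype=float) / np.sum(wq)

    # Flow variables f_ij are flattened row-major into a vector of length m * n.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j f_ij = wp[i]
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # sum_i f_ij = wq[j]
    b_eq = np.concatenate([wp, wq])

    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun  # total flow is 1, so the minimum cost is the distance

# Toy ensembles (assumed): a 50/50 homopolymer blend versus one alternating copolymer.
P, wP = ["AAAA", "BBBB"], [0.5, 0.5]
Q, wQ = ["ABAB"], [1.0]

D_PQ = [[mismatch_fraction(p, q) for q in Q] for p in P]
D_PP = [[mismatch_fraction(p, q) for q in P] for p in P]

emd_PQ = earth_movers_distance(D_PQ, wP, wQ)
emd_PP = earth_movers_distance(D_PP, wP, wP)
```

In this toy case each homopolymer differs from the alternating chain at half of its positions, so the distance between the two ensembles is nonzero, while the distance from an ensemble to itself is zero. Any other single-chain distance, such as a graph edit distance, could be substituted when building D.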