(535h) Activity Coefficient Acquisition with Thermodynamics-Informed Active Learning for Phase Diagram Construction | AIChE

Authors 

Maginn, E., University of Notre Dame
Knowledge of the activity coefficients of all components in a mixture, particularly their composition and temperature dependences, allows for the full determination of a wide range of thermodynamic properties. However, the experimental or computational determination of activity coefficients is costly, time-consuming, and error-prone. As such, an intermediate approach is proposed in this work using Gaussian processes (GPs) and active learning. The term “intermediate” signifies a model that is not fully predictive (it requires some experimental data for any given system) but not fully correlative either (it requires less data than classical models, while being non-parametric and natively handling experimental uncertainty).

Working with synthetic data generated using the NRTL model over a wide temperature range (250-550 K) for several binary mixtures, the best procedures to describe activity coefficients using GPs are first assessed, with particular attention to how their physical bounds can be leveraged and to the choice of kernel for GP-based regression. It is found that, when the GPs are trained using pure-component virtual points and against the natural logarithm of the activity coefficients, 20 data points (equally distributed throughout the full composition and temperature windows) are sufficient to fully correlate the activity coefficients of all components in all systems. Moreover, the GP-predicted activity coefficients are found to simultaneously describe the vapor-liquid equilibrium (VLE), liquid-liquid equilibrium (LLE), and solid-liquid equilibrium (SLE) phase diagrams of all systems involved, even when equilibrium occurs at temperatures outside the region used to train the GPs (extrapolation).
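The training procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the NRTL parameters are made up, the kernel choice and length scales are assumptions, and the temperature dependence τ_ij = a_ij + b_ij/T is one common NRTL parameterization. The key ingredients from the text are kept: training on ln(γ) rather than γ, and adding virtual pure-component points where ln(γ₁) = 0 at x₁ = 1.

```python
# Hypothetical sketch: GP regression on ln(gamma_1) data generated with the
# NRTL model for a binary mixture (illustrative parameters, not from the work).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def nrtl_ln_gamma1(x1, T, a12=0.3, a21=0.5, b12=200.0, b21=300.0, alpha=0.3):
    """NRTL ln(gamma_1) with tau_ij = a_ij + b_ij/T (assumed parameterization)."""
    x2 = 1.0 - x1
    tau12, tau21 = a12 + b12 / T, a21 + b21 / T
    G12, G21 = np.exp(-alpha * tau12), np.exp(-alpha * tau21)
    return x2**2 * (tau21 * (G21 / (x1 + x2 * G21))**2
                    + tau12 * G12 / (x2 + x1 * G12)**2)

# 20 synthetic points spread over the composition and temperature windows.
x1 = np.linspace(0.05, 0.95, 5)
T = np.linspace(250.0, 550.0, 4)
X = np.array([[x, t] for t in T for x in x1])
y = nrtl_ln_gamma1(X[:, 0], X[:, 1])

# Virtual pure-component points: ln(gamma_1) -> 0 as x1 -> 1 at every T.
X_virtual = np.array([[1.0, t] for t in T])
X_train = np.vstack([X, X_virtual])
y_train = np.concatenate([y, np.zeros(len(T))])

# Anisotropic Matern kernel: separate length scales for composition and T.
gp = GaussianProcessRegressor(kernel=Matern(length_scale=[0.2, 100.0], nu=2.5),
                              normalize_y=True).fit(X_train, y_train)
mean, std = gp.predict(np.array([[0.5, 400.0]]), return_std=True)
```

Training on ln(γ) rather than γ has the practical advantage that the pure-component limit becomes an exact, noise-free anchor (ln γᵢ = 0 at xᵢ = 1), which the virtual points encode directly.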

Having demonstrated the ability of GPs to accurately correlate activity coefficient data, active learning algorithms were then developed to construct phase diagrams. The idea behind this strategy is to fit a GP to a limited amount of initial data, compute its standard deviation on a composition/temperature grid, and use a standard-deviation-based metric (known as an acquisition function or policy) to select the next points to probe experimentally or computationally. The use of active learning had a tremendous impact on the amount of data needed to describe the phase diagrams of all systems studied. This was most advantageous when target-specific workflows (error-propagation-based acquisition functions and standard deviations based on the estimated phase diagram at each active learning iteration) were designed to construct VLE and SLE phase diagrams, with many cases requiring as few as a single data point to obtain accurate descriptions. Moreover, the accuracy and performance of these phase-diagram-targeted active learning algorithms can be improved by selecting a tighter criterion for the acquisition function, at the expense of more training data being required. In fact, the ability to define a trade-off between accuracy and amount of data is a GP-specific advantage not offered by any other thermodynamic model and is particularly useful when fast screenings of multiple mixtures are desired.
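The loop described above (fit a GP, evaluate its standard deviation on a grid, query the most uncertain point, repeat until a criterion is met) can be sketched in one dimension. This is a generic maximum-uncertainty illustration, not the authors' error-propagation-based workflow; the surrogate "measurement" function, the kernel, and the stopping threshold are all made up for the example.

```python
# Hypothetical sketch of a standard-deviation-driven active learning loop.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_measurement(x):
    """Stand-in for an experiment or simulation returning ln(gamma)."""
    return np.sin(3.0 * x) * (1.0 - x)**2

grid = np.linspace(0.0, 1.0, 201).reshape(-1, 1)   # composition grid
X_train = np.array([[0.0], [1.0]])                  # minimal initial data
y_train = expensive_measurement(X_train.ravel())

for iteration in range(20):
    # Fixed kernel for simplicity; in practice hyperparameters are re-optimized.
    gp = GaussianProcessRegressor(kernel=RBF(0.2), optimizer=None,
                                  alpha=1e-8, normalize_y=True)
    gp.fit(X_train, y_train)
    _, std = gp.predict(grid, return_std=True)
    if std.max() < 0.01:                            # stopping criterion (illustrative)
        break
    x_next = grid[np.argmax(std)]                   # acquisition: max uncertainty
    X_train = np.vstack([X_train, [x_next]])
    y_train = np.append(y_train, expensive_measurement(x_next[0]))
```

Tightening the stopping threshold trades more queried data for a lower residual uncertainty, which is the accuracy/data trade-off the text highlights as a GP-specific advantage.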

Finally, the active learning algorithms were applied to experimental case studies, namely the SLE and VLE phase diagrams of deep eutectic solvents (choline chloride/urea, thymol/menthol, and thymol/TOPO), as well as single-ion activity coefficients and ternary VLE phase diagrams. Not only was active learning shown to be a valuable tool to minimize the amount of data needed to describe these datasets, but GPs were also shown to be superior to NRTL given the scarcity of the active-learning-acquired datasets and the presence of experimental uncertainty. These results also illustrated many modifications that can be undertaken to tailor both GP formulations and active learning workflows to specific computational or experimental applications, such as different GP inputs, active learning algorithms, and stopping criteria.