(84bi) Increasing Data Collection Efficiency through Incorporation of Derivative and Uncertainty Information into Gaussian Process Regression | AIChE

Molecular simulations generate vast amounts of information, including molecular-level details that, when appropriately averaged, provide estimates of structural, thermodynamic, and transport properties of materials and fluids. Determining the behavior of these properties as a function of state conditions or other adjustable simulation parameters, such as those related to the force field, requires many simulations across state/parameter space. While this is typically a costly endeavor, Gaussian Process Regression (GPR) techniques have recently shown promise as robust models for predicting property behavior over a given state/parameter space and for efficiently directing new data collection through active learning. We show that this automated data collection process can benefit significantly from standard uncertainty estimates, as well as from derivative information. It is best practice to include the former in any simulation results, while the latter is often neglected but can be obtained straightforwardly from statistical mechanical relations. As an example, we highlight the collection of equation of state data, where relationships between derivative information and property fluctuations are familiar. In this case, both simulations and experiments produce derivative information that can be beneficially incorporated into GPR models. However, we reveal that it is important to assign uncertainty estimates to this information, as the uncertainty is expected to change drastically and systematically with derivative order (e.g., uncertainties in heat capacities may be much higher than those in average energies given a fixed amount of simulation time). We provide an extensible codebase incorporating derivative information and uncertainties into GPR models and demonstrate how this can be used to more efficiently collect both simulation and experimental data.
Finally, we will demonstrate how active learning routines can be viewed as improved versions of common data collection algorithms, presenting, as an example, an adaptive, uncertainty-controlled version of Gibbs-Duhem integration.
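
To make the idea concrete, the following is a minimal sketch, not the codebase described in this abstract, of a one-dimensional GPR model that conditions on both function values and first derivatives, each with its own noise variance. It assumes a squared-exponential (RBF) kernel with fixed hyperparameters, uses sin(x) as a stand-in for a simulated property, and assigns the derivative observations a larger variance than the function values. The names `rbf_blocks` and `gp_predict` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def rbf_blocks(xa, xb, sigma2=1.0, ell=1.0):
    """Covariance blocks between f and f' for a 1D RBF kernel (assumed kernel)."""
    r = xa[:, None] - xb[None, :]
    k = sigma2 * np.exp(-0.5 * r**2 / ell**2)      # cov(f(xa),  f(xb))
    k_fd = k * r / ell**2                          # cov(f(xa),  f'(xb))
    k_df = -k * r / ell**2                         # cov(f'(xa), f(xb))
    k_dd = k * (1.0 / ell**2 - r**2 / ell**4)      # cov(f'(xa), f'(xb))
    return k, k_fd, k_df, k_dd

def gp_predict(x, y, y_var, dy, dy_var, x_new,
               sigma2=1.0, ell=1.0, use_derivs=True):
    """Posterior mean and variance of f at x_new.

    y_var and dy_var are per-observation noise variances, so values and
    derivatives can be weighted by their own uncertainty estimates.
    """
    k_ff, k_fd, k_df, k_dd = rbf_blocks(x, x, sigma2, ell)
    s_ff, s_fd, _, _ = rbf_blocks(x_new, x, sigma2, ell)
    if use_derivs:
        K = np.block([[k_ff, k_fd], [k_df, k_dd]])
        noise = np.concatenate([y_var, dy_var])
        Ks = np.hstack([s_ff, s_fd])
        targets = np.concatenate([y, dy])
    else:
        K, noise, Ks, targets = k_ff, y_var, s_ff, y
    L = np.linalg.cholesky(K + np.diag(noise))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, targets))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = sigma2 - np.sum(v**2, axis=0)
    return mean, var

# Toy data: four "simulations" of a property f(x) = sin(x) and its derivative.
x = np.linspace(0.0, 2.0 * np.pi, 4)
y, dy = np.sin(x), np.cos(x)
y_var = np.full(x.shape, 1e-4)    # assumed variance of the averaged property
dy_var = np.full(x.shape, 1e-2)   # assumed larger variance for the derivative

x_new = np.linspace(0.0, 2.0 * np.pi, 50)
mean_f, var_f = gp_predict(x, y, y_var, dy, dy_var, x_new, use_derivs=False)
mean_d, var_d = gp_predict(x, y, y_var, dy, dy_var, x_new, use_derivs=True)

# One active-learning step: propose the candidate with the largest variance.
x_next = x_new[np.argmax(var_d)]
```

With the kernel hyperparameters held fixed, conditioning on the extra derivative observations cannot increase the posterior variance, which is why fewer new state points are needed to reach a target uncertainty. In practice the hyperparameters would be optimized, and the noise variances would come from the simulations' own uncertainty estimates rather than the placeholder values used here.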