(747l) Identifying Equilibrated Simulation Trajectories with Artificial Neural Networks
AIChE Annual Meeting
2017 Annual Meeting
Computational Molecular Science and Engineering Forum
Data Mining and Machine Learning in Molecular Sciences II
Thursday, November 2, 2017 - 5:27pm to 5:39pm
Determining which microstates to include in an ensemble average is a ubiquitous simulation task, but it is underspecified, which hampers reproducibility. For a single simulation, the task is manageable: including or omitting one microstate from a sufficiently large set won't change the answer, so heuristics like "The potential energy is higher here, so we'll cut those states out" will do. The problem arises when thousands of simulation trajectories need to be analyzed, which demands efficient heuristics that are automatable and deterministic. Artificial neural networks are one type of machine learning algorithm that provides a reproducible path for applying pattern recognition heuristics. Here we use the TensorFlow machine learning library to train an artificial neural network to distinguish "equilibrated" from "not equilibrated" hypothetical observation sequences. We generate training and test populations of normally distributed observation sequences with embedded linear and exponential correlations. We train a two-neuron artificial neural network to distinguish the correlated sequences from the uncorrelated ones. We characterize the predictive capabilities of the network as a function of training population size, the number of points in each sequence, and the number of training rounds performed. We find diminishing returns on predictive accuracy beyond sample sizes of 512 points, training populations of 8000 sequences, and 2000 training rounds, a configuration that takes about three minutes to train. We find that this simple network is powerful enough to identify exponentially decaying observation sequences with over 98% accuracy in microseconds. These observations suggest neural networks as an effective tool for automating this and other heuristic bottlenecks in high-throughput simulation pipelines.
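The abstract does not give the data-generation details, so the following is only a minimal sketch of how labeled synthetic sequences of this kind might be produced. It assumes that "embedded linear and exponential correlations" means a linear or exponentially decaying drift added to unit-variance Gaussian noise. The function names (`make_sequence`, `make_population`), the drift amplitude, the decay constant, and the half-and-half class balance are all hypothetical choices for illustration, not the authors' settings. Only the Python standard library is used here; the TensorFlow training step is not shown.

```python
import math
import random


def make_sequence(n_points, kind, rng, amplitude=3.0):
    """Return one observation sequence: a drift term plus unit-variance Gaussian noise.

    kind is "equilibrated" (no drift), "linear" (drift falling linearly to zero),
    or "exponential" (drift decaying with time constant n_points / 8).
    The amplitude and time constant are illustrative assumptions.
    """
    if kind == "equilibrated":
        drift = [0.0] * n_points
    elif kind == "linear":
        drift = [amplitude * (1.0 - t / (n_points - 1)) for t in range(n_points)]
    elif kind == "exponential":
        tau = n_points / 8.0
        drift = [amplitude * math.exp(-t / tau) for t in range(n_points)]
    else:
        raise ValueError(f"unknown sequence kind: {kind}")
    return [d + rng.gauss(0.0, 1.0) for d in drift]


def make_population(n_sequences, n_points, seed=0):
    """Build a shuffled list of (sequence, label) pairs.

    Label 1 means "equilibrated", label 0 means "not equilibrated".
    Even indices are equilibrated; odd indices alternate between linear
    and exponential drift (an assumed class balance).
    """
    rng = random.Random(seed)
    drifting = ["linear", "exponential"]
    population = []
    for i in range(n_sequences):
        kind = "equilibrated" if i % 2 == 0 else drifting[(i // 2) % 2]
        label = 1 if kind == "equilibrated" else 0
        population.append((make_sequence(n_points, kind, rng), label))
    rng.shuffle(population)
    return population


# Separate seeds keep the training and test populations independent.
train = make_population(n_sequences=200, n_points=512, seed=1)
test = make_population(n_sequences=50, n_points=512, seed=2)
```

The population sizes here are kept small so the sketch runs quickly; the abstract reports diminishing returns at about 8000 training sequences of 512 points each. Each `(sequence, label)` pair could then be fed to a small classifier such as the two-neuron network described above.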