(374s) Multiple Particle Tracking Data Analyzed with Machine Learning to Predict Complex Biological Variables | AIChE

Authors 

Schimek, N. - Presenter, University of Washington
McKenna, M., University of Washington
Beck, D., University of Washington
Nance, E., University of Washington
Background: Multiple particle tracking (MPT) is a technique capable of tracking the individual motion of hundreds to thousands of particles simultaneously while maintaining single-particle resolution. MPT has been instrumental in the microrheological characterization of several biological environments, including the vitreous of the eye, mucosal membranes, intracellular environments, and the brain extracellular space. MPT has also been used to understand the behavior of nanoparticle-based therapeutics and viral vectors. A single MPT experiment can include tracking of 10^2 to 10^5 total trajectories, generating many gigabytes of data. This volume of data makes MPT a prime candidate to benefit from advances in machine learning and artificial intelligence. Early work in this space has demonstrated that machine learning can predict the motion type of nanoparticles using random forests [1], and agarose gel stiffness and in vitro cell uptake of nanoparticles using artificial neural networks [2]. More recently, a boosted decision tree model was used to predict the biological age of rodents from nanoparticle diffusion data, with SHAP values used to determine the key statistical features [3].

Current research seeks to train explainable machine learning (ML) models on data from MPT experiments to probe the effects of neurological disease and injury conditions on the brain extracellular matrix. However, there are currently no standardized, reproducible workflows for training ML models on MPT data derived from biological tissue and using explainable machine learning to connect results back to biological changes within the tissue. To address this, we develop a robust, reproducible, and modular pipeline capable of extracting biological insights from MPT datasets using explainable ML.

Methods: The initial step of the pipeline is to run the data through diff_viz, a lab-developed software package designed for visualizing MPT data. Diff_viz serves as an initial filter to ensure that the MPT data used for machine learning were collected and processed properly, and that the data are representative of particles moving through the brain extracellular space via Brownian diffusion. The mean squared displacement (<MSD>) of the nanoparticles is plotted against time for each MPT video to determine whether there are sources of static or dynamic error. Visualizations of diffusion modes across different videos and experiments ensure that the particles are not all moving superdiffusively, a sign of active transport as opposed to Brownian diffusion, or all moving subdiffusively, a sign that particles were taken up by cells. Diff_viz can also be used for exploratory data analysis, plotting the distribution of statistical features across different biological classes and using Principal Component Analysis (PCA) to plot the distribution of all data from each class in two dimensions.
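As a rough illustration of the quantities these checks rely on, the sketch below computes a time-averaged MSD for one 2D trajectory, fits the anomalous exponent alpha in MSD ~ tau^alpha, and labels the diffusion mode. It is not the diff_viz implementation: the function names, the frame interval `dt`, and the tolerance `tol` are assumptions made for illustration, and the relation D_eff = MSD / (4 tau) is the standard two-dimensional form, which may differ from what the pipeline uses. The ensemble-averaged <MSD> would further average these curves over all trajectories in a video.

```python
import math
import random

def time_averaged_msd(xs, ys, max_lag):
    """Time-averaged MSD of one 2D trajectory for lags 1..max_lag (in frames)."""
    n = len(xs)
    msd = []
    for lag in range(1, max_lag + 1):
        sq = [(xs[i + lag] - xs[i]) ** 2 + (ys[i + lag] - ys[i]) ** 2
              for i in range(n - lag)]
        msd.append(sum(sq) / len(sq))
    return msd

def anomalous_exponent(msd, dt):
    """Least-squares slope of log(MSD) against log(lag time)."""
    log_t = [math.log((k + 1) * dt) for k in range(len(msd))]
    log_m = [math.log(m) for m in msd]
    mean_t = sum(log_t) / len(log_t)
    mean_m = sum(log_m) / len(log_m)
    num = sum((t - mean_t) * (m - mean_m) for t, m in zip(log_t, log_m))
    den = sum((t - mean_t) ** 2 for t in log_t)
    return num / den

def effective_diffusion(msd_at_tau, tau):
    """Effective diffusion coefficient in 2D: D_eff = MSD / (4 * tau)."""
    return msd_at_tau / (4.0 * tau)

def diffusion_mode(alpha, tol=0.2):
    """Label a trajectory by its exponent; tol is an illustrative threshold."""
    if alpha > 1.0 + tol:
        return "superdiffusive"
    if alpha < 1.0 - tol:
        return "subdiffusive"
    return "Brownian"

dt = 0.033  # hypothetical frame interval in seconds

# Directed motion along a straight line: MSD grows as tau**2.
line_x = [0.1 * i for i in range(50)]
line_y = [0.0] * 50
alpha_line = anomalous_exponent(time_averaged_msd(line_x, line_y, 10), dt)

# Simulated 2D random walk: MSD grows roughly linearly in tau.
rng = random.Random(0)
walk_x, walk_y = [0.0], [0.0]
for _ in range(5000):
    walk_x.append(walk_x[-1] + rng.gauss(0.0, 1.0))
    walk_y.append(walk_y[-1] + rng.gauss(0.0, 1.0))
alpha_walk = anomalous_exponent(time_averaged_msd(walk_x, walk_y, 10), dt)

print(diffusion_mode(alpha_line), diffusion_mode(alpha_walk))
```

A video in which nearly every trajectory falls into one non-Brownian class would be the kind of case the diffusion-mode check is meant to flag.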

Next, we apply XGBoost (eXtreme Gradient Boosting), a boosted decision tree model, to train on the data and generate predictions. An XGBoost model trained on all features is compared to a baseline model trained only on the effective diffusion coefficient at 0.33 and 0.67 seconds to ensure that the additional statistical features add predictive value. Once an accurate model is trained, the SHapley Additive exPlanations (SHAP) package [4, 5] identifies the features that contribute most to the model's predictions, which can provide insight into the biological changes causing differences between the classes. To ensure that predictions and SHAP values reflect the model learning real underlying patterns in the data, we apply a Y-scrambling approach to determine how the model would perform if the class labels were random.
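The Y-scrambling check can be sketched as follows. To keep the example free of external dependencies, a 1-nearest-neighbour classifier stands in for XGBoost, and the two-class toy data are synthetic; the function names, the number of scrambling rounds, and the data generator are all assumptions for illustration rather than details of the pipeline.

```python
import random

def nearest_neighbor_predict(X_train, y_train, X_test):
    """1-nearest-neighbour classifier (a dependency-free stand-in for XGBoost)."""
    preds = []
    for row in X_test:
        best = min(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(row, X_train[i])))
        preds.append(y_train[best])
    return preds

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def y_scramble(X_train, y_train, X_test, y_test, n_rounds=20, seed=0):
    """Retrain on shuffled training labels and score against the true test labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y_train)
        rng.shuffle(shuffled)  # breaks the feature-label link, keeps class balance
        preds = nearest_neighbor_predict(X_train, shuffled, X_test)
        scores.append(accuracy(y_test, preds))
    return scores

def make_toy_data(n_per_class, seed):
    """Two synthetic classes with well-separated 2D feature clouds."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(label)
    return X, y

X_train, y_train = make_toy_data(60, seed=1)
X_test, y_test = make_toy_data(40, seed=2)

real_acc = accuracy(y_test, nearest_neighbor_predict(X_train, y_train, X_test))
scrambled = y_scramble(X_train, y_train, X_test, y_test)
mean_scrambled = sum(scrambled) / len(scrambled)

print(f"true labels: {real_acc:.2f}, scrambled labels: {mean_scrambled:.2f}")
```

If the accuracy on true labels is not clearly above the scrambled-label accuracies, the model's apparent performance, and any feature attributions derived from it, would be hard to distinguish from chance.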

After the initial set of predictions, we employ a “leave one out” method to evaluate the data quality of each MPT video, as well as each nanoparticle trajectory. All nanoparticle trajectories from a single MPT video are held out, and 15 XGBoost models are each trained on a random 10% subset of the remaining dataset. Each of the 15 models then makes predictions on the held-out nanoparticle trajectories, and those predictions are averaged to estimate how likely a trajectory is to be predicted accurately. The process is repeated for each unique MPT video in the dataset, resulting in an estimate of the quality of each MPT video and of each nanoparticle trajectory within each video.
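One possible reading of this procedure is sketched below. It assumes that "averaging the predictions" means taking the fraction of the 15 models that assign a held-out trajectory its true class, and that a video's score is the mean of its trajectory scores; both are interpretations, not details confirmed by the text. A 1-nearest-neighbour classifier again stands in for XGBoost, and the row layout (`video`, `traj_id`, `label`, `features`) and the toy data, including one deliberately mislabelled video, are hypothetical.

```python
import random

def predict_1nn(train_rows, features):
    """Label of the closest training row (dependency-free stand-in for XGBoost)."""
    nearest = min(train_rows,
                  key=lambda r: sum((a - b) ** 2
                                    for a, b in zip(r["features"], features)))
    return nearest["label"]

def leave_one_video_out(rows, n_models=15, subset_frac=0.10, seed=0):
    """Score every video and trajectory by how often held-out predictions are right."""
    rng = random.Random(seed)
    video_scores, traj_scores = {}, {}
    for video in sorted({r["video"] for r in rows}):
        held_out = [r for r in rows if r["video"] == video]
        remaining = [r for r in rows if r["video"] != video]
        k = max(1, round(subset_frac * len(remaining)))
        hits = {r["traj_id"]: 0 for r in held_out}
        for _ in range(n_models):
            subset = rng.sample(remaining, k)  # random 10% training subset
            for r in held_out:
                if predict_1nn(subset, r["features"]) == r["label"]:
                    hits[r["traj_id"]] += 1
        for traj_id, h in hits.items():
            traj_scores[traj_id] = h / n_models
        video_scores[video] = sum(hits.values()) / (n_models * len(held_out))
    return video_scores, traj_scores

def make_toy_rows(seed=0):
    """Six synthetic videos; 'b2_bad' carries label 1 but class-0-like features."""
    rng = random.Random(seed)
    videos = [("a0", 0, 0.0), ("a1", 0, 0.0), ("a2", 0, 0.0),
              ("b0", 1, 4.0), ("b1", 1, 4.0), ("b2_bad", 1, 0.0)]
    rows = []
    for video, label, centre in videos:
        for i in range(30):
            rows.append({"video": video, "traj_id": f"{video}_{i}", "label": label,
                         "features": [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]})
    return rows

rows = make_toy_rows()
video_scores, traj_scores = leave_one_video_out(rows)
for video, score in sorted(video_scores.items(), key=lambda kv: kv[1]):
    print(f"{video}: {score:.2f}")
```

In this toy setup the mislabelled video should receive the lowest score. Because its rows stay in the training pool when other videos are held out, it can also pull down the scores of otherwise clean videos that share its feature region, which is worth keeping in mind when interpreting per-video scores.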

Diff_viz is then used to visualize MSD vs time plots, diffusion mode distribution plots, and feature distribution plots of high- and low-value MPT videos. For each video, plots of the individual trajectories can be made to visually inspect high- and low-value trajectories. Additionally, all nanoparticle trajectories are plotted together at their respective positions in the MPT video, and each trajectory is colored by its estimated likelihood of being predicted correctly, producing a heatmap that can reveal localized areas of high- or low-value trajectories.

Results: To determine the efficacy of the pipeline, we applied it to three MPT datasets generated in rodent brain tissue where changes to the extracellular space and extracellular matrix (ECM) were validated through fluorescent imaging of key proteins of the brain ECM. The datasets include the dataset originally published by McKenna et al., where the classes are five age groups [3]; a dataset where the classes are five different brain regions; and a dataset from enzyme-induced ECM breakdown experiments, also from McKenna et al. [3]. These three datasets were chosen because their class variables are frequently of interest in neuroscience research and because they come from biological environments where the underlying changes are well known. Ensuring that the pipeline can detect changes in nanoparticle trajectories due to brain age, brain region, and experimentally induced ECM breakdown is crucial before applying the pipeline to more complex biological environments, such as models of neurodegenerative disease. Trained XGBoost models predicted classes in the age and region datasets with accuracy well above both random guessing and the baseline models, and the combinations of most important features calculated by SHAP differed between the two datasets. However, the model trained on the enzymatically induced ECM breakdown dataset did not reach an accuracy above the simple baseline model, and analysis of SHAP values showed that only one feature, the mean diffusion coefficient, provided useful information to the model.

We then apply the pipeline to two models of neurodegenerative disease in the rodent brain that have been developed in the Nance lab: an oxygen-glucose deprivation (OGD) model and a rotenone (ROT)-treated model. For both conditions, we demonstrate that the pipeline can accurately distinguish data from treated brain tissue from data from non-treated control tissue, mimicking the prediction of disease onset (Fig. 1A, Fig. 1B). We then show that the pipeline can predict varying severities of each condition with accuracy above random guessing (Fig. 1A, Fig. 1B). Finally, we find SHAP values for each condition and each level of severity to determine whether the models learn different key feature combinations across the two conditions (Fig. 1C). The most influential SHAP features show that an XGBoost model uses unique combinations of features for each condition and level of severity, indicating that the model can detect condition- and severity-specific trends in the data.
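For readers unfamiliar with how SHAP assigns importance to features, the sketch below computes exact Shapley values [4] for a small toy model by enumerating every feature coalition. It is only an illustration of the underlying idea: the toy model, the instance, and the all-zeros baseline are hypothetical, replacing absent features with baseline values is a simplification, and the SHAP package uses a far more efficient tree-specific algorithm [5] rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                present = set(coalition)
                gain = value_fn(present | {i}) - value_fn(present)
                phi[i] += weight * gain
    return phi

def toy_model(x):
    """Hypothetical model: one additive feature and one pairwise interaction."""
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]   # hypothetical feature values for one trajectory
baseline = [0.0, 0.0, 0.0]   # hypothetical reference values

def value_fn(present):
    """Model output when only the features in `present` take their real values."""
    x = [instance[j] if j in present else baseline[j] for j in range(len(instance))]
    return toy_model(x)

phi = shapley_values(value_fn, n_features=3)
print(phi)
```

The attributions sum to the difference between the model output for the instance and for the baseline, and the two interacting features share the credit for their joint term equally. Different classes producing different rankings of such attributions is what the comparison across conditions above relies on.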

Conclusion: Overall, our findings suggest that XGBoost can be used to predict a wide variety of complex biological variables from MPT data well above random guessing and, paired with SHAP, can identify the unique subsets of statistical features used to predict specific biological variables. The flexibility and accuracy of the pipeline across different MPT datasets demonstrate that it can be readily applied to new MPT datasets to gain insights into changes to the biological environment of interest. Future work building on these results will investigate how the most important SHAP features relate to cellular and proteomic changes within the biological environment.

References


  1. Wagner, T.; Kroll, A.; Haramagatti, C. R.; Lipinski, H.-G.; Wiemann, M., Classification and Segmentation of Nanoparticle Diffusion Trajectories in Cellular Micro Environments. PLOS ONE 2017, 12 (1), e0170165.
  2. Curtis, C.; McKenna, M.; Pontes, H.; Toghani, D.; Choe, A.; Nance, E., Predicting in situ nanoparticle behavior using multiple particle tracking and artificial neural networks. Nanoscale 2019, 11 (46), 22515-22530.
  3. McKenna, M.; Shackelford, D.; Ferreira Pontes, H.; Ball, B.; Nance, E., Multiple Particle Tracking Detects Changes in Brain Extracellular Matrix and Predicts Neurodevelopmental Age. ACS Nano 2021, 15 (5), 8559-8573.
  4. Shapley, L. S., A Value for N-Person Games. RAND Corporation: Santa Monica, CA, 1952.
  5. Lundberg, S. M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J. M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I., From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2020, 2 (1), 56-67.