(266f) Fault Tolerant Computing through Machine Learning

Conference

AIChE Annual Meeting

Year

2016

Proceeding

2016 AIChE Annual Meeting

Group

Computing and Systems Technology Division

Session

Advances in Computational Methods and Numerical Analysis

Time

Tuesday, November 15, 2016 - 10:00am to 10:18am

Authors

Sroczynski, D. - Presenter, Princeton University

Kyauk, C., Princeton University

Kevrekidis, I. G., Princeton University

Villoutreix, P., Princeton University

Anden, J., Princeton University

In modern, massively parallel scientific computation, domain decomposition assigns different
segments of a domain, and the different subfields or equations solved in each segment,
to different processors.
If a processor fails during a "computation era", before information is exchanged between nodes,
it is unclear whether, and how, the computation can proceed.

In many cases, the different fields that these processors compute are all functions of some
intrinsic lower-dimensional coarse variables (e.g., time during the computation, long-wavelength
features of the solution).
If the computational algorithms share some such common information,
we can use machine learning, in particular diffusion maps (a nonlinear manifold learning algorithm), to
"register" the computational data in the coarse space and to "fill in", to the best of our ability, data that
are missing or corrupted because of a processor failure.
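
As a rough illustration of the registration idea, the sketch below applies a basic diffusion-map construction to two synthetic fields that depend on the same hidden coarse variable t. It is not the implementation used in this work: the function name diffusion_map, the median kernel-scale heuristic, the density normalization, and both test fields are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def diffusion_map(X, n_coords=1, epsilon=None):
    """Leading nontrivial diffusion-map coordinates of the rows (snapshots) of X."""
    # Pairwise squared Euclidean distances between snapshots.
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    if epsilon is None:
        epsilon = np.median(D2)  # heuristic kernel scale (an assumption)
    K = np.exp(-D2 / epsilon)
    # Density normalization, so that uneven sampling matters less.
    q = K.sum(axis=1)
    K = K / np.outer(q, q)
    # Symmetric matrix with the same spectrum as the row-stochastic Markov matrix.
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Map back to right eigenvectors of the Markov matrix; skip the constant one.
    psi = vecs / np.sqrt(d)[:, None]
    return vals[1:n_coords + 1], psi[:, 1:n_coords + 1]

# Hypothetical setup: two "processors" observe different fields that are
# both functions of one hidden coarse variable t.
t = np.linspace(0.0, 2.0, 150)
x = np.linspace(0.0, 1.0, 64)
field_a = np.exp(-t)[:, None] * np.sin(np.pi * x) + t[:, None] * np.sin(2 * np.pi * x)
field_b = t[:, None] * np.cos(np.pi * x) + np.sin(t)[:, None] * np.cos(3 * np.pi * x)

_, psi_a = diffusion_map(field_a)
_, psi_b = diffusion_map(field_b)

# Each leading coordinate should track t up to sign and a monotone rescaling.
corr_a = abs(np.corrcoef(t, psi_a[:, 0])[0, 1])
corr_b = abs(np.corrcoef(t, psi_b[:, 0])[0, 1])
corr_ab = abs(np.corrcoef(psi_a[:, 0], psi_b[:, 0])[0, 1])
print(corr_a, corr_b, corr_ab)
```

In this toy setting the leading coordinate computed from either field orders the snapshots by t, which is what makes it usable as a shared coarse coordinate for aligning the two data sets.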

This allows us to learn functional relationships between aspects of the data fields that are
not common across processors, effectively fusing the data sets.
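
One simple way to picture the "fill in" step is kernel regression on the shared coarse coordinate. The sketch below uses Nadaraya-Watson regression, which is a stand-in chosen for brevity and not necessarily the method used in this work. The function name kernel_fill, the bandwidth, the traveling-wave field, and the failure pattern are all hypothetical, and the coarse coordinate phi is assumed to be already known for every snapshot.

```python
import numpy as np

def kernel_fill(phi, field, missing, bandwidth):
    """Fill rows of `field` flagged in `missing` by kernel regression on phi."""
    known = ~missing
    # Gaussian weights between each missing snapshot and every known snapshot.
    diff = phi[missing][:, None] - phi[known][None, :]
    W = np.exp(-diff**2 / (2.0 * bandwidth**2))
    W /= W.sum(axis=1, keepdims=True)
    filled = field.copy()
    # Each missing row becomes a weighted average of known rows nearby in phi.
    filled[missing] = W @ field[known]
    return filled

# Hypothetical setup: phi is a coarse coordinate shared by all processors,
# and the field on one processor is lost for some scattered snapshots.
phi = np.linspace(0.0, 1.0, 200)
x = np.linspace(0.0, 1.0, 40)
truth = np.sin(2.0 * np.pi * (x[None, :] - phi[:, None]))

missing = np.zeros(phi.size, dtype=bool)
missing[5::7] = True               # assumed failure pattern
field = truth.copy()
field[missing] = np.nan            # data lost to the processor failure

filled = kernel_fill(phi, field, missing, bandwidth=0.01)
max_err = np.max(np.abs(filled - truth))
print(max_err)
```

The bandwidth here is about two sample spacings in phi. Scattered gaps like these are the easy case; a long contiguous gap would leave few nearby known snapshots and would degrade this kind of local averaging.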

We demonstrate our approach on two illustrative PDE systems with various spatiotemporal patterns of missing data.

The approach meshes well with equation-free computation schemes, in particular with patch dynamics;
beyond helping to partially restore corrupted or missing data, it can help determine
the size of the computational domain over which simulations need not be performed,
and can help determine processor redundancy for different anticipated failure patterns.

This is joint work with Prof. G. Karniadakis and Dr. Seungjoon Lee at Brown University.

Topics

Computing and Systems Engineering

Measurement

Sensors
