(268b) Automated Outlier Detection and Estimation of Missing Data

Conference

AIChE Annual Meeting

Year

2023

Proceeding

2023 AIChE Annual Meeting

Group

Topical Conference: Next-Gen Manufacturing

Session

Future of Manufacturing and Emerging Technologies

Time

Monday, November 6, 2023 - 3:55pm to 4:20pm

Authors

Rhyu, J. - Presenter

Bozinovski, D., MIT

Dubs, A. B., MIT

Mohan, N., Massachusetts Institute of Technology

Cummings Bende, E. M., University of Massachusetts Amherst

Maloney, A. J., Amgen Inc

Nieves, M., Massachusetts Institute of Technology

Sangerman, J., MIT

Lu, A. E., Massachusetts Institute of Technology

Hong, M. S., Massachusetts Institute of Technology

Artamonova, A., MIT

Wen Ou, R., MIT

Barone, P. W., Massachusetts Institute of Technology

Leung, J. C., Massachusetts Institute of Technology

Wolfrum, J., MIT

Springs, S., MIT

Braatz, R. D., Massachusetts Institute of Technology

Sinskey, A., MIT

Industrial process datasets commonly have missing values. The pattern of missing data points can be grouped into four classes: (1) random missing data, which exhibit no explicit pattern, (2) sensor drop-out, in which the missing values are correlated in time, (3) multi-rate, in which the missing data occur periodically, and (4) censoring, in which thresholds exist such that measurements outside the range are not recorded (Imtiaz and Shah, 2008; Severson et al., 2017).
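
The four classes above can be illustrated with synthetic missingness masks. The following is a minimal sketch, not the procedure used in this work; the function names, the choice of affected columns, and the parameters `frac`, `dropout_len`, and `period` are all hypothetical.

```python
import numpy as np

def make_missing_masks(n_rows, n_cols, rng, frac=0.1, dropout_len=10, period=5):
    """Return boolean masks (True = missing) for three of the four classes.

    All parameter names and defaults are illustrative assumptions.
    """
    masks = {}
    # (1) Random: each entry is missing independently with probability `frac`.
    masks["random"] = rng.random((n_rows, n_cols)) < frac
    # (2) Sensor drop-out: a contiguous run of rows is missing in column 0.
    dropout = np.zeros((n_rows, n_cols), dtype=bool)
    start = rng.integers(0, n_rows - dropout_len + 1)
    dropout[start:start + dropout_len, 0] = True
    masks["dropout"] = dropout
    # (3) Multi-rate: the last column is recorded only every `period`-th row.
    multirate = np.zeros((n_rows, n_cols), dtype=bool)
    multirate[:, -1] = np.arange(n_rows) % period != 0
    masks["multirate"] = multirate
    return masks

def censor_mask(X, lower, upper):
    """(4) Censoring: entries outside [lower, upper] are not recorded."""
    X = np.asarray(X, dtype=float)
    return (X < lower) | (X > upper)
```

Applying a mask with `np.where(mask, np.nan, X)` yields a dataset with the corresponding missingness pattern, which is one way to build test cases from a complete data matrix.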

The presence of missing values inhibits the use of the data in process modeling, analysis, and control. Even basic data analytics methods such as principal component analysis (PCA) and partial least squares (PLS) require a full data matrix, that is, one without any missing values. The simplest way to deal with missing values is to consider only the observations with full measurements. Removing every observation with missing values, however, can cause significant data loss in time periods where the missing values are clustered, which makes capturing the process dynamics challenging. To make full use of a given dataset, many general-purpose methods have been developed to fill in missing values in a structured way, including mean imputation, the alternating algorithm, PCA data augmentation (Imtiaz and Shah, 2008), Bayesian PCA (Oba et al., 2003), singular value thresholding (Cai et al., 2010), and the augmented Lagrange multiplier method (Lin et al., 2010). These methods are briefly reviewed in this presentation.
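
To make the idea of structured imputation concrete, the sketch below shows mean imputation and a simplified iterative PCA imputation that alternates between fitting a low-rank model and refilling the missing entries. It is an illustrative assumption about how such a scheme can be written, not the implementation of any of the cited algorithms; the function names and the parameters `n_components`, `n_iter`, and `tol` are hypothetical.

```python
import numpy as np

def mean_impute(X):
    """Replace NaNs in each column with that column's observed mean."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def iterative_pca_impute(X, n_components=2, n_iter=200, tol=1e-8):
    """Alternate between a rank-limited PCA fit and refilling missing entries.

    Simplified sketch: observed entries are never modified; only the
    entries that were NaN in X are updated at each iteration.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = mean_impute(X)  # initial guess
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Rank-limited reconstruction of the centered data, plus the mean.
        low_rank = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        new = np.where(missing, low_rank, X)
        change = np.linalg.norm(new - filled)
        filled = new
        if change < tol:
            break
    return filled
```

Mean imputation ignores correlations between variables, whereas the iterative scheme uses them, which is why low-rank methods tend to be preferred when process variables are strongly correlated.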

The aforementioned matrix recovery algorithms for filling in missing data points, however, are vulnerable to outliers. If outliers are not removed before missing values are filled in, their effect is amplified, which degrades the accuracy and reliability of results obtained by subsequent data analytics. These outliers can be detected by using the contribution map for T² and Q statistics (Miller et al., 1998; Chiang et al., 2000; Zhu and Braatz, 2014).
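
As a rough illustration of the statistics named above, the following sketch computes Hotelling's T² and the Q statistic (squared prediction error) from a PCA model of a complete, autoscaled data matrix, together with per-variable contributions to Q. The function name is hypothetical, control limits are omitted, and the contribution definition shown (squared residual per variable) is one common choice, assumed here rather than taken from the software.

```python
import numpy as np

def t2_q_statistics(X, n_components):
    """Return T^2, Q, and per-variable Q contributions for each observation.

    Assumes X has no missing values and no constant columns.
    """
    X = np.asarray(X, dtype=float)
    # Autoscale: zero mean and unit variance per variable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                           # loadings
    eigvals = s[:n_components] ** 2 / (Z.shape[0] - 1)  # score variances
    scores = Z @ P
    # T^2: variance-scaled squared distance within the PCA subspace.
    t2 = np.sum(scores ** 2 / eigvals, axis=1)
    # Q: squared distance from the PCA subspace (residual space).
    residual = Z - scores @ P.T
    q_contrib = residual ** 2
    q = q_contrib.sum(axis=1)
    return t2, q, q_contrib
```

Observations with unusually large T² or Q are outlier candidates, and the rows of `q_contrib` indicate which variables drive a large Q value, although contributions can be smeared across correlated variables.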

To the best of our knowledge, no existing software simultaneously detects outliers and fills in missing values in an automated way. In this presentation, we introduce open-source software that automatically detects outliers, fills in missing values, and evaluates each algorithm used for matrix recovery. This presentation describes the framework for outlier detection and missing value estimation used in the software and demonstrates the software on data collected from a continuous biomanufacturing pilot facility at the Massachusetts Institute of Technology. The data are from the production of a monoclonal antibody by Chinese Hamster Ovary cells in a perfusion bioreactor.

To provide a thorough validation of the methods and software, the software is applied to a variety of datasets with varying types and extents of missing data, constructed from an initial dataset in which all of the measurements are available. The performance of the various methods for filling in missing values is compared using several metrics, including the normalized root-mean-squared error, the number of imputed values outside the boundaries, the number of imputed values considered outliers, and the computational time. Fifty simulations of each missing data scenario were conducted to obtain the distribution of these metrics. The matrix completion methods were the most effective except in the censoring case, where probabilistic PCA methods were the most effective.
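
Two of the metrics above can be sketched as follows. The abstract does not state how the root-mean-squared error is normalized or how the boundaries are defined, so the normalization by the standard deviation of the true values at the missing positions, and the user-supplied `lower` and `upper` bounds, are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def nrmse(X_true, X_imputed, missing):
    """RMSE over the imputed entries, divided by the standard deviation of
    the true values at those entries (normalization is an assumption)."""
    X_true = np.asarray(X_true, dtype=float)
    X_imputed = np.asarray(X_imputed, dtype=float)
    diff = (X_imputed - X_true)[missing]
    rmse = np.sqrt(np.mean(diff ** 2))
    return rmse / np.std(X_true[missing])

def count_outside_bounds(X_imputed, missing, lower, upper):
    """Count imputed entries that fall outside [lower, upper].

    `lower` and `upper` may be scalars or per-variable arrays that
    broadcast against the columns of X_imputed.
    """
    X_imputed = np.asarray(X_imputed, dtype=float)
    outside = (X_imputed < lower) | (X_imputed > upper)
    return int(np.sum(outside & missing))

# Small usage example with a known ground truth.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imp = np.array([[1.0, 2.5], [3.5, 4.0]])
missing = np.array([[False, True], [True, False]])
print(nrmse(X_true, X_imp, missing))  # 1.0
print(count_outside_bounds(X_imp, missing, [0.0, 0.0], [3.0, 2.4]))  # 2
```

Repeating such a calculation over many randomly generated masks for each scenario gives a distribution of each metric rather than a single value.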

References

Cai, J. F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.

Chiang, L. H., Russell, E. L., and Braatz, R. D. (2000). Fault Detection and Diagnosis in Industrial Systems. Springer Science & Business Media, London, U.K.

Imtiaz, S. A. and Shah, S. L. (2008). Treatment of missing values in process data analysis. Canadian Journal of Chemical Engineering, 86(5):838–858.

Lin, Z., Chen, M., and Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055.

Miller, P., Swanson, R. E., and Heckler, C. E. (1998). Contribution plots: A missing link in multivariate quality control. Applied Mathematics & Computer Science, 8(4):775–792.

Oba, S., Sato, M.-A., Takemasa, I., Monden, M., Matsubara, K.-I., and Ishii, S. (2003). A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16):2088–2096.

Severson, K. A., Molaro, M. C., and Braatz, R. D. (2017). Principal component analysis of process datasets with missing values. Processes, 5(3):38.

Zhu, X. and Braatz, R. D. (2014). Two-dimensional contribution map for fault identification. IEEE Control Systems, 34(5):72–77.

Topics

Process Automation & Control

Computing and Systems Engineering

Measurement and Metrics
