(268b) Automated Outlier Detection and Estimation of Missing Data | AIChE

Authors 

Mohan, N., Massachusetts Institute of Technology
Cummings Bende, E. M., University of Massachusetts Amherst
Maloney, A. J., Amgen Inc
Nieves, M., Massachusetts Institute of Technology
Lu, A. E., Massachusetts Institute of Technology
Hong, M. S., Massachusetts Institute of Technology
Wen Ou, R., Massachusetts Institute of Technology
Barone, P. W., Massachusetts Institute of Technology
Leung, J. C., Massachusetts Institute of Technology
Braatz, R. D., Massachusetts Institute of Technology

Industrial process datasets commonly have missing values. The pattern of missing data points generally falls into one of four classes: (1) random missing data, which exhibit no explicit pattern, (2) sensor drop-out, in which the missing values are correlated in time, (3) multi-rate, in which the missing data occur periodically, and (4) censoring, in which measurements outside a range defined by thresholds are not recorded (Imtiaz and Shah, 2008; Severson et al., 2017).
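
As an illustration of the four classes, the following minimal sketch builds a boolean missingness mask for each pattern on a synthetic data matrix. The function names, the synthetic data, and all parameter values are hypothetical choices made for this example; they are not taken from the software described in this abstract.

```python
import numpy as np

def random_mask(shape, frac, rng):
    """Random missing data: each entry is dropped independently with probability frac."""
    return rng.random(shape) < frac

def dropout_mask(shape, col, start, length):
    """Sensor drop-out: one sensor loses a contiguous block of samples in time."""
    mask = np.zeros(shape, dtype=bool)
    mask[start:start + length, col] = True
    return mask

def multirate_mask(shape, col, period):
    """Multi-rate: a slow sensor reports only every `period`-th sample."""
    mask = np.zeros(shape, dtype=bool)
    mask[:, col] = True
    mask[::period, col] = False
    return mask

def censoring_mask(X, col, lower, upper):
    """Censoring: values of one sensor outside [lower, upper] are not recorded."""
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, col] = (X[:, col] < lower) | (X[:, col] > upper)
    return mask

# Synthetic stand-in for a process dataset: rows are time samples, columns are sensors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Apply one pattern by overwriting the masked entries with NaN.
X_missing = X.copy()
X_missing[multirate_mask(X.shape, col=2, period=10)] = np.nan
```

Masks from several patterns can be combined with a logical OR to mimic a dataset that exhibits more than one class of missingness.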

The presence of missing values inhibits the use of the data in process modeling, analysis, and control. Even basic data analytics methods such as principal component analysis (PCA) and partial least squares (PLS) require a full data matrix, that is, one without any missing values. The simplest way to deal with missing values is to consider only the observations with complete measurements. Removing every observation with missing values, however, can cause significant data loss in time periods where the missing values are clustered, which makes capturing the process dynamics challenging. To make full use of a given dataset, many general-purpose methods have been developed to fill in missing values in a structured way, including mean imputation, the alternating algorithm, PCA data augmentation (Imtiaz and Shah, 2008), Bayesian PCA (Oba et al., 2003), singular value thresholding (Cai et al., 2010), and the augmented Lagrange multiplier method (Lin et al., 2010). These methods are briefly reviewed in this presentation.
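
To make the idea of structured imputation concrete, the sketch below contrasts column-mean imputation with a simple iterative low-rank SVD imputation. The iterative scheme is a generic illustration in the spirit of the PCA-based methods listed above; it is not the exact form of any cited algorithm, and the function names, the fixed rank, and the stopping rule are assumptions made for this example.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = col_means[cols]
    return filled

def iterative_svd_impute(X, rank, n_iter=200, tol=1e-8):
    """Alternate between a rank-`rank` fit and re-filling the missing entries."""
    missing = np.isnan(X)
    filled = mean_impute(X)                      # initial guess
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        change = np.linalg.norm(low_rank[missing] - filled[missing])
        filled[missing] = low_rank[missing]      # observed entries are never altered
        if change < tol:
            break
    return filled

# Synthetic rank-2 data with column offsets and 10% of entries removed at random.
rng = np.random.default_rng(1)
A = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 6)) + rng.normal(size=6)
missing = rng.random(A.shape) < 0.1
A_obs = A.copy()
A_obs[missing] = np.nan

err_mean = np.linalg.norm(mean_impute(A_obs)[missing] - A[missing])
err_svd = np.linalg.norm(iterative_svd_impute(A_obs, rank=2)[missing] - A[missing])
```

On data with strong low-rank structure, the low-rank fit can exploit correlations between variables that mean imputation ignores, which is the motivation for the matrix recovery methods.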

The aforementioned matrix recovery algorithms for filling in missing data points, however, are vulnerable to outliers. If outliers are not removed before missing values are filled in, their effect is amplified, which degrades the accuracy and reliability of the results obtained by subsequent data analytics. These outliers can be detected by using the contribution map for the T² and Q statistics (Miller et al., 1998; Chiang et al., 2000; Zhu and Braatz, 2014).
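
The T² and Q statistics and the per-variable Q contributions can be sketched as follows for a complete data matrix. This is a minimal illustration under stated assumptions: the data are autoscaled, the number of retained components is chosen by hand, and the synthetic data, mixing matrix `W`, and injected outlier are hypothetical. Control limits and the treatment of missing entries used in the software are not specified in this abstract and are not reproduced here.

```python
import numpy as np

def pca_t2_q(X, n_components):
    """Return per-sample T^2, per-sample Q, and per-variable Q contributions."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    Z = (X - mu) / sigma                               # autoscaled data
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                            # loadings (variables x components)
    eigvals = s[:n_components] ** 2 / (X.shape[0] - 1) # variances of the retained scores
    scores = Z @ P
    t2 = np.sum(scores ** 2 / eigvals, axis=1)         # variation inside the PCA model
    residual = Z - scores @ P.T
    q_contrib = residual ** 2                          # contribution of each variable to Q
    q = q_contrib.sum(axis=1)                          # variation outside the PCA model
    return t2, q, q_contrib

# Synthetic data: six correlated variables driven by two latent factors.
rng = np.random.default_rng(0)
W = np.array([[1.0, 0.8, 0.6, 0.0, 0.2, 0.5],
              [0.0, 0.3, 0.6, 1.0, 0.9, 0.5]])
X = rng.normal(size=(500, 2)) @ W + 0.1 * rng.normal(size=(500, 6))
X[17, 2] += 10.0                                       # inject one outlier

t2, q, q_contrib = pca_t2_q(X, n_components=2)
suspect_row = int(np.argmax(q))                        # sample with the largest Q
suspect_col = int(np.argmax(q_contrib[suspect_row]))   # variable contributing most to it
```

Plotting `q_contrib` as a samples-by-variables image gives a two-dimensional view in which an isolated outlier appears as a single bright cell.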

To the best of our knowledge, no existing software simultaneously detects outliers and fills in missing values in an automated way. In this presentation, we introduce open-source software that automatically detects outliers, fills in missing values, and evaluates each algorithm used for matrix recovery. This presentation describes the framework for outlier detection and missing value estimation used in the software and demonstrates the software on data collected from a continuous biomanufacturing pilot facility at the Massachusetts Institute of Technology. The data are from the production of a monoclonal antibody by Chinese Hamster Ovary cells in a perfusion bioreactor.

To provide a thorough validation of the methods and software, the software is applied to a variety of datasets with varying types and extents of missing data, constructed from an initial dataset in which all of the measurements are available. The performance of the various methods for filling in missing values is compared using several metrics, including the normalized root-mean-squared error, the number of imputed values outside the boundaries, the number of imputed values considered outliers, and the computational time. Fifty simulations of each missing data scenario were conducted to obtain the distribution of these metrics. The matrix completion methods were the most effective except in the censoring case, where probabilistic PCA methods were the most effective.
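
The evaluation loop can be sketched as below for two of the metrics. The abstract does not state how the root-mean-squared error is normalized or how the boundaries are defined, so this example assumes normalization by the standard deviation of the true values at the missing positions and per-variable lower and upper bounds. The function names, the synthetic data, and the use of mean imputation as the method under test are hypothetical.

```python
import numpy as np

def nrmse(X_true, X_imputed, missing):
    """RMSE over the imputed entries, normalized by the std of the true values there."""
    err = X_imputed[missing] - X_true[missing]
    return float(np.sqrt(np.mean(err ** 2)) / np.std(X_true[missing]))

def count_outside_bounds(X_imputed, missing, lower, upper):
    """Number of imputed entries outside per-variable [lower, upper] bounds."""
    outside = (X_imputed < lower) | (X_imputed > upper)
    return int(np.sum(outside & missing))

# Fifty random-missingness simulations on a complete synthetic dataset.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))
scores = []
for _ in range(50):
    missing = rng.random(X_true.shape) < 0.1
    X_obs = np.where(missing, np.nan, X_true)
    # Method under test: column-mean imputation (any imputer could be swapped in).
    X_imp = np.where(missing, np.nanmean(X_obs, axis=0), X_obs)
    scores.append(nrmse(X_true, X_imp, missing))

median_score = float(np.median(scores))
```

Collecting the metric over repeated simulations, rather than a single run, gives a distribution that can be compared across imputation methods and missing data scenarios.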

References

Cai, J. F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.

Chiang, L. H., Russell, E. L., and Braatz, R. D. (2000). Fault Detection and Diagnosis in Industrial Systems. Springer Science & Business Media, London, U.K.

Imtiaz, S. A. and Shah, S. L. (2008). Treatment of missing values in process data analysis. Canadian Journal of Chemical Engineering, 86(5):838–858.

Lin, Z., Chen, M., and Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055.

Miller, P., Swanson, R. E., and Heckler, C. E. (1998). Contribution plots: A missing link in multivariate quality control. Applied Mathematics & Computer Science, 8(4):775–792.

Oba, S., Sato, M.-A., Takemasa, I., Monden, M., Matsubara, K.-I., and Ishii, S. (2003). A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16):2088–2096.

Severson, K. A., Molaro, M. C., and Braatz, R. D. (2017). Principal component analysis of process datasets with missing values. Processes, 5(3):38.

Zhu, X. and Braatz, R. D. (2014). Two-dimensional contribution map for fault identification. IEEE Control Systems, 34(5):72–77.