(104g) Automated Dimension Reduction with Principal Component Analysis Using Area Under Curve | AIChE

Authors 

Chowdhury, D. - Presenter, Oak Ridge National Laboratory
McClain, M., Purdue University
Villez, K., Oak Ridge National Laboratory
The advancement of computational power and sensing technology has led to an increase in the collection of large datasets across many disciplines. However, high dimensionality makes data visualization difficult and therefore limits the analysis and exploration of data. Principal component analysis (PCA) [1] is a popular tool for reducing a dataset's dimensionality while capturing most of its variability. The directions in which the variation in the data is maximal are called principal components (PCs). The number of PCs retained in a PCA model is a hyper-parameter that needs to be tuned for applications in anomaly detection and process monitoring. Ultimately, the number of retained PCs should match the true number of latent variables in the data, which is not always known, to provide the best anomaly detection. In manufacturing systems, anomaly detection algorithms are used to monitor the printing process of a part and to detect anomalies. These algorithms' ability to distinguish between normal and anomalous data depends on choosing the optimal number of PCs in the PCA model.

Many criteria have been proposed for optimal selection of this hyper-parameter in unsupervised learning, including the Akaike information criterion [2], the Kaiser rule [3], parallel analysis [4], and the variance of reconstruction error (VRE) [5]. These methods assume (i) linear dependence of the measurements on an unknown number of latent variables and (ii) measurement errors sampled independently from the same univariate normal distribution. A cross-validated ignorance score [6] that interprets PCA as a density model, as in probabilistic principal component analysis (PPCA) [7], was proposed to find the optimal number of latent variables in PCA. However, this method assumes, as in PPCA, that the latent variables are normally distributed.

In this work, we present automated latent variable model selection for PCA without assuming that the latent variables are normally distributed. To this end, the data X ∈ R^(m×n), where m is the number of samples and n is the number of variables, is split into a calibration block Xc ∈ R^(mc×n) and a validation block Xv ∈ R^(mv×n), where mc + mv = m. We then create a randomized copy of the validation data block by randomly permuting the row positions of the elements within each column separately. This randomized block, Xf ∈ R^(mv×n), is called the faulty data block. The original validation block contains normal data, which a chosen model should be able to describe adequately. In contrast, the faulty block represents anomalous data that a suitable PCA model will not describe. This transforms an unsupervised learning problem into a supervised one.

The model is calibrated on the calibration data with a given number of PCs. The PCs are calculated from the singular value decomposition of the covariance matrix of the mean-centered calibration block, Ym^T*Ym/(mc-1) = UΣU^T, where Ym = Xc - 1*M^T; 1 is a column vector of all 1's; M is the column-wise mean vector; the columns of U are the principal directions (loadings); and Σ is a diagonal matrix containing all nonzero singular values (variances), ordered from largest to smallest. A model with K principal components can then be chosen as PCK = U(1:K), where U(1:K) denotes the first K columns of U. We then compute the value of the Q statistic [8] for the data samples in the validation and faulty blocks. First, approximations of the validation and faulty blocks are obtained by applying a projection and reconstruction step in sequence, Yv*PCK*PCK^T and Yf*PCK*PCK^T, where Yv = Xv - 1*M^T and Yf = Xf - 1*M^T. Then, the Q statistic for each sample in the validation and faulty blocks is given by

Qv(i) = Ev(i)*Ev(i)^T and Qf(i) = Ef(i)*Ef(i)^T, for i = 1, ..., mv,

where Ev(i) and Ef(i) denote the ith rows of the residual error matrices Ev = Yv - Yv*PCK*PCK^T and Ef = Yf - Yf*PCK*PCK^T.
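The data split, faulty-block construction, and per-sample Q statistics described above can be sketched in Python with NumPy (a minimal illustration; the function and variable names are ours, not from the abstract):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_statistics(Xc, Xv, K):
    """Fit a K-component PCA on the calibration block Xc and return the
    Q statistics of the validation block Xv and of a column-wise
    permuted (faulty) copy of it."""
    M = Xc.mean(axis=0)                       # column-wise mean vector M
    Ym = Xc - M                               # mean-centered calibration block
    # eigenvectors of the covariance matrix Ym^T * Ym / (mc - 1) via SVD
    U, _, _ = np.linalg.svd(Ym.T @ Ym / (len(Xc) - 1))
    PCK = U[:, :K]                            # first K principal directions

    # faulty block: permute each column of the validation block independently
    Xf = np.column_stack([rng.permutation(col) for col in Xv.T])

    def q(X):
        E = (X - M) - (X - M) @ PCK @ PCK.T   # residual after projection/reconstruction
        return np.sum(E**2, axis=1)           # Q(i) = E(i) * E(i)^T

    return q(Xv), q(Xf)
```

Because the permutation destroys the correlation structure between columns, the faulty samples should produce noticeably larger residuals than the validation samples whenever K captures the true latent structure.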

A receiver operating characteristic (ROC) curve [9] is constructed by plotting the true positive rate (TPR), computed from the faulty data block, against the false positive rate (FPR), computed from the validation block. The TPR and FPR values are computed as follows:

  1. Generate a 2*mv-dimensional vector Q by stacking Qv and Qf atop each other.
  2. Sort the values in Q in ascending order.
  3. Choose an upper control limit QUCL = Q(i). The FPR is the fraction of values in Qv that are higher than QUCL, and the TPR is the fraction of values in Qf that are higher than QUCL. Each (FPR, TPR) pair is a single point on the ROC curve; the complete curve is generated by sweeping the upper control limit, i.e., QUCL = Q(i) for i = 1, ..., 2*mv.
  4. Obtain the ROC curve by plotting the TPR as a function of the FPR.
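Steps 1-4 can be sketched as follows (a hypothetical helper, assuming the per-sample Q statistics of the validation and faulty blocks are available as NumPy arrays):

```python
import numpy as np

def roc_points(Qv, Qf):
    """Sweep the upper control limit QUCL over the pooled, sorted Q
    values (steps 1-2) and return the resulting (FPR, TPR) pairs
    (steps 3-4)."""
    Q = np.sort(np.concatenate([Qv, Qf]))                  # stack and sort
    fpr = np.array([(Qv > q_ucl).mean() for q_ucl in Q])   # false positive rate
    tpr = np.array([(Qf > q_ucl).mean() for q_ucl in Q])   # true positive rate
    return fpr, tpr
```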

Once the ROC curves are constructed, we choose the optimal number of latent variables K by comparing the curves obtained for every candidate K. To do this automatically, we calculate the area under the curve (AUC) [9], computed as the integral of the piece-wise constant function defined by the (FPR, TPR) pairs of the ROC curve. The AUC measures how well a chosen model separates anomalous (faulty) data samples from normal ones. The model for which the AUC attains its maximum value above 0.5 (the value of a random classifier) has the best classifier performance, and the number of PCs in that model is the chosen number of latent variables.
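The AUC computation and the resulting model choice can be sketched as below. This is a simplified illustration: for robustness it appends the (0, 0) and (1, 1) endpoints and integrates with the trapezoidal rule rather than the piece-wise constant integral, and the per-candidate AUC values in the example are hypothetical.

```python
import numpy as np

def auc(fpr, tpr):
    """Area under the ROC curve. Both rates grow monotonically as the
    control limit decreases, so sorting each array reproduces the
    curve; the (0, 0) and (1, 1) endpoints are appended before
    integrating."""
    f = np.concatenate([[0.0], np.sort(fpr), [1.0]])
    t = np.concatenate([[0.0], np.sort(tpr), [1.0]])
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))  # trapezoidal rule

# hypothetical AUC values for candidate numbers of PCs K = 1..4
aucs = {1: 0.91, 2: 0.97, 3: 0.99, 4: 0.93}
best_K = max(aucs, key=aucs.get)   # K whose model attains the maximum AUC
```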

To show the algorithm's efficacy, we use the simulated data class B1 given in [6], specifically data types B1.01–B1.06. For this data set, the true number of PCs is three and the number of variables is nine. The data types B1.01–B1.06 are simulated with a covariance matrix composed of blocks, each of which has the same value in all row-column positions. The blocks are chosen such that the fraction of non-noisy variance in the total variance decreases gradually from data type B1.01 to B1.06. We run 100 iterations of the algorithm for each data type. Accuracy is defined as the fraction of instances of a data set for which the identified number of PCs exactly matches the ground-truth value. In our simulation results, the proposed algorithm selects the optimal number of latent variables with 97% accuracy across data types B1.01–B1.06.

References

[1] Jolliffe, I. T.; Cadima, J. Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A 2016, 374 (2065), 20150202.

[2] Akaike, H. Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory; Petrov, B. N., Csáki, F., Eds.; 1974; pp 267-281.

[3] Kaiser, H. F. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 1960, 20 (1), 141-151.

[4] Zwick, W. R.; Velicer, W. F. Comparison of five rules for determining the number of components to retain. Psychol. Bull. 1986, 99 (3), 432-442.

[5] Valle, S.; Li, W.; Qin, S. J. Selection of the number of principal components: the variance of the reconstruction error criterion with a comparison to other methods. Ind. Eng. Chem. Res. 1999, 38 (11), 4389-4401.

[6] Russo, S.; Li, G.; Villez, K. Automated model selection in principal component analysis: a new approach based on the cross-validated ignorance score. Ind. Eng. Chem. Res. 2019, 58 (30), 13448-13468.

[7] Tipping, M. E.; Bishop, C. M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B 1999, 61, 611-622.

[8] Montgomery, D. C. Introduction to Statistical Quality Control, 4th ed.; Wiley: New York, 2001.

[9] Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27 (8), 861-874.