(185a) Data Fusion and Feature Selection for Process Monitoring at Hanford
AIChE Annual Meeting
2022
Computing and Systems Technology Division
Data Science/Analytics for Process Applications
Monday, November 14, 2022 - 3:30pm to 3:49pm
The effective utilization of data has become more important as processes provide more information under Industry 4.0 initiatives [1]. One such application is real-time monitoring at a vitrification plant at the Hanford Site in Washington State. This site removes water and radionuclides from radioactive waste before vitrifying it for long-term storage. The process will have inconsistent feed streams, which creates the need for real-time process monitoring. However, a single in-line instrument has limited ability to measure the 25 chemical constituents and 46 radionuclides expected in the process [2]. A preprocessing strategy is needed that can combine information from multiple sensors and distinguish components in the complex spectra.
Two on-line sensors are investigated in this study: Raman and Attenuated Total Reflectance-Fourier Transform Infrared (ATR-FTIR) spectroscopy. These molecular measurement techniques provide a high-dimensional space of 1472 wavenumbers for monitoring the process and supporting decision-making. In addition, spectral components that were not included in model training may appear under actual process conditions. The high dimensionality and possible process noise motivate the use of feature selection to reduce dimensionality prior to model input.
The feature selection method used is a general forward selection wrapper method [3]. A wrapper method uses the quantification model, in this case Partial Least Squares Regression (PLSR), to determine the most important subset of features. The general forward selection method used in this study has two distinct steps, which are necessitated by the high dimensionality of the problem. Testing every possible subset of features is an NP-hard problem with 2^1472 possible combinations [3]. Therefore, a feature ranking is first established based on heuristics, and then the optimum number of these ordered features is determined. This reduces the number of candidate feature subsets from 2^1472 to 1472.
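To give a sense of scale for the reduction described above, the short sketch below compares the two search spaces. The variable names are illustrative only; the count of 1472 wavenumbers is taken from the abstract.

```python
n_wavenumbers = 1472  # number of spectral features reported in the abstract

# Exhaustive search: every feature is either in or out of the subset.
exhaustive_subsets = 2 ** n_wavenumbers

# Ranked search: only the top-1, top-2, ..., top-1472 features of a fixed ordering.
ranked_subsets = n_wavenumbers

print(f"exhaustive search: a {len(str(exhaustive_subsets))}-digit number of subsets")
print(f"ranked search:     {ranked_subsets} subsets")
```

Fixing the feature order in advance is what makes the search tractable: each candidate subset is fully determined by a single integer, the number of top-ranked features kept.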
The first step is ranking the features (wavenumbers) based on spectral intensity in the training set. This follows the intuition that (after baseline correction) features with high intensities correspond to useful spectral information, based on signal-to-noise ratio arguments. The second step is determining, via cross-validation, how many of these ordered features give optimum performance on test spectra. Root Mean Squared Error (RMSE) was used as the primary error metric for evaluating performance. In summary, the features are ranked on the training data, and then the optimum number of features is chosen on test spectra containing species not included in model training.
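The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: ranking by mean absolute training intensity is one plausible reading of "spectral intensity", the four-channel data are invented, and the `evaluate` function uses a one-parameter least-squares fit as a stand-in for PLSR so the example runs with the standard library alone.

```python
from math import sqrt

def rank_by_intensity(train_spectra):
    """Step 1: order feature indices by mean absolute training intensity (assumed heuristic)."""
    n_samples = len(train_spectra)
    n_features = len(train_spectra[0])
    mean_abs = [sum(abs(row[j]) for row in train_spectra) / n_samples
                for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: mean_abs[j], reverse=True)

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def choose_feature_count(ranking, evaluate):
    """Step 2: score the top-k features for k = 1..n and keep the k with the lowest error."""
    scores = [evaluate(ranking[:k]) for k in range(1, len(ranking) + 1)]
    best_k = min(range(len(scores)), key=scores.__getitem__) + 1
    return best_k, scores

# Invented toy data: channels 0 and 1 track concentration, channels 2 and 3 are baseline.
train_X = [[10.0, 4.0, 0.5, 0.1], [20.0, 8.0, 0.5, 0.1], [30.0, 12.0, 0.5, 0.1]]
train_y = [1.0, 2.0, 3.0]
# Test spectra carry an interferent in channel 2 (3.0 instead of 0.5),
# mimicking a species that was absent from training.
test_X = [[15.5, 5.5, 3.0, 0.1], [24.5, 10.5, 3.0, 0.1]]
test_y = [1.5, 2.5]

def evaluate(subset):
    """Stand-in model (NOT PLSR): least-squares slope through the origin on summed intensity."""
    s_train = [sum(row[j] for j in subset) for row in train_X]
    slope = sum(s * y for s, y in zip(s_train, train_y)) / sum(s * s for s in s_train)
    preds = [slope * sum(row[j] for j in subset) for row in test_X]
    return rmse(test_y, preds)

ranking = rank_by_intensity(train_X)
best_k, scores = choose_feature_count(ranking, evaluate)
print("ranking:", ranking)
print("RMSE for top-k subsets:", [round(s, 4) for s in scores])
print("chosen number of features:", best_k)
```

In this toy case the error rises as soon as the low-intensity channel containing the untrained interferent is added, which is the behavior the truncation step is meant to exploit. In practice `evaluate` would fit the actual quantification model on the selected wavenumbers.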
A similar established wrapper method, the Successive Projections Algorithm (SPA), was compared and found to underperform the more general forward selection method. SPA's worse performance is attributed to how it handles the highly collinear information present in the spectra. As part of SPA, mutual information from the already chosen features is subtracted. Since wavenumbers in spectral peaks share much mutual information with adjacent wavenumbers, subtraction of mutual information quickly leads the algorithm to select wavenumbers based on noise rather than information (noise is the primary information left after several iterations of SPA).
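The collinearity effect described above can be illustrated with a small sketch. It assumes the common projection-based formulation of SPA, in which each remaining feature column has its component along the most recently selected column removed; the variant used in the study is not specified in the abstract, and the data and function names here are invented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spa_select(columns, start, n_select):
    """Select feature indices by successive orthogonal projections (assumed SPA variant).

    columns: list of feature vectors, one per wavenumber (each of length n_samples).
    """
    residual = [list(c) for c in columns]
    selected = [start]
    while len(selected) < n_select:
        ref = residual[selected[-1]]
        ref_norm2 = dot(ref, ref)
        if ref_norm2 == 0.0:
            break  # nothing left to project against
        for j, col in enumerate(residual):
            if j in selected:
                continue
            # Remove the part of this column already explained by the last selected feature.
            coef = dot(col, ref) / ref_norm2
            residual[j] = [c - coef * r for c, r in zip(col, ref)]
        candidates = [j for j in range(len(residual)) if j not in selected]
        # Keep the feature with the largest remaining (unexplained) norm.
        selected.append(max(candidates, key=lambda j: dot(residual[j], residual[j])))
    return selected, residual

peak = [1.0, 2.0, 3.0, 4.0]            # wavenumber at the center of a peak
neighbor = [1.01, 1.99, 3.02, 3.98]    # adjacent wavenumber: same peak plus small noise
other = [0.5, -0.5, 0.5, -0.5]         # weaker but largely independent channel

selected, residual = spa_select([peak, neighbor, other], start=0, n_select=2)
print("selected:", selected)
print("what remains of the neighbor:", [round(v, 3) for v in residual[1]])
```

Once the peak center is chosen, the adjacent wavenumber retains little beyond its noise term. With many collinear channels per peak, later selections are therefore increasingly driven by residual noise.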
Data fusion is implemented to combine the information from the Raman and ATR-FTIR sensors after the feature selection is applied. There are multiple levels of data fusion, as shown by Borras et al. [4]. In this work, data-level fusion is performed through concatenation. Concatenation is applied because it retains all information from both instruments (after feature selection is applied). Standard scaling, in addition to dimensionality reduction in the model (PLSR), removes the physical differences between the spectra. This allows the information to be input simultaneously into a single model.
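A minimal sketch of data-level fusion by concatenation is given below. It assumes that each instrument's selected features are standardized column by column with statistics from the training set before being joined; the numbers and function names are invented for illustration and are not taken from the study.

```python
from statistics import mean, pstdev

def fit_standard_scaler(train_block):
    """Per-column (mean, std) from training spectra; constant columns get std 1.0."""
    columns = list(zip(*train_block))
    return [(mean(c), pstdev(c) or 1.0) for c in columns]

def apply_scaler(block, params):
    """Standardize each column to zero mean and unit variance using fitted params."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in block]

def fuse(raman_block, ftir_block):
    """Data-level fusion: join the two feature vectors of each sample end to end."""
    if len(raman_block) != len(ftir_block):
        raise ValueError("both instruments must measure the same samples")
    return [list(r) + list(f) for r, f in zip(raman_block, ftir_block)]

# Invented intensities after feature selection: rows are samples, columns are wavenumbers.
raman_selected = [[1200.0, 300.0], [1500.0, 450.0], [1800.0, 600.0]]
ftir_selected = [[0.02, 0.10, 0.05], [0.04, 0.12, 0.05], [0.06, 0.14, 0.05]]

raman_scaled = apply_scaler(raman_selected, fit_standard_scaler(raman_selected))
ftir_scaled = apply_scaler(ftir_selected, fit_standard_scaler(ftir_selected))
fused = fuse(raman_scaled, ftir_scaled)

print("features per fused sample:", len(fused[0]))
print("first fused sample:", [round(v, 3) for v in fused[0]])
```

Scaling before concatenation keeps the Raman counts, which are orders of magnitude larger than the ATR-FTIR absorbances in this toy data, from dominating the fused feature vector.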
To conduct the study, simulants of nuclear waste mixtures are used. These simulants consist of water as a solvent with seven dissolved sodium salts: sodium nitrate, nitrite, sulfate, carbonate, oxalate, phosphate, and acetate. Of the seven salts, four (nitrate, nitrite, sulfate, and carbonate) are considered "target species" and are included in the training dataset. The "non-target" species are the remaining three salts (oxalate, phosphate, and acetate). The non-target salts are used to simulate unanticipated feed conditions expected at the Hanford vitrification process. The quantification model used is PLSR, chosen because of its documented success in quantifying spectra [5].
Our results show that feature selection combined with data fusion of Raman and ATR-FTIR provides more accurate analysis of nuclear waste mixtures than either instrument alone. We are able to reduce mean percent errors from 43.2% (Raman) and 15.8% (ATR-FTIR) to 5.6% for a method utilizing forward selection and data fusion. This improvement is due to the processing strategies applied to the data, since the same measurements were used in every case.
This work has the potential to improve real-time monitoring at the Hanford Site in Washington State. Better data processing strategies can improve process monitoring and decrease process downtime. The approach in this work can be applied to other processes outside the domain of nuclear waste treatment. Application-specific data processing strategies are often necessitated by the challenges faced in modern processing plants.
References:
[1] I. A. Udugama et al., "The Role of Big Data in Industrial (Bio)chemical Process Operations," Ind. Eng. Chem. Res., vol. 59, no. 34, pp. 15283-15297, 2020, doi: 10.1021/acs.iecr.0c01872.
[2] S. Kocevska, G. M. Maggioni, R. W. Rousseau, and M. A. Grover, "Spectroscopic Quantification of Target Species in a Complex Mixture Using Blind Source Separation and Partial Least-Squares Regression: A Case Study on Hanford Waste," Ind. Eng. Chem. Res., vol. 60, no. 27, pp. 9885-9896, 2021, doi: 10.1021/acs.iecr.1c01387.
[3] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Comput. Electr. Eng., vol. 40, no. 1, pp. 16-28, 2014, doi: 10.1016/j.compeleceng.2013.11.024.
[4] E. Borras, J. Ferre, R. Boque, M. Mestres, L. Acena, and O. Busto, "Data fusion methodologies for food and beverage authentication and quality assessment - A review," Anal. Chim. Acta, vol. 891, 2015, doi: 10.1016/j.aca.2015.04.042.
[5] P. Tse, J. Shafer, S. A. Bryan, and A. M. Lines, "Quantification of Raman-Interfering Polyoxoanions for Process Analysis: Comparison of Different Chemometric Models and a Demonstration on Real Hanford Waste," Environ. Sci. Technol., 2021, doi: 10.1021/acs.est.1c02512.