(87c) Text Data Feature Extraction Via NLP Embeddings Methods: Robustness and Power Assessment
AIChE Spring Meeting and Global Congress on Process Safety
2023
2023 Spring Meeting and 19th Global Congress on Process Safety
Industry 4.0 Topical Conference
Emerging Technologies in Data Analytics
Tuesday, March 14, 2023 - 2:30pm to 3:00pm
Although the diversity of measurement instrumentation is increasing, sensors are not the only data sources in the databases of the chemical process industries (CPI). Text data from reports, alarms, process tags, etc. are potentially interesting and diverse sources of information. These data can capture relevant aspects that sensors cannot. Proper handling of process text data can therefore provide additional information for process diagnosis, monitoring and control.
With the recent advances in Natural Language Processing (NLP) [4], new methods are available that extract features from text data beyond simple frequency counting. The semantics, i.e., the meaning of the text, can also be encoded as structured numerical features, which can be used for process analysis. However, NLP models remain difficult to interpret and are essentially used as black boxes. Additionally, the power and robustness of this kind of model have not yet been explored in the CPI context. Therefore, we explore several NLP models for the text embedding task, in the scope of a real process, to perform an exploratory analysis of the information content and its potential value for process tuning [5]. Dimension reduction [6] and clustering [7] methods were used to assess the embedding methods and derive several robustness and power metrics.
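As an illustration of what a robustness metric for a text embedding could look like, the following is a minimal sketch and not the metrics used in this work, which the abstract does not specify. It assumes robustness is scored as the mean cosine similarity between the embedding of a text and the embedding of a typo-perturbed copy. The names `toy_embed`, `perturb` and `robustness` are hypothetical, the character n-gram embedder is only a stand-in for a real NLP embedding model, and the sample alarm strings are invented for illustration.

```python
import math
import random
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str, n: int = 3) -> Vector:
    """Hypothetical stand-in embedder: character n-gram counts.

    A real study would plug in an NLP embedding model here instead.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def perturb(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters to mimic a typing error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def robustness(texts: List[str],
               embed: Callable[[str], Vector],
               seed: int = 0) -> float:
    """Assumed metric: mean similarity between original and perturbed texts."""
    rng = random.Random(seed)
    sims = [cosine(embed(t), embed(perturb(t, rng))) for t in texts]
    return sum(sims) / len(sims)


# Invented example strings, not data from the study.
sample_texts = [
    "high pressure alarm on reactor feed line",
    "pump tripped during startup",
    "level transmitter reading unstable",
]
score = robustness(sample_texts, toy_embed)
print(f"robustness score: {score:.3f}")
```

Because `robustness` accepts any callable that maps a string to a vector, different embedding models could be compared under the same perturbation scheme. A score close to 1 would indicate that small typos barely move the embedding.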
References
[1] C. H. Goh, "Representing and reasoning about semantic conflicts in heterogeneous information systems", Thesis, Massachusetts Institute of Technology, 1997. Accessed: October 23, 2019. [Online]. Available: https://dspace.mit.edu/handle/1721.1/10713
[2] V. Sheokand and V. Singh, "Modeling Data Heterogeneity Using Big DataSpace Architecture", in Advanced Computing and Communication Technologies, vol. 452, R. K. Choudhary, J. K. Mandal, N. Auluck, and H. A. Nagarajaram, Eds. Singapore: Springer Singapore, 2016, pp. 259–268.
[3] M. S. Reis, R. D. Braatz, and L. H. Chiang, "Big Data - Challenges and Future Research Directions", Chemical Engineering Progress, Special Issue on Big Data (March), pp. 46–50, 2016.
[4] D. Antons, E. Grünwald, P. Cichy, and T. O. Salge, "The application of text mining methods in innovation research: current state, evolution patterns, and development priorities", R & D Management, vol. 50, no. 3, pp. 329–351, Jun. 2020, doi: 10.1111/radm.12408.
[5] K. Lu, A. Grover, P. Abbeel, and I. Mordatch, "Pretrained Transformers as Universal Computation Engines". arXiv, June 30, 2021. Accessed: September 1, 2022. [Online]. Available: http://arxiv.org/abs/2103.05247
[6] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", arXiv:1802.03426 [cs, stat], 2018. Accessed: October 12, 2020. [Online]. Available: http://arxiv.org/abs/1802.03426
[7] L. McInnes and J. Healy, "Accelerated Hierarchical Density Clustering", in 2017 IEEE International Conference on Data Mining Workshops (ICDMW), Nov. 2017, pp. 33–42. doi: 10.1109/ICDMW.2017.12.