(375q) Contrastive Learning to Improve Pharmaceutical Knowledge Graph Quality in Machine Learning
2024 AIChE Annual Meeting
Computing and Systems Technology Division
Interactive Session: Data and Information Systems
Tuesday, October 29, 2024 - 3:30pm to 5:00pm
Knowledge graphs (KGs) are increasingly used to represent information across diverse domains, in applications such as retrieval-augmented generation for large language models. In the pharmaceutical domain, KGs can be generated from source documents using tools such as SUSIE [1], an ontology-based [2] pharmaceutical information extraction tool. However, the resulting KGs can contain redundant, contradictory, or incorrect relations, which arise mainly from the complex sentence structures in technical source documents. To address this problem, we present a method that uses contrastive learning (CL) to rank relations by accuracy and to identify and remove erroneous ones. This method can improve the space efficiency of KG-based knowledge bases that store information extracted from technical pharmaceutical documents. We show that applying CL to identify and remove erroneous relations significantly reduces the size of the original KG while preserving key information. We also investigate how the quality of extracted KGs varies across the major categories of pharmaceutical source documents and how CL-based methods affect the KGs extracted from each category.
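The abstract does not specify the contrastive objective or the scoring rule, so the following is only a minimal sketch of the general idea, under stated assumptions: an InfoNCE-style loss as the contrastive objective, cosine similarity between a source-sentence embedding and a relation embedding as the accuracy score, and a fixed threshold for pruning. The function names, the threshold, the example triples, and the two-dimensional toy vectors are all hypothetical and are not taken from SUSIE or from the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed objective, not the authors').

    The loss is small when the anchor is closer to the positive than to
    every negative, and large otherwise.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

def rank_relations(relations):
    """Score each (triple, sentence_vec, relation_vec) and sort, best first."""
    scored = [(cosine(s, r), triple) for triple, s, r in relations]
    return sorted(scored, key=lambda item: item[0], reverse=True)

def prune(relations, threshold=0.5):
    """Keep only triples whose score reaches the (hypothetical) threshold."""
    return [triple for score, triple in rank_relations(relations) if score >= threshold]

# Toy data: hand-made vectors standing in for learned embeddings.
toy_relations = [
    (("API", "has_solubility", "low"), [1.0, 0.0], [0.9, 0.1]),
    (("API", "has_solubility", "high"), [1.0, 0.0], [-0.8, 0.6]),  # contradicts its sentence
    (("tablet", "coated_with", "HPMC"), [0.0, 1.0], [0.1, 0.95]),
]

kept = prune(toy_relations)
for triple in kept:
    print(triple)
```

In this toy setting the relation whose embedding points away from its source sentence falls below the threshold and is dropped, which shrinks the graph while the well-supported triples remain. In practice the embeddings would come from an encoder trained with the contrastive loss, and the threshold would need to be tuned per document category.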
Bibliography:
[1] Mann V., Viswanath S., Vaidyaraman S., Balakrishnan J., Venkatasubramanian V. (2023). SUSIE: Pharmaceutical CMC ontology-based information extraction for drug development using machine learning. Computers & Chemical Engineering, Volume 179, 108446. https://doi.org/10.1016/j.compchemeng.2023.108446.
[2] Remolona M. F. M., Conway M. F., Balasubramanian S., Fan L., Feng Z., Gu T., Kim H., Nirantar P. M., Panda S., Ranabothu N. R., Rastogi N., Venkatasubramanian V. (2017). Hybrid ontology-learning materials engineering system for pharmaceutical products: Multi-label entity recognition and concept detection. Computers & Chemical Engineering, Volume 107, Pages 49-60. https://doi.org/10.1016/j.compchemeng.2017.03.012.