(566f) Ontospecies: A Dynamic Knowledge Graph for the Representation of Chemical Species. | AIChE

Authors 

Pascazio, L., University of Cambridge
Akroyd, J., University of Cambridge
Mosbach, S., University of Cambridge
Kraft, M., University of Cambridge

Data is beginning to play a greater role in transforming chemical engineering research and practice. The size, complexity, and number of chemistry databases have increased continuously over time. One of the most comprehensive general public databases is PubChem. It hosts information on more than 60 million unique chemical structures and serves as a key chemical information resource for researchers in many biomedical science areas, including cheminformatics, chemical biology, and medicinal chemistry. PubChem provides various tools that fulfil criteria for simple and effective searching. However, these tools are insufficient when a search must satisfy complex criteria or when new information needs to be derived from existing data.

Since the landmark publication by Berners-Lee et al., the semantic web field has envisioned the next generation of the web in a format that is both human- and machine-readable. In recent years it has emerged as an increasingly important approach for better scientific data sharing and faster data processing using computers.

The World Avatar (TWA) project uses semantic web technologies to create a digital ‘avatar’ of the real world. The digital world is composed of a dynamic knowledge graph that contains concepts and data that describe the world, and an ecosystem of autonomous computational agents that simulate the behaviour of the world and continuously update the concepts and data. A knowledge graph (KG) is a network of data expressed as a directed graph, where the nodes of the graph are concepts or their instances (data items) and the edges of the graph are links between related concepts or instances. This provides a powerful means to host, query and traverse data, and to find and retrieve related information. The autonomous computational agents are the key to the dynamic nature of the KG. They act on the KG continuously and independently, performing various tasks with the aim of producing a self-growing, self-updating, and self-improving ecosystem.
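
The directed-graph view described above can be illustrated with a minimal Python sketch. It stores edges as (subject, predicate, object) triples and traverses them. All identifiers (`ex:Ethanol`, `ex:hasBoilingPoint`, and so on) are hypothetical placeholders chosen for illustration; they are not the actual OntoSpecies or TWA vocabulary, and a real deployment would use a triple store rather than in-memory lists.

```python
from collections import defaultdict

# Each edge is a (subject, predicate, object) triple. All names are
# hypothetical placeholders, not actual OntoSpecies or TWA identifiers.
TRIPLES = [
    ("ex:Ethanol", "rdf:type", "ex:Species"),
    ("ex:Ethanol", "ex:hasBoilingPoint", "ex:Ethanol_BP"),
    ("ex:Ethanol_BP", "ex:hasUnit", "ex:Kelvin"),
    ("ex:Water", "rdf:type", "ex:Species"),
]


def build_index(triples):
    """Index outgoing edges by subject so the graph can be traversed."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def reachable(index, start):
    """Return every node reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in index.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def instances_of(triples, concept):
    """Find all instances linked to `concept` by an rdf:type edge."""
    return sorted(s for s, p, o in triples if p == "rdf:type" and o == concept)


INDEX = build_index(TRIPLES)
```

Here `instances_of(TRIPLES, "ex:Species")` retrieves instances of a concept, and `reachable(INDEX, "ex:Ethanol")` follows links outward from one instance to its related data items.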

To create a digital representation of the world that bridges molecular-scale chemistry and real-world macroscale phenomena and enables cross-domain applications, it is crucial to have a rich general chemistry domain. Exposing the PubChem data to semantic web services may help in this regard. Because of the difficulty of dealing with data that come from different sources and are mostly collected in the form of strings, the current databases that translate data from PubChem into relational databases do not include all the information that can be accessed on the web. Properties like boiling point, melting point, density or solubility, as well as spectral information on chemical species, are currently not available in any relational database.

The purpose of this work was to develop an ontology, OntoSpecies, that describes chemical species and their properties. It aims to serve as the core of the chemistry domain of the TWA KG and, at the same time, to address some of the limitations of previous chemistry relational databases. Specifically, the resulting ontology:

- Provides knowledge on general chemistry concepts related to chemical species through the integration of PubChem data on compounds into the database using software agents. Concepts that are currently not exported in any relational database are also included in the ontology.

- Gives access to the dataset through a SPARQL endpoint. This will remove the inherent limitations of using the web-based PubChem resource (such as inability to construct complicated queries using the available web-based interfaces) by allowing a researcher to use readily available semantic technologies to query and analyze PubChem data on local computing resources.

- Uses a dynamic knowledge-graph-based approach to have a self-growing database that not only integrates data from different sources but also creates and infers knowledge through the use of software agents.
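
As a rough illustration of the SPARQL access mentioned in the list above, the following Python sketch composes a query with a numeric filter and prepares a request following the SPARQL 1.1 Protocol (a POST with a URL-encoded `query` parameter). The namespace `http://example.org/ontospecies#`, the predicate `os:hasBoilingPoint`, and the endpoint URL are assumptions made for this example; the real OntoSpecies IRIs and endpoint address are not given here. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical namespace and endpoint; the real OntoSpecies IRIs may differ.
PREFIX = "PREFIX os: <http://example.org/ontospecies#>"
ENDPOINT = "http://localhost:8080/sparql"


def build_query(max_boiling_point_k, limit=10):
    """Compose a SPARQL SELECT with a class constraint and a numeric filter."""
    return f"""{PREFIX}
SELECT ?species ?bp WHERE {{
  ?species a os:Species ;
           os:hasBoilingPoint ?bp .
  FILTER(?bp < {float(max_boiling_point_k)})
}}
ORDER BY ?bp
LIMIT {int(limit)}"""


def build_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) a SPARQL 1.1 Protocol POST request."""
    body = urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": "application/sparql-results+json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(endpoint, data=body, headers=headers)


QUERY = build_query(max_boiling_point_k=350)
REQUEST = build_request(QUERY)
```

Against a running endpoint, passing `REQUEST` to `urllib.request.urlopen` would submit the query and return the results as JSON.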

It is anticipated that our approach will play a key role in the next generation of chemical informatics. The ontological format permits advanced queries and easy data analysis and visualization. This can be used to compare chemical properties of similar compounds, find compounds with required characteristics, and automate laborious data gathering for researchers. We show how tasks like the identification of species in an unknown mixture based on its NMR spectrum, the selection of suitable solvents based on multiple criteria, or the investigation of trends in chemical properties can be addressed using SPARQL queries in combination with software agents that postprocess the query results. We also show how the ontological format is beneficial for maintaining and enriching the data set, as well as for checking the consistency and accuracy of the data. Finally, the link between OntoSpecies and other ontologies in TWA is discussed in the context of laboratory automation and cross-domain applications in the TWA ecosystem.
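
The postprocessing step for multi-criteria solvent selection might look like the sketch below. It parses a response in the W3C SPARQL 1.1 Query Results JSON Format and keeps species that satisfy two thresholds. This is not the authors' agent implementation: the variable names, IRIs, and thresholds are hypothetical, and the property values are approximate textbook figures (boiling point in K, density in g/cm³) included only for illustration, not data retrieved from OntoSpecies.

```python
import json

# A response in the SPARQL 1.1 Query Results JSON Format. Variable names and
# IRIs are hypothetical; the numbers are approximate textbook values used only
# for illustration and are not taken from OntoSpecies.
RESPONSE = json.dumps({
    "head": {"vars": ["species", "bp", "density"]},
    "results": {"bindings": [
        {"species": {"type": "uri", "value": "http://example.org/Water"},
         "bp": {"type": "literal", "value": "373.15"},
         "density": {"type": "literal", "value": "1.00"}},
        {"species": {"type": "uri", "value": "http://example.org/Ethanol"},
         "bp": {"type": "literal", "value": "351.5"},
         "density": {"type": "literal", "value": "0.789"}},
        {"species": {"type": "uri", "value": "http://example.org/Acetone"},
         "bp": {"type": "literal", "value": "329.2"},
         "density": {"type": "literal", "value": "0.784"}},
        {"species": {"type": "uri", "value": "http://example.org/Hexane"},
         "bp": {"type": "literal", "value": "341.9"},
         "density": {"type": "literal", "value": "0.655"}},
    ]},
})


def parse_rows(response_text):
    """Convert SPARQL JSON bindings into plain dictionaries with floats."""
    rows = []
    for binding in json.loads(response_text)["results"]["bindings"]:
        rows.append({
            "species": binding["species"]["value"].rsplit("/", 1)[-1],
            "bp": float(binding["bp"]["value"]),
            "density": float(binding["density"]["value"]),
        })
    return rows


def select_solvents(rows, max_bp, max_density):
    """Keep species meeting both criteria, ordered by boiling point."""
    hits = [r for r in rows if r["bp"] <= max_bp and r["density"] <= max_density]
    return [r["species"] for r in sorted(hits, key=lambda r: r["bp"])]


SELECTED = select_solvents(parse_rows(RESPONSE), max_bp=355.0, max_density=0.80)
```

The same filters could be expressed inside the SPARQL query itself; doing them afterwards is useful when the criteria depend on values computed outside the knowledge graph.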