Optical Molecular Recognition from Chemical Reaction Mechanism Images | AIChE

To support AI models that assist organic chemistry research, optical chemical structure recognition (OCSR) systems build databases by extracting molecular structure information, typically as molecular graphs or SMILES strings, from images of chemical molecules. While many chemical reaction images include only reactant and product molecules, it is not uncommon for organic reaction images to carry additional information such as partial charges and curved arrows indicating electron movement. This extra notation acts as noise: it is not structurally distinguishable from the essential parts of a molecule, such as atoms and bonds, and it causes significant difficulties in OCSR tasks.

We propose a novel pipeline that combines an image-segmentation model with an OCSR model to extract information from chemical reaction mechanisms, which typically contain rich information about electron flow. The image-segmentation model pre-processes each image by identifying and removing the curved arrows that indicate electron movement; the OCSR model then recognizes the molecular identity. Because no existing benchmark dataset specifically targets chemical reaction mechanisms, we also created a dataset of molecule images and their structural identifiers, Structural Molecular identifiers of Molecular images in Chemical Reaction Mechanisms (SMiCRM). It consists of 453 molecular images with mechanistic features such as curved arrows and partial charges, each labeled with its Simplified Molecular Input Line Entry System (SMILES) string and its Structural Data File (SDF). Compared with using the OCSR model alone on this dataset, the proposed pipeline improved exact SMILES matching from 12.09% to 64.5% and Tanimoto similarity from 15.36% to 90.97% when recognizing mechanistic molecules after curved-arrow removal. We also applied this approach to parsing complete reaction mechanism images and observed performance improvements.
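The two evaluation metrics named above can be illustrated with a minimal sketch. It makes several assumptions not stated in the abstract: fingerprints are represented as plain Python sets of "on" bit indices, the hypothetical `canonicalize` argument stands in for whatever SMILES canonicalization a cheminformatics toolkit would provide, and the function names are illustrative only. The abstract does not specify which fingerprint or aggregation was used for the reported Tanimoto figures.

```python
from typing import Callable, Sequence, Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def exact_match_rate(
    predicted: Sequence[str],
    reference: Sequence[str],
    canonicalize: Callable[[str], str] = str.strip,
) -> float:
    """Fraction of predictions whose canonical form equals the reference's.

    The default `canonicalize` only strips whitespace; a real evaluation would
    pass a proper SMILES canonicalizer here (assumption, not shown).
    """
    pairs = list(zip(predicted, reference))
    if not pairs:
        return 0.0
    hits = sum(canonicalize(p) == canonicalize(r) for p, r in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy fingerprints sharing 2 of 6 distinct bits.
    print(tanimoto({1, 2, 3, 4}, {3, 4, 5, 6}))
    # One of the two toy predictions matches its reference string.
    print(exact_match_rate(["CCO", "C=O "], ["CCO", "CO"]))
```

Exact matching is the stricter of the two measures, since a single wrong atom or bond makes the strings differ, whereas the Tanimoto coefficient gives partial credit for structures that share most of their features.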

In conclusion, this research proposes an automated and effective procedure for collecting molecular-level information from chemical reaction mechanisms and highlights further avenues for improving chemical information extraction methodologies.