(191a) Evaluation of a Targeted-QSPR Based Pure Compound Property Prediction System | AIChE

Authors 

Shacham, M. - Presenter, Ben Gurion University of the Negev
Paster, I. - Presenter, Ben Gurion University of the Negev
Tovarovski, G. - Presenter, Tel-Aviv University
Rowley, R. L. - Presenter, Brigham Young University
Brauner, N. - Presenter, Tel-Aviv University


Pure-compound property data are currently available for only a small fraction of the compounds of interest in areas as diverse as chemistry and chemical engineering, environmental engineering and environmental impact assessment, and hazard and operability analysis. Reliable methods for predicting property data are therefore needed. Current methods for predicting physical and thermodynamic properties can be classified into "group contribution" methods (see, for example, Marrero and Gani, 2001), methods based on the "corresponding-states principle" (Teja, Sandler and Patel, 1981; Poling et al., 2001), "asymptotic behavior" correlations (Marano and Holder, 1997), and Quantitative Structure-Property Relationships (QSPRs; Dearden et al., 2003). These methods typically provide a model that predicts a single property, so a separate model has to be derived for every property.

Our objective is to develop a system that can be used as a general prediction tool, irrespective of the property to be predicted. The recently developed targeted QSPR (TQSPR) method (Brauner et al., 2006; Shacham et al., 2007) is to be employed in such a system. Prediction of a property (the target property) for a particular compound (the target compound) with the TQSPR method is carried out in two stages. The first stage identifies a similarity group (typically around 20 compounds) that is structurally related to the target compound, using a large database of molecular descriptors. The similarity between a potential predictive compound and the target compound is measured by the partial correlation coefficient between their vectors of molecular descriptors. The training set is then established by selecting the first n (typically 10) compounds with the highest correlation coefficient values for which target property data are available.
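The first stage described above can be illustrated with a minimal sketch. It assumes a plain Pearson correlation between descriptor vectors as a stand-in for the similarity measure, and it uses made-up compound names, descriptor values and property values; the function names (pearson, select_training_set) are hypothetical and are not taken from the TQSPR software.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length descriptor vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den else 0.0

def select_training_set(target_desc, candidates, properties, n=10):
    """Rank candidates by similarity to the target compound and keep the
    first n for which a target-property value is available."""
    ranked = sorted(
        candidates,
        key=lambda name: pearson(target_desc, candidates[name]),
        reverse=True,
    )
    return [name for name in ranked if properties.get(name) is not None][:n]

# Hypothetical toy data: four descriptors per compound (illustration only).
target = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "A": [1.1, 2.1, 2.9, 4.2],  # similar to the target
    "B": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with the target
    "C": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the target
    "D": [1.0, 2.0, 3.0, 4.1],  # very similar, but no property data
}
properties = {"A": 500.0, "B": 300.0, "C": 550.0, "D": None}

training = select_training_set(target, candidates, properties, n=2)
print(training)
```

Compound D ranks second by similarity but is skipped because it has no property value, which mirrors the requirement that training-set members have target property data available.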
In the second stage of the TQSPR method, a stepwise regression program (SROV; Shacham and Brauner, 2003) is used to derive a linear regression model (the TQSPR) that best represents the target property values in the training set in terms of selected molecular descriptors. One to three descriptors are typically used in the TQSPR. Finally, the TQSPR is used to predict the target property of the target compound.

Kahrs et al. (2008) evaluated the TQSPR method using a database that contained 1630 molecular descriptors for 259 hydrocarbons. Only the prediction of the critical temperature (Tc) was tested in that study. The objective of the present study is to carry out a similar evaluation with a much wider variety of chemical compounds and for a large number of pure-component constant properties. To this end, a new database that contains physical property data for 1798 compounds has been established. The database includes numerical values and data uncertainties for 31 properties (critical properties, normal melting and boiling temperatures, heat of formation, flammability limits, etc.). All the property data are from the DIPPR database (Rowley et al., 2010). The database also contains 3224 molecular descriptors generated by the Dragon software, version 5.5 (DRAGON is copyrighted by TALETE srl, http://www.talete.mi.it), from minimum-energy 3-D molecular models. The 3-D molecular structures of about 1000 compounds were optimized in Gaussian 03 (Frisch et al., 2004) using B3LYP/6-311+G(3df,2p), a density functional method with a large basis set. Most of the other compounds were optimized using HF/6-31G*, a Hartree-Fock ab initio method with a medium-sized basis set.

To carry out the study, a MATLAB-based graphical user interface (GUI) has been developed. This GUI provides user access to the database and enables the various stages of the TQSPR method to be carried out using the SROV stepwise regression program.
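The second-stage regression described above can be sketched as a greedy forward selection with ordinary least squares. This is not the SROV algorithm: the stopping rule (a minimum relative reduction in the sum of squared errors, min_gain), the function names and the toy data are all assumptions made for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        if abs(M[c][c]) < 1e-12:
            raise ValueError("singular system (collinear descriptors?)")
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit(X, y, cols):
    """Least-squares fit of y = b0 + sum(b_j * x_j) over the chosen columns."""
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(AtA, Aty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, sse

def forward_stepwise(X, y, max_terms=3, min_gain=0.01):
    """Add, one at a time, the descriptor that most reduces the SSE; stop when
    the reduction falls below min_gain (a fraction of the initial SSE)."""
    chosen = []
    beta, sse = fit(X, y, chosen)  # intercept-only model
    total = sse
    while total > 0 and len(chosen) < max_terms:
        trials = []
        for j in range(len(X[0])):
            if j in chosen:
                continue
            try:
                trials.append((fit(X, y, chosen + [j]), j))
            except ValueError:
                continue  # skip descriptors that make the system singular
        if not trials:
            break
        (new_beta, new_sse), j = min(trials, key=lambda t: t[0][1])
        if (sse - new_sse) / total < min_gain:
            break
        chosen.append(j)
        beta, sse = new_beta, new_sse
    return chosen, beta

def predict(beta, cols, descriptors):
    """Apply the fitted linear model to one compound's descriptor vector."""
    return beta[0] + sum(b * descriptors[j] for b, j in zip(beta[1:], cols))

# Hypothetical training set: 6 compounds, 3 descriptors; the property depends
# only on descriptor 1 (y = 10 + 2 * x1) so the selection is easy to follow.
X = [[0.5, 1.0, 3.0], [1.5, 2.0, 1.0], [0.2, 3.0, 4.0],
     [2.0, 4.0, 1.5], [1.1, 5.0, 2.5], [0.7, 6.0, 0.5]]
y = [10.0 + 2.0 * row[1] for row in X]

chosen, beta = forward_stepwise(X, y)
prediction = predict(beta, chosen, [1.0, 3.5, 2.0])  # target compound
print(chosen, prediction)
```

Capping the model at three terms follows the statement that one to three descriptors are typically used; a production stepwise procedure would use a statistically grounded stopping criterion rather than the fixed threshold assumed here.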
The user first selects the target property and the target compound. The user can then adjust the algorithmic parameters by changing the default settings. Parameters that can be changed include, for example, the level of downsizing of the database by removing noisy descriptors and/or descriptors with low information content (e.g., using principal component analysis), the similarity measure for selecting the similarity group and training set (e.g., Euclidean distance, correlation coefficient), the clustering algorithm used, and the regression stopping criteria for adding descriptors to the TQSPR model. The identification of the TQSPR can be carried out automatically or "manually", where the user can override the program's recommendations regarding the descriptors selected for the TQSPR model. Preliminary results show that the TQSPR-based property prediction system can predict most properties for a wide variety of compounds (including hydrocarbons, organic compounds containing O, N, S and Cl atoms, etc.) within the DIPPR-recommended uncertainty level. Detailed results and their discussion will be provided in the extended abstract and in the conference presentation. We believe that the TQSPR-based property prediction system enables optimal utilization of the property-related information available in the DIPPR database for predicting properties on the basis of molecular structure. The system can be used to analyze the consistency of the data available in databases, as well as to predict unknown properties of existing or not yet synthesized compounds.

References

1. Brauner, N.; Stateva, R. P.; Cholakov, G. St.; Shacham, M. Structurally "Targeted" Quantitative Structure-Property Relationship Method for Property Prediction. Ind. Eng. Chem. Res. 2006, 45, 8430-8437.
2. Brauner, N.; Cholakov, G. St.; Kahrs, O.; Stateva, R. P.; Shacham, M. Linear QSPRs for Predicting Pure Compound Properties in Homologous Series. AIChE J. 2008, 54(4), 978-990.
3. Dearden, J. C. Quantitative Structure-Property Relationships for Prediction of Boiling Point, Vapor Pressure, and Melting Point. Environmental Toxicology and Chemistry 2003, 22(8), 1696-1709.
4. Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; et al. Gaussian 03, Revision A.6; Gaussian, Inc.: Pittsburgh, PA, 2004.
5. Kahrs, O.; Brauner, N.; Cholakov, G. St.; Stateva, R. P.; Marquardt, W.; Shacham, M. Analysis and Refinement of the Targeted QSPR Method. Computers Chem. Engng. 2008, 32(7), 1397-1410.
6. Marano, J. J.; Holder, G. D. General Equations for Correlating the Thermo-physical Properties of n-Paraffins, n-Olefins and other Homologous Series. 2. Asymptotic Behavior Correlations for PVT Properties. Ind. Eng. Chem. Res. 1997, 36, 1895.
7. Marrero, J.; Gani, R. Group-contribution based estimation of pure component properties. Fluid Phase Equilibria 2001, 183.
8. Poling, B. E.; Prausnitz, J. M.; O'Connell, J. P. Properties of Gases and Liquids, 5th Ed.; McGraw-Hill: New York, 2001.
9. Rowley, R. L.; Wilding, W. V.; Oscarson, J. L.; Giles, N. F. DIPPR Data Compilation of Pure Chemical Properties; Design Institute for Physical Properties (http://dippr.byu.edu), Brigham Young University: Provo, Utah, 2010.
10. Shacham, M.; Brauner, N. The SROV Program for Data Analysis and Regression Model Identification. Computers Chem. Engng. 2003, 27, 701.
11. Shacham, M.; Kahrs, O.; Cholakov, G. St.; Stateva, R.; Marquardt, W.; Brauner, N. The Role of the Dominant Descriptor in Targeted Quantitative Structure Property Relationships. Chem. Eng. Sci. 2007, 62(22), 6222-6233.
12. Teja, A. S.; Sandler, S. I.; Patel, N. C. A Generalized Corresponding States Principle Using Two Nonspherical Reference Fluids. Chem. Eng. J. (Lausanne) 1981, 21, 21-28.

This paper has an Extended Abstract file available; you must purchase the conference proceedings to access it.
