(223d) Molecular Descriptor Based Random Forest Predictors of Ionic Liquids
AIChE Annual Meeting
2014
2014 AIChE Annual Meeting
Computational Molecular Science and Engineering Forum
Poster Session: Computational Molecular Science and Engineering Forum (CoMSEF)
Monday, November 17, 2014 - 6:00pm to 8:00pm
PTP abstract ID 358644
MOLECULAR DESCRIPTORS BASED RANDOM FOREST PREDICTORS OF ILs PROPERTIES
Zelimir F. Kurtanjek
University of Zagreb, Croatia
Abstract
Ionic liquids (ILs) are considered a basis for numerous new green and sustainable technologies. The large number of possible cation and anion combinations makes computer-aided combinatorial design attractive for tailoring IL properties to specific technological constraints. Here, random forest models based on 797 molecular descriptors are applied to predict physical and biological properties. The models are derived for combinations of the following cations: imidazolium, pyridinium, quinolinium, ammonium, phosphonium; and anions: BF4, Cl, PF6, Br, CFNOS, NCN2, C6F18P. The matrix of molecular descriptors is decomposed by singular value decomposition to extract the maximum information contained in the variance, and also by maximization of t-statistics for class separation. In particular, models are derived for the biological impact of ILs (acetylcholinesterase inhibition), viscosity and melting temperature. The most important molecular descriptors for each property are determined from component analysis of the principal targets. The derived models are validated by tenfold cross-validation, giving average errors of about 10% for individual decision trees and considerably improved accuracy, at an error level of about 2%, for random forest predictors.
Key words: Molecular descriptors, ionic liquids, random forest model
Introduction
For the last several decades, ILs have been a focus in the development of new green and sustainable technologies. Promising examples include the production of biofuels from waste cellulose and of polymeric products from biomass feedstock.1 The practically limitless combinatorial choice of cations and anions enables tailoring of IL properties to specific technological needs. About 30 physical, chemical and biological properties are available in the NIST and Merck databases and in the literature.2-4 These combinatorial opportunities have led to the development of mathematical models for property prediction and computer-aided IL design. The models reported in the literature are mostly based on the assumption of a continuous functional relationship, linear or nonlinear, between preselected molecular identities and properties. This has produced models of acceptable applicability, but constrained to the specific preselections. To relax these constraints, this work bases the modelling on an extensive list of molecular descriptors and fingerprints that capture complex chemical information. In total about 1600 descriptors are calculated; for each ion, 663 1D-2D descriptors, 134 3D descriptors and 10 molecular fingerprints.5 The descriptor space of each ion is decomposed by singular value decomposition into 10 principal components that retain 99.2% of the variance.
Model
Molecular descriptors of the selected cations and anions form the data matrix X (NM × 2ND) = (XC, XA), and the corresponding properties (EC50, µ, Tm) form Y (NM × 3), with NM = 400 and ND = 797. Prior to modeling, all data are auto-scaled. The correlation matrices for cations and anions reveal a high degree of interdependence, with an average R2 = 0.4 (Fig. 1). Each of the matrices is represented by its singular value decomposition (SVD). The decompositions are applied to the variances for extraction of molecular similarities (Eq. 1) and to maximize the t-statistics for class separation (Eq. 2).
$$\Sigma_C\, v_{C,i} = \lambda_{C,i}\, v_{C,i}, \qquad \Sigma_A\, v_{A,i} = \lambda_{A,i}\, v_{A,i}, \qquad i = 1, 2, \ldots, N_D \tag{1}$$
$$C^{-1} V\, l_i = \lambda_i\, l_i, \qquad LDA = X \left( l_1 \;\; l_2 \;\; \cdots \;\; l_K \right) \tag{2}$$
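The auto-scaling and Pearson R2 analysis described at the start of this section can be illustrated with a minimal Python/NumPy sketch. It is not the author's R code: the function names `autoscale` and `mean_pairwise_r2` and the synthetic matrix `X_demo` are assumptions introduced only for this example.

```python
import numpy as np

def autoscale(X):
    """Center each column to zero mean and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0.0] = 1.0  # constant descriptors: avoid division by zero
    return (X - mean) / std

def mean_pairwise_r2(X):
    """Average squared Pearson correlation over all distinct descriptor pairs."""
    Z = autoscale(X)
    R = (Z.T @ Z) / (Z.shape[0] - 1)      # correlation matrix of auto-scaled data
    upper = np.triu_indices(R.shape[0], k=1)
    return float(np.mean(R[upper] ** 2))

# Synthetic stand-in for one descriptor block (NOT the paper's data):
# 400 samples, 20 correlated descriptors driven by 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.5 * rng.normal(size=(400, 20))

Z_demo = autoscale(X_demo)
r2_demo = mean_pairwise_r2(X_demo)
```

A high average R2 in such a matrix signals redundant descriptors, which is the motivation for the low-rank decomposition used in this work.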
Each IL combination is projected onto the low-dimensional spaces spanned by the first 10 corresponding SVD eigenvectors (Eq. 3).
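$$P_C = X_C \left( v_{C,1} \;\; v_{C,2} \;\; \cdots \;\; v_{C,K} \right), \qquad P_A = X_A \left( v_{A,1} \;\; v_{A,2} \;\; \cdots \;\; v_{A,K} \right) \tag{3}$$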
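The projection in Eq. 3 can be sketched in Python/NumPy as follows. This is an illustration under stated assumptions, not the author's implementation: the helper `svd_scores` and the synthetic block `Z_C` are hypothetical, and the retained-variance fraction is computed from the squared singular values.

```python
import numpy as np

def svd_scores(Z, k=10):
    """Project auto-scaled data Z onto its first k right singular vectors.

    Returns the scores P = Z V_k, the loadings V_k, and the fraction of the
    total variance retained by the k components.
    """
    Z = np.asarray(Z, dtype=float)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    V_k = Vt[:k].T                       # columns are eigenvectors of Z^T Z
    P = Z @ V_k                          # scores, shape (n_samples, k)
    retained = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    return P, V_k, retained

# Synthetic stand-in for the cation block (NOT the paper's data):
# 400 samples, 60 descriptors generated from 5 hidden factors plus small noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(400, 5))
Z_C = latent @ rng.normal(size=(5, 60)) + 0.05 * rng.normal(size=(400, 60))
Z_C = (Z_C - Z_C.mean(axis=0)) / Z_C.std(axis=0, ddof=1)

P_C, V_C, retained_C = svd_scores(Z_C, k=10)
```

The same call applied to the anion block would give the anion scores; the two score matrices together replace the full descriptor matrix as model inputs.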
Decision tree (DT) models are trained in supervised mode for classification and regression (Eq. 4), and for each model an ensemble of 500 trees forms a random forest (RF), Eq. 5.
$$\hat{Y} = DT(Y, P_C, P_A) \quad \text{or} \quad \hat{Y} = DT(Y, LDA) \tag{4}$$
$$\hat{Y} = RF(Y, P_C, P_A) \quad \text{or} \quad \hat{Y} = RF(Y, LDA) \tag{5}$$
All calculations are performed with algorithms provided in the R software.6-8
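To make the difference between a single decision tree and a random forest concrete, here is a simplified Python/NumPy sketch. It is a stand-in written for illustration and does not reproduce the R packages used in this work. All function names, the tree depth, the Gini split criterion, the use of 25 trees instead of 500 (for speed), and the synthetic four-class data are assumptions.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def majority(y):
    """Most frequent label (ties resolved by the smallest label)."""
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def grow_tree(X, y, rng, max_depth=4, n_try=None):
    """Grow a classification tree; n_try features are sampled at each split."""
    if max_depth == 0 or len(np.unique(y)) == 1:
        return majority(y)                      # leaf: predicted class
    n_features = X.shape[1]
    features = rng.choice(n_features, size=n_try or n_features, replace=False)
    best = None
    for j in features:
        for t in np.unique(X[:, j])[:-1]:       # thresholds keeping both sides non-empty
            left = X[:, j] <= t
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:                            # all sampled features are constant
        return majority(y)
    _, j, t = best
    left = X[:, j] <= t
    return (j, t,
            grow_tree(X[left], y[left], rng, max_depth - 1, n_try),
            grow_tree(X[~left], y[~left], rng, max_depth - 1, n_try))

def predict_tree(tree, x):
    """Follow the splits of one tree down to a leaf label."""
    while isinstance(tree, tuple):
        j, t, low, high = tree
        tree = low if x[j] <= t else high
    return tree

def grow_forest(X, y, rng, n_trees=25, max_depth=4):
    """Bootstrap the rows and grow one randomized tree per bootstrap sample."""
    n, p = X.shape
    n_try = max(1, int(np.sqrt(p)))
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)
        forest.append(grow_tree(X[rows], y[rows], rng, max_depth, n_try))
    return forest

def predict_forest(forest, x):
    """Majority vote over all trees."""
    return majority(np.array([predict_tree(tree, x) for tree in forest]))

# Synthetic stand-in for score matrices and four classes (NOT the paper's data)
rng = np.random.default_rng(0)
centers = rng.normal(scale=3.0, size=(4, 6))
y_all = np.repeat(np.arange(4), 45)
X_all = centers[y_all] + rng.normal(size=(180, 6))
order = rng.permutation(180)
train, test = order[:120], order[120:]

single_tree = grow_tree(X_all[train], y_all[train], rng)
forest = grow_forest(X_all[train], y_all[train], rng)
tree_error = float(np.mean([predict_tree(single_tree, x) != y
                            for x, y in zip(X_all[test], y_all[test])]))
forest_error = float(np.mean([predict_forest(forest, x) != y
                              for x, y in zip(X_all[test], y_all[test])]))
```

The forest differs from the single tree in two ways: each tree sees a bootstrap sample of the rows, and each split considers only a random subset of the features. Averaging many such decorrelated trees by voting is what usually lowers the error relative to one tree.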
Fig. 1. "Heat maps" of Pearson R2 matrices and SVD of IL descriptors (A: cations, B: anions).
Fig. 2. Decision tree (DT) models for IL classification (L low, M medium, H high, VH very high) of acetylcholinesterase inhibition: (A) LDA linear discrimination, (B) PCA decomposition.
Fig. 3. Stacked bar charts of the predicted IL toxicity classes (L, M, H, VH) for the DT(PCA), DT(LDA) and RF(PCA) models compared with experiment (Exp.). Decision tree errors: PCA 10%, LDA 13%; random forest (RF) error: 2%.
Fig. 4. Spectrum of the molecular descriptor contributions to “score 3” corresponding to the top of the PCA decision tree.
Results and Discussion
The correlation and SVD analyses presented in Fig. 1 reveal a high degree of dependence among the descriptors for the given set of ions, so that the first ten components give an effective decomposition of each of the cation and anion data sets. Results of supervised classification by individual decision trees are given in Figs. 2-3. LDA and PCA decompositions give similar classification errors of about 10% (10% for PCA and 13% for LDA, Fig. 3). However, combining 500 trees into random forest models reduces the classification error to 2%. Inspection of variable importance, i.e. the contribution of individual molecular descriptors to the key (top) tree branching, reveals that the descriptors contributing most to the level of enzyme inhibition are: 1) number of aromatic atoms and bonds; 2) sum of the absolute values of the differences between atomic polarizabilities of all bonded atoms in the molecule (including implicit hydrogens); 3) molecule complexity descriptor; 4) number of P atoms; 5) molecular distance edge between all secondary and tertiary carbons; 6) excessive molar refraction; 7) number of 6-membered rings; 8) number of 5- and 6-membered rings (includes counts from fused rings); 9) number of 6-membered rings (includes counts from fused rings) containing heteroatoms (N, O, P, S, or halogens).
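One plausible way to read descriptor contributions off a principal component score, such as the "score 3" of Fig. 4, is to rank the descriptors by the absolute value of their loadings on that component. The Python/NumPy sketch below illustrates this reading only. The function `top_descriptors`, the placeholder descriptor names and the small loading matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def top_descriptors(V, names, component, n_top=5):
    """Rank descriptors by absolute loading on one principal component.

    V has one row per descriptor and one column per component;
    `component` is 1-based so that component=3 corresponds to "score 3".
    """
    loadings = np.asarray(V, dtype=float)[:, component - 1]
    order = np.argsort(-np.abs(loadings))[:n_top]
    return [(names[i], float(loadings[i])) for i in order]

# Made-up loading matrix for four placeholder descriptors and three components
names_demo = ["desc_a", "desc_b", "desc_c", "desc_d"]
V_demo = np.array([
    [0.1,  0.2,  0.7],
    [0.5, -0.1, -0.6],
    [0.3,  0.9,  0.1],
    [0.8,  0.3, -0.2],
])

top_demo = top_descriptors(V_demo, names_demo, component=3, n_top=2)
```

In this toy matrix the third column is dominated by `desc_a` and `desc_b`, so these two are returned first; the sign of each loading shows the direction in which the descriptor moves the score.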
Acknowledgement
This work was supported by the Ministry of Science, Education and Sports of the Republic of Croatia, project 058-1252086-0589.
References
1. Liu W., Zheng F., Li J., Cooper A., An Ionic Liquid Reaction and Separation Process for Production of Hydroxymethylfurfural from Sugars, AIChE J., 2014; 60: 300-314.
2. NIST, Ionic Liquids Database, Standard Reference Database #147, <ilthermo.boulder.nist.gov/ILThermo/mainmenu.uix>.
3. The UFT/Merck Ionic Liquids Biological Effects Database, <www.il-eco.uft.uni-bremen.de>, accessed on 14/12/2013.
4. Zhang S., Lu X., Zhou Q., Li X., Zhang X., Li S., Ionic Liquids: Physicochemical Properties, Elsevier, 2009.
5. Yap C.W., PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints, J. Comp. Chem., 2010; 32: 1466-1474.
6. Svetnik V., Liaw A., Tong C., Culberson J.C., Sheridan R.P., Feuston B.P., Random forest: a classification and regression tool for compound classification and QSAR modeling, J. Chem. Inf. Comput. Sci., 2003; 43: 1947-1958.
7. Breiman L., Cutler A., <www.stat.berkeley.edu/~breiman/RandomForests>.
8. R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, <www.R-project.org>, 2011.
Therneau T., Atkinson B., Ripley B., CRAN - Package rpart, <cran.r-project.org/web/packages/rpart/index.html>, 2013.