(203e) Improving Data Sub-Selection for Supervised Tasks with Principal Covariates Regression
AIChE 2021 Annual Meeting
Topical Conference: Applications of Data Science to Molecules and Materials
Innovations in Methods of Data Science
Monday, November 8, 2021 - 4:30pm to 4:45pm
Data analyses based on linear methods are among the simplest, most robust, and most transparent approaches to automatically processing large amounts of data for supervised or unsupervised machine learning models. Principal covariates regression (PCovR) is an underappreciated method that interpolates between principal component analysis and linear regression, and can be used to conveniently reveal structure-property relations in terms of simple-to-interpret, low-dimensional maps. We have recently introduced methods that incorporate PCovR into two popular data selection approaches, CUR decomposition and Farthest Point Sampling, which iteratively identify the most diverse samples and the most discriminating features. While our approach is completely general, here we focus on systems relevant to atomistic simulations, chemistry, and materials science, fields where feature and sample selection are an increasingly common practice. Our results show that these selection methods identify data subsets that lead to better-performing models than their unsupervised counterparts, which we demonstrate with models of increasing complexity, from ridge regression to kernel ridge regression and finally feed-forward neural networks.
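To make the idea concrete, the sketch below is a minimal numpy illustration (not the authors' implementation) of a sample-space PCovR construction: ridge-regression predictions of the targets are blended with the Gram matrix through a mixing parameter alpha, and plain farthest point sampling is then run on the resulting latent coordinates. The function names, the alpha and ridge defaults, and the toy data are assumptions made for illustration; maintained implementations of PCovR and the PCov-based selectors exist, for example in the scikit-matter library.

```python
import numpy as np

def pcovr_latent(X, Y, alpha=0.5, n_components=2, ridge=1e-8):
    """PCovR latent projection via a modified Gram matrix.

    alpha = 1 recovers PCA on X; alpha = 0 projects onto the span of the
    ridge-regression predictions of Y.  X (n_samples, n_features) and
    Y (n_samples, n_targets) are assumed centered.
    """
    # Ridge-regression approximation of the targets, Y_hat = X W
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    Y_hat = X @ W
    # Modified Gram matrix: blend of structural and property information
    K_tilde = alpha * (X @ X.T) + (1.0 - alpha) * (Y_hat @ Y_hat.T)
    # Latent coordinates from the leading eigenvectors
    evals, evecs = np.linalg.eigh(K_tilde)
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))

def farthest_point_sampling(T, n_select, first=0):
    """Greedy FPS on rows of T: pick the point farthest from those already chosen."""
    chosen = [first]
    dist = np.linalg.norm(T - T[first], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(T - T[nxt], axis=1))
    return chosen

# Toy usage: pick 10 diverse, property-informed samples from random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=(20, 1)); Y -= Y.mean(axis=0)
T = pcovr_latent(X, Y, alpha=0.5, n_components=2)
print(farthest_point_sampling(T, 10))
```

Setting alpha between 0 and 1 is what distinguishes this from purely unsupervised selection: the latent space, and hence the chosen subset, reflects both the structural diversity of the inputs and their relevance to the target property.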
This work pulls from:
B. A. Helfrecht, R. K. Cersonsky, G. Fraux, and M. Ceriotti, "Structure-Property Maps with Kernel Principal Covariates Regression," Machine Learning: Science and Technology 1 (4), 2020.
R. K. Cersonsky, B. A. Helfrecht, E. A. Engel, and M. Ceriotti, "Improving Sample and Feature Selection with Principal Covariates Regression," arXiv preprint arXiv:2012.12253.