(126a) On the Value of Global Optimality in Machine Learning Problems: The Case of Feature Selection for Linear Support Vector Machines

Authors 

Beitz, A. - Presenter, Clemson University
Blenner, M., Clemson University
Scott, J., Georgia Institute of Technology
Numerical optimization is at the core of many contemporary machine learning algorithms. Often, the optimization problems of interest are nonconvex (e.g., training neural nets) or involve binary decisions (e.g., zero-norm feature selection). However, due to the extreme size of the problems of interest in this domain, the vast majority of optimization algorithms in common use are based on satisfying first-order optimality conditions, and hence furnish only locally optimal solutions, often with no indication of how far from global optimality they may be. In recent years, there has been increasing interest in methods with better optimality guarantees, either through increased capacity to escape from suboptimal local minima, or through convex relaxations or other tractable approximations with known theoretical relationships to the original problem (i.e., provable bounds on the difference between their optimal values). There have also been some attempts to apply rigorous global optimization methods, particularly in the case of problems that can be formulated as mixed-integer linear programs, although this is a very significant challenge.

The tacit assumption in all of these approaches is that a global solution will necessarily perform better than a suboptimal local solution for the learning task at hand. However, this need not be the case because the performance metric of interest in most machine learning tasks (e.g., accurate classification of new data points in the future) is distinct from the objective function minimized during optimization (e.g., classification error for given training data, along with additional regularization terms). Given this fact, what is the value of achieving global optimality in such problems? When should we expect a clear correlation between suboptimality during training and degraded performance in practice? And how strong is this correlation (i.e., how hard should we be willing to try to achieve global optimality)?

In this talk, we will present our recent results investigating the value of global optimality in the context of zero-norm feature selection for linear support vector machines (SVMs). Linear SVMs aim to classify data points in a high-dimensional space into two or more classes using separating hyperplanes. The individual components of each data point are called features. For example, the features may represent expression levels for various genes in a cell, while the classes might represent different disease states. In this context, feature selection refers to the problem of selecting a specified number K of features (e.g., genes) from the full set such that classifying the data based on only those K features is as good as or better than classifying the data based on any other choice of K features. This problem is motivated by the observation that classifiers based on fewer features often generalize better from the training data to new data sets, especially when there are many noisy features in the original data set. Moreover, some interesting applications call for a very aggressive reduction in the number of features used for classification. For example, in the design of biosensors that aim to infer the levels of pollutants or toxins in their environment based on the gene expression of immobilized microbes, it is impractical to accurately measure more than a small number of genes (~5).
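To make the selection of K features concrete: one common way to encode it is a big-M mixed-integer formulation in which a binary variable z_j switches each hyperplane weight w_j on or off. The sketch below, written with cvxpy, is only an illustrative formulation under assumed parameters C (soft-margin penalty) and M (big-M bound on the weights); it is not necessarily the exact model used in this work, and it requires an MIQP-capable solver.

    import cvxpy as cp

    def zero_norm_svm(X, y, K, C=1.0, M=10.0):
        # Illustrative big-M MIQP for training a linear SVM restricted to at most K features.
        # X: (n, d) array of training points; y: array of labels in {-1, +1}.
        n, d = X.shape
        w = cp.Variable(d)                        # hyperplane weights
        b = cp.Variable()                         # hyperplane offset
        xi = cp.Variable(n, nonneg=True)          # slack (misclassification) variables
        z = cp.Variable(d, boolean=True)          # z_j = 1 if feature j is retained
        constraints = [
            cp.multiply(y, X @ w + b) >= 1 - xi,  # soft-margin classification constraints
            w <= M * z,                           # big-M link: w_j = 0 whenever z_j = 0
            w >= -M * z,
            cp.sum(z) <= K,                       # retain at most K features
        ]
        objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
        cp.Problem(objective, constraints).solve()  # needs an MIQP-capable solver (e.g., GUROBI)
        return w.value, b.value, z.value

Dropping the binary variables, or relaxing them to the interval [0, 1], yields the kind of tractable approximation discussed above, trading the guarantee of global optimality for speed.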

In general, zero-norm feature selection is a very large-scale mixed-integer quadratic program (MIQP) that is difficult to solve to global optimality. However, modern MIQP solvers are capable of solving such problems for data sets of modest size. Using several standard data sets in this size range from various application areas, we tested the classification accuracy of SVMs with fixed numbers of features obtained by solving the zero-norm feature selection problem to guaranteed global optimality, as well as by a variety of more efficient methods that are in common use but furnish only approximate solutions. Moreover, to more directly test the correlation between suboptimality and classification accuracy, we also developed a rigorous technique for computing a sequence of progressively more suboptimal SVMs for each test problem in a controlled manner, ranging from globally optimal solutions to solutions with objective values that deviate from global optimality by up to 25%. For all of the SVMs obtained, we then evaluated their classification errors on new data using a standard five-fold cross-validation procedure.

Interestingly, our results indicate that, when a moderate to large number of features is allowed to be retained (>15%), there is very little correlation between suboptimality and classification error. In other words, SVMs that achieved objective values within 10-20% of global optimality during training performed, on average, just as well as global solutions when classifying new data. In contrast, when only a small number of features is allowed to be retained (~5%), as in the biosensor example mentioned above, a very clear positive correlation emerges between suboptimality during training and degraded classification performance on new data sets. Moreover, in comparisons with SVMs obtained by existing state-of-the-art feature selection algorithms, the SVMs obtained by rigorously solving the feature selection problem to global optimality were not substantially more accurate at high feature selection levels, but showed significant advantages at low feature selection levels (~5%). In this talk, we will present the data supporting these conclusions for several test cases, provide hypotheses for these surprising trends with supporting evidence, and discuss the implications of these trends for the usefulness and practicality of rigorous global optimization for feature selection.
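For reference, the out-of-sample evaluation described above can be carried out roughly as follows. This is only a minimal sketch using scikit-learn, with a hypothetical array 'selected' indicating the retained features; it is not the authors' validation code.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import LinearSVC

    def five_fold_error(X, y, selected, C=1.0):
        # Mean five-fold test error of a linear SVM restricted to the selected features.
        errors = []
        folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        for train_idx, test_idx in folds.split(X, y):
            clf = LinearSVC(C=C)
            clf.fit(X[train_idx][:, selected], y[train_idx])
            accuracy = clf.score(X[test_idx][:, selected], y[test_idx])
            errors.append(1.0 - accuracy)
        return float(np.mean(errors))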