(227h) Global Optimization for Clustering Problems with 200,000 Samples

Conference

AIChE Annual Meeting

Year

2021

Proceeding

2021 Annual Meeting

Group

Computing and Systems Technology Division

Session

Advances in global optimization

Time

Tuesday, November 9, 2021 - 10:13am to 10:32am

Authors

Shi, M. - Presenter, University of British Columbia

Hua, K., University of British Columbia

Cao, Y., The University of British Columbia

Clustering is the prototypical unsupervised learning activity that identifies cohesive and well-differentiated groups of records in data [1]. In this paper, we focus on a fundamental target of the clustering problems that minimize the within-cluster sum-of-squared-error, which can be formulated as a Mixed Integer Second Order Cone Programming (MISOCP) problem. There are many heuristic methods proposed for solving the minimum sum-of-squares clustering (MSSC) task. For instance, the k-means clustering algorithm [2] provides a coordinated descent-based method to produce a local solution. However, due to the non-convexity of the MSSC objectives, the classic k-means algorithm is sensitive to the initialization and easy to fall under the local minimum [3].

The MSSC task is rarely investigated in the literature through deterministic optimization to find its global optimal value. One reason is that a classic global optimization algorithm based on branch and bound (BB) scheme needs to perform branching on all binary variables, and the number of binary variables increases linearly as the number of data samples. Therefore, off-the-shelf global solvers cannot solve clustering problems even with hundreds of data points. In this presentation, we proposed a tailed reduced-space BB algorithm and designed four approaches to construct lower and upper bounds at each node in the BB scheme. One key advantage of this reduced-space algorithm is that it only needs to perform branching on the centers of clusters to guarantee convergence, and the size of centers is independent of the number of data samples. Another critical property of this algorithm is that both lower bounds and upper bounds can be obtained without solving any optimization problems. These two properties enable our algorithm to be scalable to the dataset with hundreds of thousands of data points. We performed numerical experiments on both synthetic and real-world datasets and compared our proposed algorithms with the off-the-shelf global optimal solvers and classical local optimal algorithms. The results reveal a strong performance and scalability of our algorithm. For example, the parallel version of our algorithm can solve the MSSC problem to an optimality gap of 1% on a synthetic dataset of 210,000 samples (i.e., 100 times larger than the state-of-the-art method [4] in the literature), using 100 CPU cores.

[1] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651â€“666, 2010.

[2] Lloyd, S. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129â€“137, 1982.

[3] Xu, J. and Lange, K. Power k-means clustering. In International Conference on Machine Learning, pp. 6921â€“6931, 2019.

[4]. Aloise, D., Hansen, P. and Liberti, L. An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming, 131(1):195-220, 2012.

Topics

Process Design & Development

Other Sites & Tools

Technical Groups

Technical

Professional/Personal Growth

Societal Needs

Leadership

2025 Spring Meeting and 21st Global Congress on Process Safety

2024 AIChE Annual Meeting

Upcoming Conferences & Events

CEP: October 2024

CEP: September 2024

Explore Areas of Advancement:

Learning Center:

Want to be an Entrepreneur? Personal Stories From Three Successful Entrepreneurs Who Have Traveled This Path.