SOS-SDP is an exact algorithm based on the branch-and-bound technique for solving the Minimum Sum-of-Squares Clustering (MSSC) problem described in the paper "SOS-SDP: an Exact Solver for Minimum Sum-of-Squares Clustering". This repository contains the C++ source code, the MATLAB scripts, and the datasets used for the experiments.
V. Piccialli, A. M. Sudoso, A. Wiegele, SOS-SDP: an Exact Solver for Minimum Sum-of-Squares Clustering, INFORMS Journal on Computing (2022), https://doi.org/10.1287/ijoc.2022.1166.
SOS-SDP calls the semidefinite programming solver SDPNAL+ by using the MATLAB Engine API for C++. It requires the MATLAB engine library libMatlabEngine and the Matlab Data Array library libMatlabDataArray. SOS-SDP uses the Armadillo library to handle matrices and linear algebra operations efficiently. Before installing Armadillo, first install OpenBLAS and LAPACK along with the corresponding development files. SOS-SDP implements a configurable thread pool of POSIX threads to speed up the branch-and-bound search.
Ubuntu and Debian instructions:
- Install MATLAB (>= 2016b).
- Install CMake, OpenBLAS, LAPACK and Armadillo:
sudo apt-get update
sudo apt-get install cmake libopenblas-dev liblapack-dev libarmadillo-dev
- Open the makefile
clustering_c++/Makefile
and set the variablematlab_path
with your MATLAB root folder. - Compile the code:
cd clustering_c++/
make
- Download SDPNAL+, move the folder
clustering_matlab
containing the MATLAB source code of SOS-SDP in the SDPNAL+ main directory and set the parameterSDP_SOLVER_FOLDER
of the configuration file accordingly. This folder and its subfolders will be automatically added to the MATLAB search path when SOS-SDP starts.
The code has been tested on Ubuntu Server 20.04 with MATLAB R2020b and Armadillo 10.2.
Various parameters used in SOS-SDP can be modified in the configuration file clustering_c++/config.txt
:
BRANCH_AND_BOUND_TOL
- optimality tolerance of the branch-and-boundBRANCH_AND_BOUND_PARALLEL
- thread pool size: single thread (1), multi-thread (> 1)BRANCH_AND_BOUND_MAX_NODES
- maximum number of nodesBRANCH_AND_BOUND_VISITING_STRATEGY
- best first (0), depth first (1), breadth first (2)SDP_SOLVER_SESSION_THREADS_ROOT
- number of threads for the MATLAB session at the rootSDP_SOLVER_SESSION_THREADS
- number of threads for the MATLAB session for the ML and CL nodesSDP_SOLVER_FOLDER
- full path of the SDPNAL+ folderSDP_SOLVER_TOL
- accuracy of SDPNAL+SDP_SOLVER_VERBOSE
- do not display log (0), display log (1)SDP_SOLVER_MAX_CP_ITER_ROOT
- maximum number of cutting-plane iterations at the rootSDP_SOLVER_MAX_CP_ITER
- maximum number of cutting-plane iterations for the ML and CL nodesSDP_SOLVER_CP_TOL
- cutting-plane tolerance between two consecutive cutting-plane iterationsSDP_SOLVER_MAX_INEQ
- maximum number of valid inequalities to addSDP_SOLVER_INHERIT_PERC
- fraction of inequalities to inheritSDP_SOLVER_EPS_INEQ
- tolerance for checking the violation of the inequalitiesSDP_SOLVER_EPS_ACTIVE
- tolerance for detecting the active inequalitiesSDP_SOLVER_MAX_PAIR_INEQ
- maximum number of pair inequalities to separateSDP_SOLVER_PAIR_PERC
- fraction of the most violated pair inequalities to addSDP_SOLVER_MAX_TRIANGLE_INEQ
- maximum number of triangle inequalities to separateSDP_SOLVER_TRIANGLE_PERC
- fraction of the most violated triangle inequalities to addKMEANS_SDP_BASED
- constrained k-means with k-means++ initialization (0), sdp-based initialization (1)KMEANS_MAX_ITER
- maximum number of k-means iterationsKMEANS_N_START
- number of times k-means is runKMEANS_VERBOSE
- do not display log (0), display log (1)
cd clustering_c++/
./bb <DATASET> <K> <LOG>
DATASET
- the path of the datasetK
- the number of clustersLOG
- the path of the log file
The dataset file contains the data points x_ij
and the must include an header line with the problem size n
and the dimension d
:
n d
x_11 x_12 ... x_1d
x_21 x_22 ... x_2d
...
...
x_n1 x_n2 ... x_nd
The log file reports the progress of the algorithm:
N
- size of the current nodeNODE_PAR
- id of the parent nodeNODE
- id of the current nodeLB_PAR
- lower bound of the parent nodeLB
- lower bound of the current nodeFLAG
- termination flag of SDPNAL+0
- SDP is solved to the required accuracy1
- SDP is not solved successfully-1, -2, -3
- SDP is partially solved successfully
TIME (s)
- computational time in seconds of the current nodeCP_ITER
- number of cutting-plane iterationsCP_FLAG
- termination flag of the cutting-plane procedure-3
- current bound is worse than the previous one-2
- SDP is not solved successfully-1
- maximum number of iterations0
- no violated inequalities1
- maximum number of inequalities2
- node must be pruned3
- cutting-plane tolerance
CP_INEQ
- number of inequalities added in the last cutting-plane iterationPAIR TRIANGLE CLIQUE
- average number of added cuts for each class of inequalitiesGUB
- global upper boundI J
- current branching decisionNODE_GAP
- gap at the current nodeGAP
- overall gapOPEN
- number of open nodes
The log file prints the optimal minimum sum-of-squares (MSSC) objective and the cluster indicator matrix when the algorithm ends (point_id, cluster_id)
.
Log file example:
DATA_PATH, n, d, k: /home/ubuntu/SOS-SDP/datasets/glass.txt 214 9 6
LOG_PATH: /home/ubuntu/SOS-SDP/results/glass_6.txt
BRANCH_AND_BOUND_TOL: 1e-04
BRANCH_AND_BOUND_PARALLEL: 16
BRANCH_AND_BOUND_MAX_NODES: 100
BRANCH_AND_BOUND_VISITING_STRATEGY: 0
SDP_SOLVER_SESSION_THREADS_ROOT: 16
SDP_SOLVER_SESSION_THREADS: 1
SDP_SOLVER_FOLDER: /home/ubuntu/SOS-SDP/SDPNAL+/
SDP_SOLVER_TOL: 1e-05
SDP_SOLVER_VERBOSE: 0
SDP_SOLVER_MAX_CP_ITER_ROOT: 80
SDP_SOLVER_MAX_CP_ITER: 40
SDP_SOLVER_CP_TOL: 1e-05
SDP_SOLVER_MAX_INEQ: 100000
SDP_SOLVER_INHERIT_PERC: 1
SDP_SOLVER_EPS_INEQ: 0.0001
SDP_SOLVER_EPS_ACTIVE: 1e-06
SDP_SOLVER_MAX_PAIR_INEQ: 100000
SDP_SOLVER_PAIR_PERC: 0.05
SDP_SOLVER_MAX_TRIANGLE_INEQ: 100000
SDP_SOLVER_TRIANGLE_PERC: 0.05
KMEANS_SDP_BASED: 1
KMEANS_MAX_ITER: 100
KMEANS_N_START: 50
KMEANS_VERBOSE: 0
| N| NODE_PAR| NODE| LB_PAR| LB| FLAG| TIME (s)| CP_ITER| CP_FLAG| CP_INEQ| PAIR TRIANGLE CLIQUE| GUB| I J| NODE_GAP| GAP| OPEN|
| 214| -1| 0| -inf| 72.9391| 0| 151| 7| 3| 3014| 639.857 4448 8.71429| 72.9709*| -1 -1| 0.000436087| 0.000436087| 0|
| 214| 0| 1| 72.9391| 72.9644| -1| 12| 0| 2| 3014| 0 0 0| 72.9709| 107 201| 8.8639e-05| 8.8639e-05| 0|
| 213| 0| 2| 72.9391| 72.9487| 0| 31| 1| -3| 2485| 0 922 0| 72.9647*| 107 201| 0.000220183| 0.000220183| 0|
| 212| 2| 3| 72.9487| 72.9643| 0| 8| 0| 2| 2043| 0 0 0| 72.9647| 32 51| 5.25193e-06| 5.25193e-06| 0|
| 213| 2| 4| 72.9487| 72.9789| 0| 12| 0| 2| 2485| 0 0 0| 72.9647| 32 51| -0.00019452| -0.00019452| 0|
WALL_TIME: 199 sec
N_NODES: 5
AVG_INEQ: 1203.71
AVG_CP_ITER: 1.6
ROOT_GAP: 0.000436087
GAP: 0
BEST: 72.9647
V. Piccialli, A. Russo Russo, A. M. Sudoso, An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering. Computers & Operations Research (2022).
Piccialli, V., Sudoso, A. M. Global optimization for cardinality-constrained minimum sum-of-squares clustering via semidefinite programming. Mathematical Programming (2023).