The Breathing K-Means Algorithm (with examples)

Breathing k-means is an approximation algorithm for the k-means problem that, on average, finds better solutions (higher solution quality) and runs faster (lower CPU time usage) than k-means++.

Preprint: https://arxiv.org/abs/2006.15666 (submitted for publication)

Upon request, comparative experiments with the "Hartigan-Wong" algorithm (the default k-means method in R) were run after submission. They confirmed the choice of k-means++ as the point of reference.

Typical results for the "Birch1" data set (100000 points drawn from a mixture of 100 circular Gaussians), k=100:

Can you spot the mistakes? :-)

Installation from PyPI

pip install bkmeans

Local installation to run the examples

Clone the repository

git clone https://github.com/gittar/breathing-k-means

Enter the top directory.

cd breathing-k-means

Create the conda environment 'bkm' (or any other name) via

conda env create -n bkm -f environment.yml

Activate the created environment via

conda activate bkm

To run a jupyter notebook with examples, type, e.g.:

jupyter lab notebooks/2D.ipynb

Content

The top-level folder contains the following subfolders:

data/ - data sets used in the notebooks
notebooks/ - Jupyter notebooks with all examples from the preprint
  - 2D.ipynb - 2D problems executed with helper functions for brevity
  - 2D_detail.ipynb - 2D problems executed with the raw API
  - 10+D.ipynb - high-dimensional problems based on the data sets from the original k-means++ publication
src/
  - bkmeans.py - reference implementation of breathing k-means
misc/
  - aux.py - auxiliary functions
  - dataset.py - general class to administer and plot data sets
  - runfunctions.py - wrapper functions used in the notebooks

API

The included class BKMeans is subclassed from scikit-learn's KMeans class and therefore has the same API. It can be used as a drop-in replacement for scikit-learn's KMeans.

There is one new parameter, which can be ignored (left at its default) for normal usage:

m (breathing depth), default: 5

The parameter m can, however, also be used to obtain faster (1 < m < 5) or better (m > 5) solutions. For details, see the preprint.
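To get a feel for the speed/quality trade-off controlled by m, one could fit the same data with several breathing depths and record SSE and wall-clock time for each. The helper below is only a sketch: the function name sweep_breathing_depth and the chosen depths are hypothetical and not part of the package. It only assumes an estimator class that accepts the keyword arguments n_clusters and m and exposes inertia_ after fit, as described above for BKMeans.

```python
import time


def sweep_breathing_depth(estimator_cls, X, n_clusters, depths=(2, 5, 10)):
    """Fit one estimator per breathing depth m.

    Returns a dict {m: (sse, seconds)}. estimator_cls is assumed to accept
    n_clusters and m as keyword arguments and to set inertia_ in fit().
    """
    results = {}
    for m in depths:
        model = estimator_cls(n_clusters=n_clusters, m=m)
        start = time.perf_counter()
        model.fit(X)
        elapsed = time.perf_counter() - start
        results[m] = (model.inertia_, elapsed)
    return results
```

With the package installed, a call such as sweep_breathing_depth(BKMeans, X, n_clusters=100) would be one way to use it. Since the algorithm is randomized, a single run per depth gives only a rough impression.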

Example 1: running on simple random data set

Code:

import numpy as np
from bkmeans import BKMeans

# generate random data set
X=np.random.rand(1000,2)

# create BKMeans instance
bkm = BKMeans(n_clusters=100)

# run the algorithm
bkm.fit(X)

# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)

Output:

1.1775040547902602
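
For reference, the SSE (inertia) reported above is the sum of squared Euclidean distances from each data point to its nearest centroid. A minimal NumPy sketch of that definition follows; the function name sse is a hypothetical helper, not part of the package.

```python
import numpy as np


def sse(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(centroids, dtype=float)
    # squared distances between all points and all centroids,
    # shape (n_points, n_centroids)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # keep the nearest centroid for each point and add everything up
    return float(d2.min(axis=1).sum())
```

Because BKMeans subclasses scikit-learn's KMeans, sse(X, bkm.cluster_centers_) should agree with bkm.inertia_ up to floating-point error.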

Example 2: comparison with k-means++ (multiple runs)

Code:

import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans

# random 2D data set
X=np.random.rand(1000,2)

# number of centroids
k=100

for i in range(5):
    # kmeans++
    km = KMeans(n_clusters=k)
    km.fit(X)

    # breathing k-means
    bkm = BKMeans(n_clusters=k)
    bkm.fit(X)

    # relative SSE improvement of bkm over km++
    imp = 1 - bkm.inertia_/km.inertia_
    print(f"SSE improvement over k-means++: {imp:.2%}")

Output:

SSE improvement over k-means++: 3.38%
SSE improvement over k-means++: 4.16%
SSE improvement over k-means++: 6.14%
SSE improvement over k-means++: 6.79%
SSE improvement over k-means++: 4.76%
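
Both algorithms are randomized, so the improvement varies from run to run (the five values above average roughly 5%). One way to summarize several runs is sketched below. It uses the same formula as the example, 1 - SSE_new / SSE_ref; the function names and the input numbers are illustrative assumptions, not measured results.

```python
from statistics import mean


def relative_improvement(sse_new, sse_ref):
    """Fraction by which sse_new is lower than sse_ref (negative if worse)."""
    return 1 - sse_new / sse_ref


def summarize(sse_pairs):
    """sse_pairs: iterable of (sse_new, sse_ref) tuples, one per run."""
    imps = [relative_improvement(new, ref) for new, ref in sse_pairs]
    return {"mean": mean(imps), "min": min(imps), "max": max(imps)}


# illustrative numbers only, not measured results
summary = summarize([(95.0, 100.0), (45.0, 50.0)])
print(f"mean improvement: {summary['mean']:.2%}")
```

In the loop of Example 2, one would collect (bkm.inertia_, km.inertia_) in each iteration and pass the resulting list to summarize.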

Acknowledgements

Kudos go to the scikit-learn team for their excellent sklearn.cluster.KMeans class, and to the developers and maintainers of the other packages used: numpy, scipy, matplotlib, and jupyterlab.
