Efficient Clustering via Kernel Principal Component Analysis and Optimal One Dimensional Clustering

About Research Work

Motivation:

Traditional approaches for clustering high dimensional data involve dimensionality reduction followed by classical clustering algorithms such as k-means in lower dimensions.
However, approaches based on k-means clustering suffer from the drawbacks and limitations of k-means clustering, namely, high dependency on initialization of cluster centroids, non-repeatability of clustering results. Also, k-means can converge locally and hence does not guarantee optimal clustering.

Proposed Approach:

An optimal clustering approach in one dimension based on dimensionality reduction is proposed.
One dimensional representation of high dimensional data is obtained using Kernel Principal Component Analysis using a suitable kernel function such as Radial Basis Function(RBF).
One dimensional representation is then clustered optimally using dynamic programming algorithm in polynomial time

Contribution:

Developed a program using Python and C programming languages for implementing proposed clustering approach.
Using Silhouette score as a metric to evaluate quality of clustering, the advantages of proposed approach over traditional k-means based approaches are demonstrated.
For testing the proposed approach, a real world high dimensional dataset and a synthetic two dimensional dataset is used.

Programming Languages and Libraries

Programming Languages:

Python
C

Python Libraries

Numpy
Pandas
scikit-learn
Matplotlib
traceback
datetime

Program file Description and Usage

To execute the program, run main.py program. It automatically calls the code present in other files as required. Following is a brief description of program files:

paths.py - Specify dataset,program and output(result) paths in this file.
main.py - Main python program to be executed.
c_thresholding_new.c(C program) - Implements one dimensional optimal clustering.
kmeans.py - Implements k-means clustering algorithm.
hungarian.py - Implements Hungarian algorithm for class-cluster assignment.
silhouette.py - Calculates Silhouette Score for resulting clusters.
gen_moonpairs.py - Generates specified number of half-moon pairs at random positions, angles in 2 dimensional plane.
dataset.py - Used for loading, handling datasets.
labelscan.py - Converts non-numeric class categories to numeric labels.
logger.py - Logs important information(such as runtime exception handling stacktrace), variables values during the program execution.
plot_graph.py - Plots the graph of "Numbers of Clusters"(X-axis) vs "Silhouette Score"(Y-axis) for comparing clustering performance.

Datasets Used

English Letters.csv - Labeled real world dataset containing 20,000 samples of 26 letters of English alphabets. Each sample is a 16 dimensional encoding, and there 26 different class labels corresponding to 26 letters of the English alphabet. Class labels(letter names) are in same file.
moons_data.csv - Labeled dataset datset containing 25 half moon pairs, that is 50 clusters corresponding to 50 individual half moons. Any number of half moon pairs can be generated using gen_moonpairs.py program.
groundtruth.csv - Contains class labels for data points in moon_data.csv.

Contact Information

Nachiket Bhide - bhiden@uwindsor.ca
Dr. Luis Rueda - lrueda@uwindsor.ca

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Efficient Clustering via Kernel Principal Component Analysis and Optimal One Dimensional Clustering

About Research Work

Motivation:

Proposed Approach:

Contribution:

Programming Languages and Libraries

Programming Languages:

Python Libraries

Program file Description and Usage

Datasets Used

Contact Information

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
English Letters.csv		English Letters.csv
README.md		README.md
c_thresholding_new.c		c_thresholding_new.c
dataset.py		dataset.py
gen_moonpairs.py		gen_moonpairs.py
groundtruth.csv		groundtruth.csv
hungarian.py		hungarian.py
kmeans.py		kmeans.py
labelscan.py		labelscan.py
logger.py		logger.py
main.py		main.py
moons_data.csv		moons_data.csv
paths.py		paths.py
plot_graph.py		plot_graph.py

Nachiket-Bhide/Masters-Thesis-Research-Work

Folders and files

Latest commit

History

Repository files navigation

Efficient Clustering via Kernel Principal Component Analysis and Optimal One Dimensional Clustering

About Research Work

Motivation:

Proposed Approach:

Contribution:

Programming Languages and Libraries

Programming Languages:

Python Libraries

Program file Description and Usage

Datasets Used

Contact Information

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages