- Traditional approaches for clustering high dimensional data involve dimensionality reduction followed by classical clustering algorithms such as k-means in lower dimensions.
- However, approaches based on k-means clustering suffer from the drawbacks and limitations of k-means clustering, namely, high dependency on initialization of cluster centroids, non-repeatability of clustering results. Also, k-means can converge locally and hence does not guarantee optimal clustering.
- An optimal clustering approach in one dimension based on dimensionality reduction is proposed.
- One dimensional representation of high dimensional data is obtained using Kernel Principal Component Analysis using a suitable kernel function such as Radial Basis Function(RBF).
- One dimensional representation is then clustered optimally using dynamic programming algorithm in polynomial time
- Developed a program using Python and C programming languages for implementing proposed clustering approach.
- Using Silhouette score as a metric to evaluate quality of clustering, the advantages of proposed approach over traditional k-means based approaches are demonstrated.
- For testing the proposed approach, a real world high dimensional dataset and a synthetic two dimensional dataset is used.
- Python
- C
- Numpy
- Pandas
- scikit-learn
- Matplotlib
- traceback
- datetime
To execute the program, run main.py program. It automatically calls the code present in other files as required. Following is a brief description of program files:
- paths.py - Specify dataset,program and output(result) paths in this file.
- main.py - Main python program to be executed.
- c_thresholding_new.c(C program) - Implements one dimensional optimal clustering.
- kmeans.py - Implements k-means clustering algorithm.
- hungarian.py - Implements Hungarian algorithm for class-cluster assignment.
- silhouette.py - Calculates Silhouette Score for resulting clusters.
- gen_moonpairs.py - Generates specified number of half-moon pairs at random positions, angles in 2 dimensional plane.
- dataset.py - Used for loading, handling datasets.
- labelscan.py - Converts non-numeric class categories to numeric labels.
- logger.py - Logs important information(such as runtime exception handling stacktrace), variables values during the program execution.
- plot_graph.py - Plots the graph of "Numbers of Clusters"(X-axis) vs "Silhouette Score"(Y-axis) for comparing clustering performance.
- English Letters.csv - Labeled real world dataset containing 20,000 samples of 26 letters of English alphabets. Each sample is a 16 dimensional encoding, and there 26 different class labels corresponding to 26 letters of the English alphabet. Class labels(letter names) are in same file.
- moons_data.csv - Labeled dataset datset containing 25 half moon pairs, that is 50 clusters corresponding to 50 individual half moons. Any number of half moon pairs can be generated using gen_moonpairs.py program.
- groundtruth.csv - Contains class labels for data points in moon_data.csv.
- Nachiket Bhide - bhiden@uwindsor.ca
- Dr. Luis Rueda - lrueda@uwindsor.ca