This is a course project for the University of Alberta course CMPUT 697, Fall 2019. The project aims to improve the clustering performance of HDBSCAN, a well-known hierarchical density-based clustering algorithm, by automatically removing outliers.
We propose 6 different methods that leverage well-known algorithms to remove outliers from data automatically.
Experiments on simulated data demonstrate that one of these variants consistently performs well at automatically removing
noise, thus improving the performance of HDBSCAN.
Details of the dataset
For this task, 6 datasets were generated with ground truth values. Each dataset is two-dimensional, with a varying number of clusters, cluster densities, and noise distributions. The following figures show a visual representation of the datasets and statistics about the data.
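As a rough illustration of what such a dataset looks like, the sketch below generates 2D Gaussian clusters of different densities plus uniform background noise, and keeps a ground truth label for every point. It is not the generator used for the project: the function name `make_dataset`, the cluster parameters, the noise label `-1`, and the bounding box are all assumptions chosen for the example.

```python
import random

NOISE = -1  # assumed ground truth label for noise points


def make_dataset(cluster_specs, n_noise, bounds=(0.0, 10.0), seed=0):
    """Return (points, labels) for Gaussian blobs plus uniform noise.

    cluster_specs: list of (center_x, center_y, std, n_points) tuples.
    A smaller std with the same n_points gives a denser cluster.
    """
    rng = random.Random(seed)
    points, labels = [], []

    # One Gaussian blob per cluster, labelled 0, 1, 2, ...
    for label, (cx, cy, std, n_points) in enumerate(cluster_specs):
        for _ in range(n_points):
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
            labels.append(label)

    # Uniform background noise over the bounding box.
    low, high = bounds
    for _ in range(n_noise):
        points.append((rng.uniform(low, high), rng.uniform(low, high)))
        labels.append(NOISE)

    return points, labels


# Three clusters with different densities and 50 noise points (illustrative values).
specs = [(2.0, 2.0, 0.2, 200), (7.0, 7.0, 0.8, 200), (2.0, 8.0, 0.4, 100)]
points, labels = make_dataset(specs, n_noise=50)
```

Fixing the seed keeps the dataset reproducible, so different outlier-removal variants can be compared on exactly the same points.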
Results
We examine each dataset individually, reporting the number of clusters discovered, the number of ground truth clusters, the number of mis-clustered points, the number of pruned inliers, etc. We also report two performance evaluation metrics, DBCV (Density-Based Clustering Validation) and ARI (Adjusted Rand Index).
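To make two of these quantities concrete, the sketch below computes ARI from the standard pair-counting formula and counts pruned inliers. It is an illustration, not the project's evaluation code: the function names, the noise label `-1`, and the boolean `removed` mask (one flag per point, true if the outlier-removal step discarded it) are assumptions. DBCV is left out because it depends on the density structure of the data rather than on the labels alone.

```python
from collections import Counter
from math import comb

NOISE = -1  # assumed ground truth label for noise points


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same points."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")

    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: labelings agree trivially

    # Pairs of points that share a cell, a row, or a column of the contingency table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put all points in one cluster
    return (sum_cells - expected) / (max_index - expected)


def count_pruned_inliers(truth, removed):
    """Count points that are true inliers but were discarded as outliers."""
    return sum(1 for t, r in zip(truth, removed) if r and t != NOISE)
```

ARI is invariant to how the cluster labels are named, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` scores a perfect match even though the label values are swapped.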
Dataset 1:
Dataset 2:
Dataset 3:
Dataset 4:
Dataset 5:
Dataset 6: