
Add DenseClus Implementation notebook for jumpstart #60

Merged
merged 17 commits on Feb 29, 2024
Changes from 12 commits
14 changes: 9 additions & 5 deletions denseclus/DenseClus.py

@@ -593,15 +593,15 @@ def predict(self, df_new: pd.DataFrame) -> np.ndarray:
         )
         return predictions

-    def evaluate(self) -> np.array:
+    def evaluate(self, log_dbcv=False) -> np.array:
         """Evaluates the cluster and returns the cluster assigned to each row.

         This is a wrapper function for HDBSCAN. It outputs the cluster labels
         that HDBSCAN converged on.

         Parameters
         ----------
-        None : None
+        log_dbcv (bool) : Whether to log DBCV scores. Defaults to False

         Returns
         -------
@@ -612,14 +612,18 @@ def evaluate(self) -> np.array:
         clustered = labels >= 0

         if isinstance(self.hdbscan_, dict) or self.umap_combine_method == "ensemble":
-            print(f"DBCV score {self.hdbscan_['hdb_numerical'].relative_validity_}")
-            print(f"DBCV score {self.hdbscan_['hdb_categorical'].relative_validity_}")
+            if log_dbcv:
+                print(f"DBCV numerical score {self.hdbscan_['hdb_numerical'].relative_validity_}")
+                print(
+                    f"DBCV categorical score {self.hdbscan_['hdb_categorical'].relative_validity_}"
+                )
             embedding_len = self.numerical_umap_.embedding_.shape[0]
             coverage = np.sum(clustered) / embedding_len
             print(f"Coverage {coverage}")
             return labels

-        print(f"DBCV score {self.hdbscan_.relative_validity_}")
+        if log_dbcv:
+            print(f"DBCV score {self.hdbscan_.relative_validity_}")
         embedding_len = self.mapper_.embedding_.shape[0]
         coverage = np.sum(clustered) / embedding_len
         print(f"Coverage {coverage}")
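For context, the coverage metric that evaluate() prints is simply the fraction of points HDBSCAN assigned to a cluster (noise points get the label -1). A minimal sketch with hypothetical labels:

```python
import numpy as np

# Hypothetical HDBSCAN-style labels: -1 marks noise, >= 0 marks a cluster.
labels = np.array([0, 0, 1, -1, 2, -1, 1, 0, 2, 1])

# Coverage, as computed in evaluate(): clustered points / total points.
clustered = labels >= 0
coverage = np.sum(clustered) / labels.shape[0]

print(f"Coverage {coverage}")  # 8 of 10 points clustered -> 0.8
```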
2 changes: 2 additions & 0 deletions notebooks/02_TuningWithHDBSCAN.ipynb

@@ -365,6 +365,8 @@
 ],
 "source": [
  "# we will make our own scorer for DBCV\n",
+ "\n",
+ "\n",
  "def dbcv_score(X, labels):\n",
  "    return validity_index(X, labels)\n",
  "\n",
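A scorer like dbcv_score above is typically used to select HDBSCAN hyperparameters by grid search. A minimal sketch of that loop, using an illustrative stand-in for the score (the real notebook wraps hdbscan.validity.validity_index, which needs the hdbscan package; fake_dbcv_score and candidate_sizes are hypothetical names):

```python
# Illustrative stand-in for dbcv_score(X, labels): pretend cluster
# validity peaks when min_cluster_size == 15.
def fake_dbcv_score(min_cluster_size: int) -> float:
    return 1.0 - abs(min_cluster_size - 15) / 100.0

# Candidate hyperparameter values to try.
candidate_sizes = [5, 10, 15, 25, 50]

# Grid search: score each candidate, keep the best-scoring setting.
best_size = max(candidate_sizes, key=fake_dbcv_score)
print(best_size)  # -> 15
```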
2,191 changes: 2,191 additions & 0 deletions notebooks/DenseClusImplentation.ipynb
Collaborator:
FYI: I can't leave comments on specific lines because the diff is too big (and the GitHub VS Code extension doesn't support commenting on notebooks; see microsoft/vscode-pull-request-github#3462). I'll try to be descriptive in the comments.

Collaborator:
Add a comment explaining why we only keep rows where native_country == " United-States".
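The filter in question presumably looks something like the following sketch (the toy frame is hypothetical; the leading space in " United-States" is copied verbatim from the comment above, a quirk of how the values appear in the dataset):

```python
import pandas as pd

# Toy frame standing in for the notebook's dataset; note the leading
# space in " United-States", matching the raw values.
df = pd.DataFrame({
    "native_country": [" United-States", " Mexico", " United-States"],
    "age": [39, 50, 38],
})

# Restricting to a single country is the step the reviewer asks to
# document, e.g. to keep categorical cardinality manageable.
us_only = df[df["native_country"] == " United-States"]
print(len(us_only))  # -> 2
```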

Collaborator:
Add a markdown description of what/why we are doing in the "Create UMAP embeddings & Fit HdbScan for Numerical and Categorical features separately" section (e.g., "a seemingly straightforward approach may be to try clustering numerical and categorical features separately; let's use this as a baseline to compare against").

Also include a brief overview of what UMAP and HDBSCAN are; this can probably be pulled from the other notebooks.

Collaborator:
Thoughts about the baseline separate numerical/categorical cluster analysis:

  • The cluster results don't look very meaningful; could this be improved with hyperparameter optimization? That may be too much for this notebook, though, especially since this is just supposed to be a baseline and we get reasonable DenseClus results. I'm open either way here; any thoughts?
  • In the select_dtypes line, why are we dropping segment and then adding it back in the next line?
  • Can we expand the analysis to look at more than just the mean? I think other descriptive stats might help with the storytelling (but I understand the cluster quality is not good, so there isn't much of a story to tell).
  • Can we see the columns used for categorical clustering?
  • Categorical analysis points 2 and 3 seem to conflict: we are saying both that there is a small finite space where points can land and that we have a large sparse space.
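The drop-then-re-add pattern questioned in the second bullet presumably looks like this sketch (column names hypothetical): select_dtypes(include="number") picks up the segment label along with the real features, so it gets dropped from the feature matrix and then re-attached to serve as the groupby key. The third bullet's request for richer stats can be met with agg() instead of mean():

```python
import pandas as pd

# Hypothetical frame: "segment" is a numeric cluster label, so
# select_dtypes(include="number") would pull it in as a feature.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "hours_per_week": [40, 13, 40],
    "workclass": ["Private", "Self-emp", "Private"],
    "segment": [0, 1, 0],
})

# The pattern the reviewer questions: drop the label from the numeric
# feature matrix, then re-attach it as the groupby key.
numeric = df.select_dtypes(include="number").drop(columns=["segment"])
numeric["segment"] = df["segment"]

# Beyond mean(): multiple descriptive stats per cluster in one call.
summary = numeric.groupby("segment").agg(["mean", "median", "std"])
```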

Collaborator:
Can we update all of the plots to have appropriate x/y axis labels (or remove the labels) instead of "None"?

momonga-ml marked 4 conversations as resolved.


4 changes: 2 additions & 2 deletions requirements-dev.txt

@@ -1,7 +1,7 @@
 black==23.11.0
 coverage==7.3.2
-mypy==1.7.0
-nbqa==1.7.0
+mypy==1.7.1
+nbqa==1.7.1
 pre-commit==3.5.0
 pylint==3.0.2
 pytest==7.4.3
3 changes: 2 additions & 1 deletion requirements.txt

@@ -3,4 +3,5 @@ numpy>=1.20.2
 hdbscan>=0.8.27
 numba>=0.51.2
 pandas>=1.2.4
-scikit_learn>=0.24.2
+scikit_learn>=0.24.2
+seaborn>=0.13.0