Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lets umap run in parallel #3295

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

lets umap run in parallel #3295

wants to merge 6 commits into from

Conversation

Intron7
Copy link
Member

@Intron7 Intron7 commented Oct 18, 2024

Closes #3211

Lets you run all parts of UMAP in Parallel. Default will still be False for reproducibility.
Benchmarks (95k Cells AMD5950X)
parallel = False 33 s
parallel = True 18 s

I copied the doc string from UMAP to explain the impact of parallel execution.

@Intron7
Copy link
Member Author

Intron7 commented Oct 18, 2024

I didn't add release note yet because I'm not sure if this should already go into 1.10.4

Copy link

codecov bot commented Oct 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.04%. Comparing base (bbcd4b1) to head (4eba4a8).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3295      +/-   ##
==========================================
+ Coverage   76.95%   77.04%   +0.09%     
==========================================
  Files         109      110       +1     
  Lines       12465    12494      +29     
==========================================
+ Hits         9592     9626      +34     
+ Misses       2873     2868       -5     
Files with missing lines Coverage Δ
src/scanpy/tools/_umap.py 74.19% <100.00%> (+2.52%) ⬆️

... and 4 files with indirect coverage changes

@@ -146,6 +147,8 @@ def umap(
:attr:`~anndata.AnnData.obsp`\\ ``[.uns[neighbors_key]['connectivities_key']]`` for connectivities.
copy
Return a copy instead of writing to adata.
parallel
Copy link
Contributor

@ilan-gold ilan-gold Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been tested? It seems like if the underlying implementation would error out every time a random seed is set, and we set a random seed, then setting this argument as true won't work?

I don't know where I got that this should error, but then we should definitely warn users about this sort of thing. Reproducibility is very important

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand:

sc.tl.umap(adata, parallel=True, random_state=42)

works so I think this needs a test + updated comment to reflect whatever is supposed to be going on here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As does

sc.tl.umap(adata, parallel=True, random_state=np.random.RandomState(42))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test to see if it errors (it doesn't)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we should also warn users about this random state business. And check that the warning is raised every time parallel is True because this is very important!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! But then we should definitely warn in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a warning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok there is in issue with check_random_state as it transforms None into a seed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok the function is bugged on the umap side. If you force None as the random seed, it errors. I'll make an issue with umap for clarification. I'll but the PR on ice while i wait for an update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -56,6 +56,7 @@ def umap(
key_added: str | None = None,
neighbors_key: str = "neighbors",
copy: bool = False,
parallel: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have the n_jobs: int | None convention for this (with None meaning sc.settings.N_JOBS), but simplicial_set_embedding just passes parallel on to numba.

We should think about how the two integrate before we add this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is numba parallel, we can only set it to use everything you got or nothing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that’s why we should talk about the parameter name. IIRC that would be the first parallelization parameter not called n_jobs.

Comment on lines +219 to +224
if parallel and random_state is not None:
warnings.warn(
"Parallel execution was expected to be disabled when both `parallel=True` and `random_state` are set, "
"to ensure reproducibility. However, parallel execution still seems to occur, which may lead to "
"non-deterministic results."
)
Copy link
Contributor

@ilan-gold ilan-gold Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the umap repo to give user more context. Something like "UMAP reports that parallel execution should error with a random seed to ensure reproducibility. Parallel execution may occur nonetheless, which can lead to non-deterministic results." with a link to their docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then make sure this warning is raised in tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this once i have clarification from lmcinnes/umap#1155

Comment on lines 93 to 96
def test_umap_parallel_randomstate():
pbmc = pbmc68k_reduced()[:100, :].copy()
sc.tl.umap(pbmc, parallel=True, random_state=42)
sc.tl.umap(pbmc, parallel=True, random_state=np.random.RandomState(42))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one action per test. This should be parametrized. Also, we don't need both of these given what I said above. We should just check for the warning with random_state as None or not-None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I updated the test already to catch the warning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Umap parallel
3 participants