Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lets umap run in parallel #3295

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions src/scanpy/tools/_umap.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ def umap(
key_added: str | None = None,
neighbors_key: str = "neighbors",
copy: bool = False,
parallel: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have the n_jobs: int | None convention for this (with None meaning sc.settings.N_JOBS), but simplicial_set_embedding just passes parallel on to numba.

We should think about how the two integrate before we add this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is numba parallel, we can only set it to use everything you got or nothing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that’s why we should talk about the parameter name. IIRC that would be the first parallelization parameter not called n_jobs.

) -> AnnData | None:
"""\
Embed the neighborhood graph using UMAP :cite:p:`McInnes2018`.
Expand Down Expand Up @@ -146,6 +147,8 @@ def umap(
:attr:`~anndata.AnnData.obsp`\\ ``[.uns[neighbors_key]['connectivities_key']]`` for connectivities.
copy
Return a copy instead of writing to adata.
parallel
Copy link
Contributor

@ilan-gold ilan-gold Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been tested? It seems like if the underlying implementation would error out every time a random seed is set, and we set a random seed, then setting this argument as true won't work?

I don't know where I got that this should error, but then we should definitely warn users about this sort of thing. Reproducibility is very important

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand:

sc.tl.umap(adata, parallel=True, random_state=42)

works so I think this needs a test + updated comment to reflect whatever is supposed to be going on here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As does

sc.tl.umap(adata, parallel=True, random_state=np.random.RandomState(42))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test to see if it errors (it doesn't)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we should also warn users about this random state business. And check that the warning is raised every time parallel is True because this is very important!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! But then we should definitely warn in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a warning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok there is in issue with check_random_state as it transforms None into a seed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok the function is bugged on the umap side. If you force None as the random seed, it errors. I'll make an issue with umap for clarification. I'll but the PR on ice while i wait for an update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether to run the computation using numba parallel. Running in parallel is non-deterministic, and is not used if a random seed has been set, to ensure reproducibility.

Returns
-------
Expand Down Expand Up @@ -232,6 +235,7 @@ def umap(
densmap_kwds={},
output_dens=False,
verbose=settings.verbosity > 3,
parallel=parallel,
)
elif method == "rapids":
msg = (
Expand Down
Loading