Crash from ParallelMap when both partition and memory are auto (default) #370

Closed
KatharineShapcott opened this issue Oct 31, 2022 · 21 comments


@KatharineShapcott
Contributor

Describe the bug
In computational_routine.compute, the default values of partition and mem_per_job are both 'auto', which causes a crash in ACME.

To Reproduce
Steps to reproduce the behavior:

  1. Load any dataset
  2. Perform some preprocessing on it with parallel=True, e.g. filtered = spy.preprocessing(out, filter_class='but', order=4, freq=[600, 900], filter_type='bp', direction='twopass', parallel=True, rectify=True, chan_per_worker=1)

[screenshot of the resulting error traceback]

Expected behavior
I would have expected it to guess the partition itself with those settings, but ACME has no way to do that without a mem_per_job being supplied.
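For illustration, this is roughly the call I believe ends up failing (a sketch only; the keyword names are taken from the description above, and this is not Syncopy's actual source):

from acme import ParallelMap

pmap = ParallelMap(
    compute_function,      # hypothetical stand-in for the Syncopy compute function
    data,                  # hypothetical input data
    partition="auto",      # Syncopy-side default
    mem_per_job="auto",    # Syncopy-side default; this combination triggers the crash
)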

System Profile:

  • OS: ESI cluster

  • Version info:

     >>> SyNCopy v. 2022.8.1a0 <<<

     Created: Mon Oct 31 10:08:12 2022

     Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
     ACME: 0.21
     Dask: 2021.10.0
     NumPy: 1.21.5
     SciPy: 1.7.3

@tensionhead
Contributor

tensionhead commented Oct 31, 2022

Thanks for filing @KatharineShapcott!
As you point out yourself right away, this seems to be a problem/bug with ACME itself. Please reach out to @pantaray directly; it would probably be best to re-file this issue at the acme repository.

EDIT: Another user mentioned in #369 that ACME crashes with the (quite greedy) setting chan_per_worker=1; maybe try something more lenient like chan_per_worker=10?

@KatharineShapcott
Contributor Author

Hi @tensionhead, I think the problem is that the settings Syncopy passes to ACME do not exist, right? There is no partition='auto' setting.

@tensionhead
Contributor

Mmh, that is apparently Syncopy's default internal behaviour, and it has never caused any problems before, afaik. We did not change anything in this regard. Again, I think @pantaray is best suited to address this problem. If the ACME API somehow changed (and we missed that entirely in our tests for some reason), then we need to take action.

@pantaray
Member

Hi all! ACME version 2022.8 changed the default behavior when using partition="auto": a memory-estimation dry-run is launched to determine the "best" partition for the workload at hand. However, according to your system profile you're using ACME 0.21, so the "auto" setting should just fall back to "8GBXS" (which may or may not be a good choice for your data).
A simple workaround is to first launch a dask client manually and then perform the computation:

spyClient = spy.esi_cluster_setup(partition="8GBL", n_jobs=10)
filtered = spy.preprocessing(out, filter_class='but', order = 4, freq=[600, 900], filter_type='bp', direction = 'twopass', rectify = True, chan_per_worker=1)

With this approach the parallel=True setting is not necessary, since Syncopy picks up any running distributed computing clients automatically.
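Once all computations are finished, the manually started SLURM jobs can be released again; a minimal sketch, using the cleanup call that also appears later in this thread:

spy.cluster_cleanup()    # releases any dangling clients/clusters started via esi_cluster_setup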

@KatharineShapcott
Contributor Author

Actually, I get this behaviour because the spyClient fails to find any jobs and then doesn't exist. But it's interesting that parallel=True isn't necessary anymore; that would save me from having this problem.

It sounds like syncopy depends on ACME 2022.8, should I upgrade?

@pantaray
Member

Actually I get this behaviour because the spyClient fails to find any jobs and then doesn't exist.

This is caused by a dask bug which can be avoided by manually pinning click. In your environment, please try running conda install "click < 8.1"

It sounds like syncopy depends on ACME 2022.8, should I upgrade?

Probably not, since the new options in ACME 2022.8 are not yet ported to Syncopy.

@KatharineShapcott
Contributor Author

I don't think it's a bug; it happens if the cluster is too full and no jobs can be allocated in the available time. My version of click is 8.0.4, so I think I should be fine, right?

@pantaray
Member

Ah, got it.

My version of click is click 8.0.4 so I think I should be fine right?

Yes, that's right. If the cluster is busy you can try playing around with n_jobs_startup and timeout in esi_cluster_setup to get at least a few jobs started for the computation (the scheduler is smart enough to integrate new jobs into the computing client as they come online).
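A sketch of what that could look like (the n_jobs_startup and timeout keywords are taken from the comment above; their exact semantics and defaults are assumptions):

client = spy.esi_cluster_setup(
    partition="8GBS",     # any partition that fits the data
    n_jobs=10,            # total number of SLURM jobs requested
    n_jobs_startup=2,     # assumed: proceed once at least this many jobs are running
    timeout=180,          # assumed: seconds to wait for jobs to start
)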

@tensionhead
Contributor

So this should not be a problem coming directly from Syncopy? The SPYValueError is hard for me to trace from that output, but if I understood @pantaray correctly, ParallelMap supports partition='auto', also for the newest ACME 2022.08?! The error @KatharineShapcott reported still does not entirely make sense to me, I'm afraid. Is it related to the ESI cluster state?

@pantaray
Member

I'm not sure what the problem is, to be honest. The version of ACME installed in the environment uses partition="8GBXS" as the default setting in esi_cluster_setup. If a Syncopy meta-function is launched with parallel=True, then the parallel_client_detector invokes esi_cluster_setup without specifying a partition, i.e., "8GBXS" should be used:

client = esi_cluster_setup(n_jobs=nTrials, interactive=False)

I'm not sure where the partition="auto" input comes from. Did you maybe have another client running in the background, @KatharineShapcott?

@KatharineShapcott
Contributor Author

Here's a minimal example:

>>> import syncopy as spy
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/dask_jobqueue/core.py:20: FutureWarning: tmpfile is deprecated and will be removed in a future release. Please use dask.utils.tmpfile instead.
  from distributed.utils import tmpfile
>>> spy_filename = '/mnt/hpc_slurm/projects/OWzeronoise/Jove_paper/369/20221024/JVE/002/Jove_paper-369-20221024-JVE-002.spy'
>>> out = spy.load(spy_filename, tag = 'raw')
>>> filtered = spy.preprocessing(out, filter_class='but', order = 4, freq=[600, 9000], filter_type='bp', direction = 'twopass', parallel=True, chan_per_worker=1, rectify=False)
Syncopy <ACME: ParallelMap> WARNING: Cluster node esi-svhpc3 not recognized. Falling back to vanilla SLURM setup allocating one worker and one core per job

SyNCoPy encountered an error in 

<stdin>, line 1 in <module>


--------------------------------------------------------------------------------
Abbreviated traceback:

/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py, line 261 in wrapper_cfg
        return func(*data, *posargs, **cfg)
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py, line 378 in wrapper_select
        res = func(*args, **kwargs)
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py, line 495 in parallel_client_detector
        results = func(*args, **kwargs)
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/preproc/preprocessing.py, line 367 in preprocessing
        filterMethod.compute(
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/computational_routine.py, line 652 in compute
        self.pmap = ParallelMap(self.computeFunction,

Use `import traceback; import sys; traceback.print_tb(sys.last_traceback)` for full error traceback.
SPYValueError: Invalid value of `partition`: 'auto'; expected '48GBS' or 'GPUshort' or '96GB' or '16GBXS' or '24GBXL' or '32GBXS' or '64GBXS' or '8GBDEV' or '8GBVIS' or '16GBS' or '64GBL' or '16GBXL' or '16GBDEV' or '16GBXXL' or '24GBS' or '48GBXL' or '24GBVIS' or 'ESI*' or '48GBL' or '16GBL' or '96GBXL' or '8GBXXL' or '24GBXS' or '64GB' or '16GB' or '32GBL' or '8GBS' or '96GBL' or 'GPUtest' or 'E880' or '32GBXXL' or '64GBXL' or 'PREPO' or '96GBXXL' or '64GBS' or '16GBVIS' or '24GBL' or '32GBS' or '24GBXXL' or 'GPUlong' or '24GB' or '32GB' or '48GBXS' or '48GB' or '64GBVIS' or '96GBS' or '8GB' or '96GBXS' or '8GBXS' or '32GBVIS' or '48GBXXL' or '8GBXL' or '48GBVIS' or '8GBL' or '32GBXL' or '96GBVIS' 
>>> spy.cluster_cleanup()
Syncopy </cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/dask_helpers.py> WARNING: No dangling clients or clusters found.
>>> 

@KatharineShapcott
Contributor Author

It seems to happen if parallel=True is set but there's no pre-existing esi_cluster_setup client.

@KatharineShapcott
Contributor Author

I thought the culprit was this:

@tensionhead
Contributor

tensionhead commented Oct 31, 2022

Mmh, so that's probably why we are all confused.. this particular line 657 hasn't changed in ages, and according to @pantaray the keyword value 'auto' was and still is valid for ACME. I also cannot reproduce this directly on my local machine, which should be possible if it were a simple ValueError, hinting that it has something to do with the ESI cluster integration?! @KatharineShapcott could you post the full traceback? There are instructions printed right within the Python error message.. even though it says SPYValueError I really think the error originates from within ACME.. thx!

EDIT: as you can see here, partition='auto' is even the default for the ParallelMap constructor 🤷‍♂️

@KatharineShapcott
Contributor Author

But do any of your tests use parallel=True without having first run esi_cluster_setup?

@KatharineShapcott
Contributor Author

import traceback; import sys; traceback.print_tb(sys.last_traceback)
  File "<stdin>", line 1, in <module>
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py", line 261, in wrapper_cfg
    return func(*data, *posargs, **cfg)
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py", line 378, in wrapper_select
    res = func(*args, **kwargs)
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py", line 495, in parallel_client_detector
    results = func(*args, **kwargs)
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/preproc/preprocessing.py", line 367, in preprocessing
    filterMethod.compute(
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/computational_routine.py", line 652, in compute
    self.pmap = ParallelMap(self.computeFunction,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/frontend.py", line 483, in __init__
    self.daemon = ACMEdaemon(self,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/backend.py", line 179, in __init__
    self.prepare_client(n_jobs=n_jobs,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/backend.py", line 441, in prepare_client
    self.client = slurm_cluster_setup(partition=partition,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/dask_helpers.py", line 320, in slurm_cluster_setup
    raise customValueError(legal=lgl if isSpyModule else msg.format(funcName, str(partition), lgl),

@tensionhead
Contributor

The traceback points into ACME internals.. would you mind taking this discussion to the respective acme issue?
Regarding the local test I did: I only used parallel=True, which spawns a local dask cluster.. and there partition='auto' is apparently no problem.

@tensionhead
Contributor

But do any of your tests use parallel=True without having first run esi_cluster_setup?

Our automated tests don't call esi_cluster_setup explicitly, afaik.

@tensionhead closed this as not planned on Oct 31, 2022
@pantaray
Member

Thanks for posting the full traceback, @KatharineShapcott! Now I understand what's going on. The key message is:

Syncopy <ACME: ParallelMap> WARNING: Cluster node esi-svhpc3 not recognized. Falling back to vanilla SLURM setup allocating one worker and one core per job

This is a bug in this version of ACME that has been fixed since, cf. esi-neuroscience/acme#42

You could try to simply update ACME using conda's --no-deps flag, i.e., conda update --no-deps esi-acme (otherwise conda will complain that syncopy is not compatible with the newer ACME version)

@pantaray
Member

But do any of your tests use parallel=True without having first run esi_cluster_setup?

our automated tests don't call esi_cluster_setup explicitly afaik

In the test setup (before any tests are actually run), esi_cluster_setup is called to allocate a client:

cluster = esi_cluster_setup(partition="8GB", n_jobs=10,

And yes, for local clusters, the partition keyword has no effect.

@KatharineShapcott
Contributor Author

Great, thanks, that makes sense now! I fixed it :)
