Crash from ParallelMap when both partition and memory are auto (default) #370

Closed
KatharineShapcott opened this issue Oct 31, 2022 · 21 comments


@KatharineShapcott
Contributor

Describe the bug
In computational_routine.compute, the default values of partition and mem_per_job are both 'auto', which causes a crash in ACME.

To Reproduce
Steps to reproduce the behavior:

  1. Load any dataset
  2. Perform some preprocessing on it with parallel=True, e.g. filtered = spy.preprocessing(out, filter_class='but', order=4, freq=[600, 900], filter_type='bp', direction='twopass', parallel=True, rectify=True, chan_per_worker=1)

[screenshot of the resulting error traceback]

Expected behavior
I would have expected it to guess the partition itself with those settings, but ACME has no way to do that without a mem_per_job being supplied.
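For illustration, this is roughly the call I believe ends up failing (a sketch only; the keyword names are taken from the description above, and this is not Syncopy's actual source):

from acme import ParallelMap

pmap = ParallelMap(
    compute_function,      # hypothetical stand-in for the Syncopy compute function
    data,                  # hypothetical input data
    partition="auto",      # Syncopy-side default
    mem_per_job="auto",    # Syncopy-side default; this combination triggers the crash
)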

System Profile:

  • OS: ESI cluster

  • Version info:

     >>> SyNCopy v. 2022.8.1a0 <<<

     Created: Mon Oct 31 10:08:12 2022

     Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
     ACME: 0.21
     Dask: 2021.10.0
     NumPy: 1.21.5
     SciPy: 1.7.3

@tensionhead
Contributor

tensionhead commented Oct 31, 2022

Thanks for filing @KatharineShapcott!
As you point out yourself right away, this seems to be a problem/bug with ACME itself. Please reach out to @pantaray directly; it would probably be best to re-file this issue at the acme repository.

EDIT: Another user mentioned in #369 that ACME crashes with the (quite greedy) setting chan_per_worker=1; maybe try something more lenient like chan_per_worker=10?

@KatharineShapcott
Contributor Author

Hi @tensionhead, I think the problem is that the settings Syncopy passes to ACME do not exist, right? There is no partition='auto' setting.

@tensionhead
Contributor

Mmh, that is apparently Syncopy's default internal behaviour, and it has never caused any problems before, afaik. We did not change anything in this regard. Again, I think @pantaray is best suited to address this problem. If the ACME API somehow changed (and we missed that entirely in our tests for some reason), then we need to take action.

@pantaray
Member

Hi all! ACME version 2022.8 changed the default behavior when using partition="auto": a memory-estimation dry-run is launched to determine the "best" partition for the workload at hand. However, according to your system profile you're using ACME 0.21, so the "auto" setting should just fall back to "8GBXS" (which may or may not be a good choice for your data).
A simple workaround is to first launch a dask client manually and then perform the computation:

spyClient = spy.esi_cluster_setup(partition="8GBL", n_jobs=10)
filtered = spy.preprocessing(out, filter_class='but', order = 4, freq=[600, 900], filter_type='bp', direction = 'twopass', rectify = True, chan_per_worker=1)

With this approach the parallel=True setting is not necessary, since Syncopy picks up any running distributed computing clients automatically.
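Once all computations are finished, the manually started SLURM jobs can be released again; a minimal sketch, using the cleanup call that also appears later in this thread:

spy.cluster_cleanup()    # releases any dangling clients/clusters started via esi_cluster_setup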

@KatharineShapcott
Contributor Author

Actually, I get this behaviour because the spyClient fails to find any jobs and then doesn't exist. But it's interesting that parallel=True isn't necessary anymore; that would save me from having this problem.

It sounds like syncopy depends on ACME 2022.8, should I upgrade?

@pantaray
Member

Actually I get this behaviour because the spyClient fails to find any jobs and then doesn't exist.

This is caused by a dask bug which can be avoided by manually pinning click. In your environment, please try running conda install "click < 8.1"

It sounds like syncopy depends on ACME 2022.8, should I upgrade?

Probably not, since the new options in ACME 2022.8 are not yet ported to Syncopy.

@KatharineShapcott
Contributor Author

I don't think it's a bug; it happens if the cluster is too full and no jobs can be allocated in the available time. My version of click is 8.0.4, so I think I should be fine, right?

@pantaray
Member

Ah, got it.

My version of click is click 8.0.4 so I think I should be fine right?

Yes, that's right. If the cluster is busy you can try playing around with n_jobs_startup and timeout in esi_cluster_setup to get at least a few jobs started for the computation (the scheduler is smart enough to integrate new jobs into the computing client as they come online).
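A sketch of what that could look like (the n_jobs_startup and timeout keywords are taken from the comment above; their exact semantics and defaults are assumptions):

client = spy.esi_cluster_setup(
    partition="8GBS",     # any partition that fits the data
    n_jobs=10,            # total number of SLURM jobs requested
    n_jobs_startup=2,     # assumed: proceed once at least this many jobs are running
    timeout=180,          # assumed: seconds to wait for jobs to start
)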

@tensionhead
Contributor

So this should not be a problem coming directly from Syncopy? The SPYValueError is hard for me to trace from that output, but if I understood @pantaray correctly, ParallelMap supports partition='auto', also for the newest ACME 2022.08?! The error @KatharineShapcott reported still does not entirely make sense to me, I'm afraid. Is it related to the ESI cluster state?

@pantaray
Member

I'm not sure what the problem is, to be honest. The version of ACME installed in the environment uses partition="8GBXS" as the default setting in esi_cluster_setup. If a Syncopy meta-function is launched with parallel=True, then the parallel_client_detector invokes esi_cluster_setup without specifying a partition, i.e., "8GBXS" should be used:

client = esi_cluster_setup(n_jobs=nTrials, interactive=False)

I'm not sure where the partition="auto" input comes from. Did you maybe have another client running in the background, @KatharineShapcott?

@KatharineShapcott
Contributor Author

Here's a minimal example:

>>> import syncopy as spy
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/dask_jobqueue/core.py:20: FutureWarning: tmpfile is deprecated and will be removed in a future release. Please use dask.utils.tmpfile instead.
  from distributed.utils import tmpfile
>>> spy_filename = '/mnt/hpc_slurm/projects/OWzeronoise/Jove_paper/369/20221024/JVE/002/Jove_paper-369-20221024-JVE-002.spy'
>>> out = spy.load(spy_filename, tag = 'raw')
>>> filtered = spy.preprocessing(out, filter_class='but', order = 4, freq=[600, 9000], filter_type='bp', direction = 'twopass', parallel=True, chan_per_worker=1, rectify=False)
Syncopy <ACME: ParallelMap> WARNING: Cluster node esi-svhpc3 not recognized. Falling back to vanilla SLURM setup allocating one worker and one core per job

SyNCoPy encountered an error in 

<stdin>, line 1 in <module>


--------------------------------------------------------------------------------
Abbreviated traceback:

/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py, line 261 in wrapper_cfg
        return func(*data, *posargs, **cfg)
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py, line 378 in wrapper_select
        res = func(*args, **kwargs)
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py, line 495 in parallel_client_detector
        results = func(*args, **kwargs)
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/preproc/preprocessing.py, line 367 in preprocessing
        filterMethod.compute(
/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/computational_routine.py, line 652 in compute
        self.pmap = ParallelMap(self.computeFunction,

Use `import traceback; import sys; traceback.print_tb(sys.last_traceback)` for full error traceback.
SPYValueError: Invalid value of `partition`: 'auto'; expected '48GBS' or 'GPUshort' or '96GB' or '16GBXS' or '24GBXL' or '32GBXS' or '64GBXS' or '8GBDEV' or '8GBVIS' or '16GBS' or '64GBL' or '16GBXL' or '16GBDEV' or '16GBXXL' or '24GBS' or '48GBXL' or '24GBVIS' or 'ESI*' or '48GBL' or '16GBL' or '96GBXL' or '8GBXXL' or '24GBXS' or '64GB' or '16GB' or '32GBL' or '8GBS' or '96GBL' or 'GPUtest' or 'E880' or '32GBXXL' or '64GBXL' or 'PREPO' or '96GBXXL' or '64GBS' or '16GBVIS' or '24GBL' or '32GBS' or '24GBXXL' or 'GPUlong' or '24GB' or '32GB' or '48GBXS' or '48GB' or '64GBVIS' or '96GBS' or '8GB' or '96GBXS' or '8GBXS' or '32GBVIS' or '48GBXXL' or '8GBXL' or '48GBVIS' or '8GBL' or '32GBXL' or '96GBVIS' 
>>> spy.cluster_cleanup()
Syncopy </cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/dask_helpers.py> WARNING: No dangling clients or clusters found.
>>> 

@KatharineShapcott
Contributor Author

It seems to happen if parallel=True is set but there's no pre-existing esi_cluster_setup client.

@KatharineShapcott
Contributor Author

I thought the culprit was this:

@tensionhead
Contributor

tensionhead commented Oct 31, 2022

Mmh, so that's probably why we are all confused.. this particular line 657 hasn't changed in ages, and according to @pantaray the keyword value 'auto' was and still is valid for ACME. I also cannot reproduce this directly on my local machine, which should be possible if it were a simple ValueError, hinting that it has something to do with the ESI cluster integration?! @KatharineShapcott could you post the full traceback? There are instructions printed right within the Python error message.. even though it says SPYValueError I really think the error originates from within ACME.. thx!

EDIT: as you can see here, partition='auto' is even the default for the ParallelMap constructor 🤷‍♂️

@KatharineShapcott
Contributor Author

But do any of your tests use parallel=True without having first run esi_cluster_setup?

@KatharineShapcott
Contributor Author

import traceback; import sys; traceback.print_tb(sys.last_traceback)
  File "<stdin>", line 1, in <module>
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py", line 261, in wrapper_cfg
    return func(*data, *posargs, **cfg)
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py", line 378, in wrapper_select
    res = func(*args, **kwargs)
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/kwarg_decorators.py", line 495, in parallel_client_detector
    results = func(*args, **kwargs)
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/preproc/preprocessing.py", line 367, in preprocessing
    filterMethod.compute(
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/syncopy/shared/computational_routine.py", line 652, in compute
    self.pmap = ParallelMap(self.computeFunction,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/frontend.py", line 483, in __init__
    self.daemon = ACMEdaemon(self,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/backend.py", line 179, in __init__
    self.prepare_client(n_jobs=n_jobs,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/backend.py", line 441, in prepare_client
    self.client = slurm_cluster_setup(partition=partition,
  File "/cs/departmentN5/conda_envs/oesyncopy/lib/python3.8/site-packages/acme/dask_helpers.py", line 320, in slurm_cluster_setup
    raise customValueError(legal=lgl if isSpyModule else msg.format(funcName, str(partition), lgl),

@tensionhead
Contributor

The traceback points into ACME internals.. would you mind taking this discussion to the respective acme issue?
Regarding the local test I did: I only used parallel=True, which spawns a local dask cluster.. and there partition='auto' is apparently no problem.

@tensionhead
Contributor

But do any of your tests use parallel=True without having first run esi_cluster_setup?

Our automated tests don't call esi_cluster_setup explicitly, afaik.

@tensionhead closed this as not planned on Oct 31, 2022
@pantaray
Member

Thanks for posting the full traceback, @KatharineShapcott! Now I understand what's going on. The key message is:

Syncopy <ACME: ParallelMap> WARNING: Cluster node esi-svhpc3 not recognized. Falling back to vanilla SLURM setup allocating one worker and one core per job

This is a bug in this version of ACME that has been fixed since, cf. esi-neuroscience/acme#42

You could try to simply update ACME using conda's --no-deps flag, i.e., conda update --no-deps esi-acme (otherwise conda will complain that syncopy is not compatible with the newer ACME version)

@pantaray
Member

But do any of your tests use parallel=True without having first run esi_cluster_setup?

our automated tests don't call esi_cluster_setup explicitly afaik

In the test setup (before any tests are actually run), esi_cluster_setup is called to allocate a client:

cluster = esi_cluster_setup(partition="8GB", n_jobs=10,

And yes, for local clusters, the partition keyword has no effect.

@KatharineShapcott
Contributor Author

Great, thanks, that makes sense now! I fixed it :)
