I am trying to run some MPI jobs on my CitC cluster, but the available MPI modules differ between the management and compute nodes. I compiled my code on the shared filesystem while logged in to the management node, using the module mpi/openmpi-x86_64. However, when I then tried to load that module on the compute nodes (as part of my job script), it was reported as not existing.
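For context, the job script does essentially the following (job name, sizing and binary path are just placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=mpi-test        # placeholder name
#SBATCH --nodes=2                  # placeholder sizing
#SBATCH --ntasks-per-node=2

# This is the module the code was compiled against on the management node;
# on the compute nodes `module load` reports that it does not exist.
module load mpi/openmpi-x86_64

mpirun ./my_mpi_app                # placeholder binary on the shared filesystem
```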
Below are the available modules listed on the management node (only one of them overlaps with what the compute nodes offer):
Thanks for reporting that. It's an odd one, because we use the OS package manager to install the MPIs, and we don't specify a version, just the package name (here's the commit that does that: clusterinthecloud/ansible@3408460).
One difference between the management and compute nodes is that the *-devel packages are installed on the management node and only the runtime ones on the compute nodes. That could be where the divergence is coming in.
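A quick way to confirm that, assuming the usual EL-style packaging where one of the openmpi RPMs ships the environment-module file, would be to run the following on both a management and a compute node and compare the output:

```bash
# Compare which OpenMPI/MPICH packages are installed on each node type
rpm -qa 'openmpi*' 'mpich*'

# See which modulefiles each node actually exposes
module avail

# If the management node shows mpi/openmpi-x86_64, check which RPM owns its
# modulefile (path assumed from the module name; adjust to what module avail shows)
rpm -qf /etc/modulefiles/mpi/openmpi-x86_64
```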
Since Oracle releases regular updates to the OS images, and these are configured in both the Terraform (for mgmt) and Ansible (for compute) repos, there is a chance they might get out of sync. That isn't the case here, though: they're currently both on the Feb 20 release:
We're still looking into this. One option would be to install a non-OS MPI package with EasyBuild or Spack - I'll see if I can come up with a tested workaround.
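Untested, but the Spack route would look roughly like this, built on the shared filesystem so both node types can see it (the install path is just an example):

```bash
# Clone Spack somewhere on the shared filesystem so both node types see it
git clone https://github.com/spack/spack.git /mnt/shared/spack
. /mnt/shared/spack/share/spack/setup-env.sh

# Build a user-space OpenMPI that does not depend on the OS packages
spack install openmpi

# Then, both when compiling and in the job script:
. /mnt/shared/spack/share/spack/setup-env.sh
spack load openmpi
```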
For "newer" MPIs we need a rebuild of the Slurm packages with PMIx support (clusterinthecloud/ansible#24). @milliams is working on this (clusterinthecloud/ansible#27) however he is away this week and I think because of the change we reverted in #17 we can't just point at the slurm18 branch in terrafrom.tfvars