
Available modules differ between management and compute nodes #18

joeiznogood opened this issue Apr 24, 2019 · 1 comment
@joeiznogood

I am trying to run some MPI jobs on my CitC. However, the available MPI modules are different on the management and compute nodes. I have compiled my code on the shared filesystem logged in on the management node, using the module mpi/openmpi-x86_64. However, when I then tried to load it on the compute nodes (as part of my job script), it told me that it did not exist.

Below are the listed available modules on the management node:

[joealex@mgmt run_test]$ module avail

--------------------------------------------------------- /usr/share/Modules/modulefiles ---------------------------------------------------------
dot module-git module-info modules null use.own

---------------------------------------------------------------- /etc/modulefiles ----------------------------------------------------------------
mpi/mpich-3.2-x86_64 mpi/openmpi3-x86_64 mpi/openmpi-x86_64

And on the compute node:

[opc@vm-standard2-2-ad1-0001 ~]$ module avail

--------------------------------------------------------- /usr/share/Modules/modulefiles ---------------------------------------------------------
dot module-git module-info modules null use.own

---------------------------------------------------------------- /etc/modulefiles ----------------------------------------------------------------
mpi/mpich-3.0-x86_64 mpi/mpich-x86_64 mpi/openmpi3-x86_64

There is only one that overlaps: mpi/openmpi3-x86_64.
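For reference, the overlap can be checked mechanically by capturing the two module lists and diffing them (file names here are illustrative; `module avail` writes to stderr, hence `2>` on the real nodes):

```shell
# On the management node:  module avail 2> mgmt_modules.txt
# On a compute node:       module avail 2> compute_modules.txt
# The MPI lines from the listings above, one per line and sorted:
printf '%s\n' mpi/mpich-3.2-x86_64 mpi/openmpi-x86_64 mpi/openmpi3-x86_64 | sort > mgmt_modules.txt
printf '%s\n' mpi/mpich-3.0-x86_64 mpi/mpich-x86_64 mpi/openmpi3-x86_64 | sort > compute_modules.txt
# comm -12 prints only the lines common to both files (inputs must be sorted)
comm -12 mgmt_modules.txt compute_modules.txt
# -> mpi/openmpi3-x86_64
```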

@christopheredsall christopheredsall self-assigned this Apr 24, 2019
@christopheredsall christopheredsall added the bug Something isn't working label Apr 24, 2019
@christopheredsall (Contributor)

Hi Joe,

Thanks for reporting that. It's an odd one, because we use the OS package manager to install the MPIs and we don't specify a version, just the name (here's the commit that does that: clusterinthecloud/ansible@3408460).

One difference between the management and compute nodes is that the *-devel packages are installed on the management node, while only the runtime packages are installed on the compute nodes. That could be where the divergence is coming from.
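If that's the cause, comparing the installed RPMs on each node should show it. A diagnostic sketch (untested here; run on each node, and note the glob patterns assume the stock OL7 openmpi/mpich package names):

```shell
# Which MPI packages are installed on this node?
rpm -qa 'openmpi*' 'mpich*' | sort
# Which package owns each modulefile that `module avail` is picking up?
for f in /etc/modulefiles/mpi/*; do rpm -qf "$f"; done
```

If the modulefiles turn out to be owned by different packages on the two node types, that would explain the differing `module avail` output.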

Since Oracle releases regular updates to the OS images, and these are configured separately in the Terraform (for mgmt) and Ansible (for compute) repos, there is a chance they could get out of sync. That isn't the case here, though; they're currently both on the Feb 20 release:

Node Image Source
mgmt Oracle-Linux-7.6-2019.02.20-0 https://github.com/ACRC/oci-cluster-terraform/blob/827d73d5f4ef3ae6d7d6e4f071a6d6f20cb1d7d7/variables.tf#L36-L41
compute Oracle-Linux-7.6-2019.02.20-0 https://github.com/ACRC/slurm-ansible-playbook/blob/71947683f6a1a0da3d299fe65ea3200665589247/roles/slurm/files/citc_oci.py#L141-L146

We're still looking into this. One option would be to install a non-OS MPI package with EasyBuild or Spack; I'll see if I can come up with a tested workaround.
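As a rough sketch of the Spack route (untested here; the idea is to build once on the shared filesystem so mgmt and compute nodes see the same MPI, independent of the OS packages):

```shell
# Get Spack onto the shared filesystem and activate it in this shell
git clone https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh
# Build an MPI that doesn't depend on the OS-packaged modulefiles
spack install openmpi
# Job scripts would then load it the same way on every node
spack load openmpi
```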

For "newer" MPIs we need a rebuild of the Slurm packages with PMIx support (clusterinthecloud/ansible#24). @milliams is working on this (clusterinthecloud/ansible#27); however, he is away this week, and I think that because of the change we reverted in #17 we can't just point at the slurm18 branch in terraform.tfvars.
